Oncology Clinical Trials
Wm. Kevin Kelly, DO
Associate Professor of Medicine and Surgery
Yale School of Medicine
Director of the Clinical Research Services
Co-director of Prostate and Urologic Malignancy
Yale Comprehensive Cancer Center
New Haven, Connecticut

Susan Halabi, PhD
Associate Professor
Department of Biostatistics and Bioinformatics
Duke University Medical Center
Duke University
Durham, North Carolina
New York
Acquisitions Editor: Richard Winters
Cover Design: Steve Pisano
Compositor: Publications Services Inc.
Printer: King Printing
Visit our website at www.demosmedpub.com
© 2010 Demos Medical Publishing, LLC. All rights reserved. This book is protected by copyright. No part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Medicine is an ever-changing science. Research and clinical experience are continually expanding our knowledge, in particular our understanding of proper treatment and drug therapy. The authors, editors, and publisher have made every effort to ensure that all information in this book is in accordance with the state of knowledge at the time of production of the book. Nevertheless, the authors, editors, and publisher are not responsible for errors or omissions or for any consequences from application of the information in this book and make no warranty, express or implied, with respect to the contents of the publication. Every reader should examine carefully the package inserts accompanying each drug and should carefully check whether the dosage schedules mentioned therein or the contraindications stated by the manufacturer differ from the statements made in this book. Such examination is particularly important with drugs that are either rarely used or have been newly released on the market.
The views expressed in this book are solely those of the contributors and do not represent those of the organizations or universities with which the authors are affiliated. In addition, the authors accept all responsibility for any errors or omissions in this work.
Library of Congress Cataloging-in-Publication Data
Oncology clinical trials / [edited by] William Kevin Kelly, Susan Halabi.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-1-933864-38-9 (hardcover)
1. Cancer—Research—Statistical methods. 2. Clinical trials.
I. Kelly, William Kevin. II. Halabi, Susan.
[DNLM: 1. Neoplasms—drug therapy. 2. Clinical Trials as Topic. QZ 267 O578 2010]
RC267.O53 2010
362.196’994061—dc22
2009044203
Special discounts on bulk quantities of Demos Medical Publishing books are available to corporations, professional associations, pharmaceutical companies, health care organizations, and other qualifying groups. For details, please contact:
Special Sales Department
Demos Medical Publishing
11 W. 42nd Street, 15th Floor
New York, NY 10036
Phone: 800–532–8663 or 212–683–0072
Fax: 212–941–7842
E-mail: [email protected]
Made in the United States of America 09 10 11 12 13 54321
Contents

Foreword
Preface
Contributors

1. Introduction: What Is a Clinical Trial?
   Susan Halabi and Wm. Kevin Kelly

2. Historical Perspectives of Oncology Clinical Trials
   Ada H. Braun and David M. Reese

3. Ethical Principles Guiding Clinical Research
   Sandra L. Alfano

4. Preclinical Drug Assessment
   Cindy H. Chau and William Douglas Figg

5. Formulating the Question and Objectives
   Lauren C. Harshman, Sandy Srinivas, James Thomas Symanowski, and Nicholas J. Vogelzang

6. Choice of Endpoints in Cancer Clinical Trials
   Wenting Wu and Daniel Sargent

7. Design, Testing, and Estimation in Clinical Trials
   Barry Kurt Moser

8. Design of Phase I Trials
   Anastasia Ivanova and Leslie A. Bunce

9. Design of Phase II Trials
   Hongkun Wang, Mark R. Conaway, and Gina R. Petroni

10. Randomization
    Susan Groshen

11. Design of Phase III Trials
    Stephen L. George

12. Multiple Treatment Arm Trials
    Susan Halabi

13. Noninferiority Trials in Oncology
    Suzanne E. Dahlberg and Robert J. Gray

14. Bayesian Designs in Clinical Trials
    Gary L. Rosner and B. Nebiyou Bekele

15. The Trials and Tribulations of Writing an Investigator-Initiated Clinical Study
    Nicole P. Grant, Melody J. Sacatos, and Wm. Kevin Kelly

16. Data Collection
    Eleanor H. Leung

17. Reporting of Adverse Events
    Carla Kurkjian and Helen X. Chen

18. Toxicity Monitoring: Why, What, When?
    A. Dimitrios Colevas

19. Interim Analysis of Phase III Trials
    Edward L. Korn and Boris Friedlin

20. Interpretation of Results: Data Analysis and Reporting of Results
    Donna Niedzwiecki and Donna Hollis

21. Statistical Considerations for Assessing Prognostic Factors in Cancer
    Susan Halabi

22. Pitfalls in Oncology Clinical Trial Designs and Analysis
    Stephanie Green

23. Biomarkers and Surrogate Endpoints in Clinical Trials
    Marc Buyse and Stefan Michiels

24. Use of Genomics in Therapeutic Clinical Trials
    Richard Simon

25. Imaging in Clinical Trials
    Binsheng Zhao and Lawrence H. Schwartz

26. Pharmacokinetic and Pharmacodynamic Monitoring in Clinical Trials: When Is It Needed?
    Ticiana Leal and Jill M. Kolesar

27. Practical Design and Analysis Issues of Health-Related Quality of Life Studies in International Randomized Controlled Cancer Clinical Trials
    Andrew Bottomley, Corneel Coens, and Murielle Mauer

28. Clinical Trials Considerations in Special Populations
    Susan Burdette-Radoux and Hyman Muss

29. A Critical Reader’s Guide to Cost-Effectiveness Analysis
    Greg Samsa

30. Systematic Review and Meta-Analysis
    Steven M. Brunelli, Angela DeMichele, James Guevara, and Jesse A. Berlin

31. Regulatory Affairs: The Investigator-Initiated Oncology Trial
    Maria Mézes and Harvey M. Arbit

32. The Drug Evaluation Process in Oncology: FDA Perspective
    Steven Lemery, Patricia Keegan, and Richard Pazdur

33. Industry Collaboration in Cancer Clinical Trials
    Linda Bressler

34. Defining the Roles and Responsibilities of Study Personnel
    Fred De Pourcq

35. Writing a Consent Form
    Christine Grady

36. How Cooperative Groups Function
    Edward L. Trimble and Alison Martin

37. Adaptive Clinical Trial Design in Oncology
    Elizabeth Garrett-Mayer

38. Where Do We Need to Go with Clinical Trials in Oncology?
    Andrea L. Harzstark and Eric J. Small

Index
Foreword
Clinical trials are the engine of progress in the development of new drugs and devices for the detection, monitoring, prevention and treatment of cancer. A well conceived, carefully designed and efficiently conducted clinical trial can produce results that change clinical practice overnight, deliver new oncology drugs and diagnostics to the marketplace, and expand the horizon of contemporary thinking about cancer biology. A poorly done trial does little to advance the field or guide clinical practice, consumes precious clinical and financial resources and challenges the validity of the ethical contract between investigators and the volunteers who willingly give their time and effort to benefit future patients.

Critical elements of clinical trials are those that address the scientific, ethical, technical and regulatory aspects of human subjects research. What are we trying to learn? What clinical trial design will best address our objectives? How do we identify, recruit and protect the population of interest? How do we assure the quality of our data and the validity of our results? What regulatory standards must be met to obtain marketing approval for the drug or device we are studying? How do we maximize the information obtained from every trial so as to have the greatest impact on advancing science and improving care?

Richard L. Schilsky, MD
Professor of Medicine
University of Chicago
Chairman, Cancer and Leukemia Group B
ASCO President 2008–2009
Preface
In 2006, a group of colleagues were discussing the day’s events at ASCO when the topic turned to people who were retiring at the end of the year or changing career directions. Many of us were initially taken aback, yet also excited, to learn that some of our longtime colleagues were turning over a new leaf in their lives. These individuals were giants in the field who had not only helped shape modern oncology but had also long and generously provided guidance to us. Many had been our mentors, and we always felt that they would be there to ask: “How did you design this pivotal trial in colon cancer? Why did you use this endpoint for this lung trial? Can we really use this as a surrogate marker?” They were always a voice of reason, reflecting years of experience in clinical research, and they played a vital role in our own scientific and intellectual development. Having the benefit of their experience underscored the importance, and declining prevalence, of mentoring the next generation of clinical trialists. Many of us have now inherited these perilous challenges.
This book is a collaborative effort based on the knowledge and expertise of leading oncologists, statisticians, and clinical trial professionals from academia, industry, and government who have years of experience in designing, conducting, analyzing, and reporting clinical trials in cancer. It was created to allow these seasoned investigators to pass on their knowledge to those who are entering the field. In so doing, our mission is to enhance the successful design, development, management, and analysis of oncology clinical trials for the next generation. While this book focuses on oncology clinical trials, the fundamental concepts and basic principles are applicable to trials in many medical disciplines. We hope that this work will aid junior investigators in their academic, industry, or government careers and improve the quality of clinical trials. In so doing, their discoveries can be quickly and efficiently translated into improved patient outcomes and future care.

Wm. Kevin Kelly
Susan Halabi
Contributors
Sandra L. Alfano, PharmD
Chair, Human Investigational Committee
Associate Research Scientist
Department of Internal Medicine
Yale University School of Medicine
New Haven, Connecticut

Andrew Bottomley, PhD
Assistant Director
Head, Quality of Life Department
European Organization for Research and Treatment of Cancer
Brussels, Belgium

Harvey M. Arbit, BS Pharm, PharmD, MBA
Director, IND/IDE Assistance Program
Clinical and Translational Science Institute
Academic Health Center
University of Minnesota
Minneapolis, Minnesota

Ada H. Braun, MD, PhD
Clinical Research Medical Director
Department of Hematology/Oncology
Amgen, Inc.
Thousand Oaks, California

B. Nebiyou Bekele, PhD
Associate Professor
Department of Biostatistics
The University of Texas MD Anderson Cancer Center
Houston, Texas

Jesse A. Berlin, ScD
Vice President
Department of Epidemiology
Johnson & Johnson Pharmaceutical Research and Development
Titusville, New Jersey

Linda Bressler, PharmD
Director of Regulatory Affairs
Cancer and Leukemia Group B
Chicago, Illinois

Steven M. Brunelli, MD, MSCE
Assistant Professor
Associate Physician
Renal Division
Brigham and Women’s Hospital
Harvard Medical School
Boston, Massachusetts
Leslie A. Bunce, MD
Consultant
Leslie A. Bunce, LLC
Chapel Hill, North Carolina

Susan Burdette-Radoux, MD
Associate Professor of Medicine
Hematology/Oncology Unit
Vermont Cancer Center
University of Vermont
Burlington, Vermont

Marc Buyse, ScD
Chairman
International Drug Development Institute
Louvain-la-Neuve, Belgium
Associate Professor of Biostatistics
Hasselt University
Hasselt, Belgium

Cindy H. Chau, PharmD, PhD
Medical Oncology Branch
National Cancer Institute
National Institutes of Health
Bethesda, Maryland

Helen X. Chen, MD
Associate Chief
Investigational Drug Branch
Cancer Therapy Evaluation Program
National Cancer Institute
Bethesda, Maryland

Corneel Coens, MSc
Quality of Life Department
European Organization for Research and Treatment of Cancer
Brussels, Belgium

A. Dimitrios Colevas, MD
Associate Professor
Department of Medicine
Stanford University
Stanford, California

Mark R. Conaway, PhD
Professor
Public Health Services
University of Virginia
Charlottesville, Virginia
Suzanne E. Dahlberg, PhD
Research Scientist
Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute
Harvard School of Public Health
Boston, Massachusetts

Angela DeMichele, MD, MSCE
Associate Professor of Medicine and Epidemiology
Department of Medicine, Biostatistics and Epidemiology
University of Pennsylvania
Philadelphia, Pennsylvania

William Douglas Figg, PharmD, MBA
Medical Oncology Branch
National Cancer Institute
National Institutes of Health
Bethesda, Maryland

Boris Friedlin, PhD
Mathematical Statistician
Biometric Research Branch
Division of Cancer Treatment and Diagnosis
National Cancer Institute
Bethesda, Maryland

Stephen L. George, PhD
Professor of Biostatistics and Bioinformatics
Duke University School of Medicine
Durham, North Carolina

Christine Grady, MSN, PhD
Acting Chief and Head of Section on Human Subjects Research
Department of Bioethics
National Institutes of Health Clinical Center
Bethesda, Maryland

Nicole P. Grant, BS
Director, Protocol Development Office
Yale Cancer Center
New Haven, Connecticut

Robert J. Gray, PhD
Professor of Biostatistics
Department of Biostatistics and Computational Biology
Dana-Farber Cancer Institute
Harvard School of Public Health
Boston, Massachusetts
Stephanie Green, PhD
Senior Director
Clinical Biostatistics
Pfizer, Inc.
New London, Connecticut

Susan Groshen, PhD
Professor
Department of Preventive Medicine
Keck School of Medicine
University of Southern California
Los Angeles, California

James Guevara, MD, MPH
Associate Professor
Department of Pediatrics
University of Pennsylvania School of Medicine
University of Pennsylvania
Philadelphia, Pennsylvania

Susan Halabi, PhD
Associate Professor
Department of Biostatistics and Bioinformatics
Duke University Medical Center
Duke University
Durham, North Carolina

Lauren C. Harshman, MD
Instructor of Medicine
Department of Medicine
Division of Oncology
Stanford University School of Medicine
Stanford, California

Andrea L. Harzstark, MD
Assistant Clinical Professor
Department of Medicine
University of California, San Francisco
San Francisco, California

Donna Hollis, MS
Senior Statistician
Duke Comprehensive Cancer Center
Duke University
Durham, North Carolina

Anastasia Ivanova, PhD
Associate Professor
Department of Biostatistics
The University of North Carolina at Chapel Hill
Chapel Hill, North Carolina

Patricia Keegan, MD
Division Director
Division of Biologic Oncology Products
Office of Oncology Drug Products
Center for Drug Evaluation and Research
Food and Drug Administration
Silver Spring, Maryland

Wm. Kevin Kelly, DO
Associate Professor of Medicine and Surgery
Yale School of Medicine
Director of the Clinical Research Services
Co-director of Prostate and Urologic Malignancy
Yale Comprehensive Cancer Center
New Haven, Connecticut

Jill M. Kolesar, PharmD
Professor
School of Pharmacy
University of Wisconsin-Madison
Madison, Wisconsin

Edward L. Korn, PhD
Mathematical Statistician
Biometric Research Branch
Division of Cancer Treatment and Diagnosis
National Cancer Institute
Bethesda, Maryland

Carla Kurkjian, MD
Assistant Professor
Department of Internal Medicine
Section of Hematology/Oncology
University of Oklahoma Health Sciences Center
Oklahoma City, Oklahoma

Ticiana Leal, MD
Clinical Assistant Professor
Department of Medical Oncology
Carbone Cancer Center
University of Wisconsin
Madison, Wisconsin

Steven Lemery, MD
Medical Officer
OODP/CDER
Food and Drug Administration
Silver Spring, Maryland
Eleanor H. Leung, PhD
Data Coordinator
Cancer and Leukemia Group B Statistical Center
Duke University
Durham, North Carolina

Donna Niedzwiecki, PhD
Assistant Professor
Biostatistics and Bioinformatics
Duke University
Durham, North Carolina

Alison Martin, MD
President and CEO
Melanoma Research Alliance
Washington, District of Columbia

Richard Pazdur, MD
Director
Office of Oncology Drug Products
Food and Drug Administration
Silver Spring, Maryland

Murielle Mauer, PhD
Biostatistician
Quality of Life Department
European Organization for Research and Treatment of Cancer
Brussels, Belgium

Elizabeth Garrett-Mayer, PhD
Associate Professor
Department of Medicine
Division of Biostatistics and Epidemiology
Medical University of South Carolina
Charleston, South Carolina

Maria Mézes, BA, BEd
Director, INDMSU and Connecticut Clinical Trials Network Coordinator
Yale Comprehensive Cancer Center
Yale University School of Medicine
New Haven, Connecticut

Stefan Michiels, PhD
Unit of Biostatistics and Epidemiology
Institut Gustave Roussy
Villejuif, France

Barry Kurt Moser, PhD
Associate Research Professor
Department of Biostatistics and Bioinformatics
Duke University Medical Center
Duke University
Durham, North Carolina

Hyman B. Muss, MD
Professor of Medicine
University of North Carolina
Director of Geriatric Oncology
Lineberger Comprehensive Cancer Center
Chapel Hill, North Carolina

Gina R. Petroni, PhD
Professor
Public Health Services
University of Virginia
Charlottesville, Virginia

Fred De Pourcq, MS
Special Projects Administrator
Yale Center for Clinical Investigations
Yale University
New Haven, Connecticut

David M. Reese, MD
Executive Medical Director
Hematology/Oncology Medical Sciences
Amgen, Inc.
Thousand Oaks, California

Gary L. Rosner, ScD
Professor
Department of Biostatistics
The University of Texas MD Anderson Cancer Center
Houston, Texas

Melody J. Sacatos, BA
Yale Center for Clinical Investigation
Yale University School of Medicine
New Haven, Connecticut

Greg Samsa, PhD
Associate Professor
Department of Biostatistics and Bioinformatics
Duke University School of Medicine
Duke University
Durham, North Carolina
Daniel Sargent, PhD
Professor of Biostatistics
Department of Health Sciences Research
Mayo Clinic
Rochester, Minnesota

Lawrence H. Schwartz, MD
James Picker Professor of Radiology
Chairman, Department of Radiology
Columbia University College of Physicians and Surgeons
New York, New York

Richard Simon, DSc
Chief
Biometric Research Branch
National Cancer Institute
Bethesda, Maryland

Eric J. Small, MD
Professor
Department of Medicine
University of California, San Francisco
San Francisco, California

Sandy Srinivas, MD
Associate Professor of Medicine
Department of Medicine
Division of Oncology
Stanford University School of Medicine
Stanford, California

James Thomas Symanowski, PhD
Head of Biostatistics
Department of Biostatistics
Nevada Cancer Institute
Las Vegas, Nevada

Edward L. Trimble, MD, MPH
Head, Gynecologic Cancer Therapeutics and Quality of Cancer Care Therapeutics
Clinical Investigations Branch
Cancer Therapy Evaluation Program
Division of Cancer Treatment and Diagnosis
National Cancer Institute
Bethesda, Maryland

Nicholas J. Vogelzang, MD
Chair and Medical Director
Developmental Therapies
Department of Medical Oncology
Comprehensive Cancer Center of Nevada
Las Vegas, Nevada

Hongkun Wang, PhD
Assistant Professor
Public Health Services
University of Virginia
Charlottesville, Virginia

Wenting Wu, PhD
Assistant Professor of Biostatistics
Department of Health Sciences Research
Mayo Clinic
Rochester, Minnesota

Binsheng Zhao, DSc
Associate Professor
Department of Radiology
Columbia University College of Physicians and Surgeons
New York, New York
1
Introduction to Clinical Trials
Susan Halabi Wm. Kevin Kelly
“Learn from yesterday, live for today, hope for tomorrow. The important thing is to not stop questioning.”
—Albert Einstein, 1879–1955

The number of cancer cases diagnosed daily continues to increase around the world, and we urgently need to develop more effective therapies for this disease. Although a plethora of new agents have shown promise in preclinical cancer models, clinical trials in patients remain the hallmark of clinical research in oncology and the key to developing more effective therapies for patients with cancer. We define clinical trials as scientific investigations that evaluate the safety and/or particular outcome(s) of a therapeutic or nontherapeutic intervention in a defined group of patients. According to ClinicalTrials.gov, “a clinical trial is a research study to answer specific questions about vaccines or new therapies or new ways of using known treatments.” Clinical trials (also called medical research or research studies) are used to determine whether new drugs or treatments are both safe and effective, and they are the main conduit through which the Food and Drug Administration (FDA) approves agents for use in humans. Over the last several decades, clinical trial methodology has evolved from simple, small, prospective studies to large, sophisticated studies that incorporate many correlative-science and quality-of-life objectives.

Although studies have become more complex, they can still be broken down broadly into four categories or phases: “phase I tests a new drug or treatment in a small group to evaluate dose and safety; phase II expands the study to a larger group of similar patients with a defined treatment or intervention; phase III expands the study to an even larger group of people; and phase IV takes place after the drug or treatment has been licensed and marketed.” (1) Phase III clinical trials are usually the definitive trials providing evidence for or against a new experimental therapy, and they have become the gold standard in assessing the efficacy of a new experimental arm or a device (2, 3). Friedman et al. define a phase III clinical trial as “a prospective controlled evaluation of an intervention for some disease or condition in human beings” (2). There are generally three purposes of randomized phase III trials: (i) to determine the efficacy of a new treatment compared to an observation/placebo arm, (ii) to determine the efficacy of a new treatment versus a standard therapy, or (iii) to test whether a new treatment is as effective as a standard therapy but associated with less morbidity (3). The main objectives of a clinical trial are to obtain reliable answers to important clinical questions and, more importantly, to change medical practice. Results from a single phase III trial are not sufficient for the intervention to be considered definitive or to change medical practice. When considering strength of evidence
FIGURE 1.1 Hierarchy of strength of evidence.
1. Anecdotal case reports
2. Case series without controls
3. Series with literature controls
4. Analyses using computer databases
5. “Case-control” observational studies
6. Series based on historical control groups
7. Single randomized controlled clinical trials
8. Confirmed randomized controlled clinical trials
Printed with permission from Green & Byar, Statistics in Medicine, Vol. 3, 1984.
of data, investigators should interpret data from other sources including other phase III trials and results from epidemiologic and other meta-analyses. As presented in Figure 1.1, Green and Byar argued that confirmed randomized controlled phase III trials form the strongest evidence of support for an intervention (4). The basic principles of design are to minimize bias and increase precision of the treatment effect, which will improve the delivery of treatment and eventually improve care for oncology patients.
SCOPE

The landscape of conducting clinical trials in oncology is quickly changing, and many investigators are now exposed to clinical trials without a deep appreciation for or understanding of the basic principles and practical issues of conducting clinical research. Although historically this knowledge was passed down from mentor to student, that practice is increasingly uncommon in today’s educational environment. The goal of this book is to provide a sound foundation for understanding clinical trials and to pass on the decades of experience of seasoned investigators concerning the wide range of topics that are critical to formulating, writing, conducting, and reporting clinical trials. This book is intended for investigators with minimal or some experience in oncology clinical trials who are interested in pursuing a career in academia or industry. In this sense, it seeks to be a guide, if not a mentor. In addition, this book provides a comprehensive, integrated presentation of principles and methodologies for clinical trials to enable readers to become active, competent investigators.

Clinical trials are expensive and time-consuming, and a great deal of thought goes into their planning, execution, and reporting. The time from concept development to study activation varies, depending on the phase of the trial and whether it is a single- or multi-institutional study. In recent reports, Dilts et al. found 296 processes from concept inception to study activation in phase III trials sponsored by the Cancer Therapy Evaluation Program (CTEP), and the median time to activation was 602 days (interquartile range, 454 to 861 days) (5-7). This illustrates the complexity of clinical trials and highlights the many areas where inefficiencies and errors can occur if one does not have the experience, the guidance, or the appropriate personnel to aid in trial development and execution. The development and conduct of a trial require a multidisciplinary approach involving physicians, scientists, biostatisticians, research nurses, experts in regulatory affairs and contract negotiations, data coordinators, and research technicians, all of whom are critical to the success of the study. In particular, biostatisticians play a central role in clinical trials, and collaborating with them at the early design stage ensures that the trial will yield valid and interpretable results. Close collaboration with biostatisticians results in trials with clearly defined objectives, study designs well suited to address the hypotheses being posited, and appropriate analyses. This book is unique because it has contributions from a broad range of members of the multidisciplinary team, who provide their experience and expertise to guide investigators to a successful study. Altman describes the general sequence of steps in a research project as follows: planning, design, execution (data collection), data processing, data analysis, presentation, interpretation, and publication (8). The book is arranged in a similar order, and it focuses on studies in humans, with emphasis on safety considerations in trial design.
The early chapters discuss historical perspectives, along with ethical issues that have been raised in oncology clinical trials, giving the reader a basis for understanding the evolution that has occurred over the last several decades. The next several chapters deal with the most difficult obstacles that a new investigator encounters. The first such obstacle is choosing the correct agents to study. Chau and colleagues review preclinical drug assessment; their chapter highlights the technological advances and preclinical models used to determine the pharmacological profile of a drug, which are critical as you design your trial. Once your agents are chosen, defining the questions, objectives, and endpoints of the trial is crucial, and these aspects of planning often do not receive the critical consideration they deserve. The chapters by Stark et al. and Wu et al. provide first-hand experience from multiple seasoned clinical investigators on their approach to these issues.
Input from biostatisticians is important at each stage of protocol development, and the next series of chapters highlights basic statistical concepts, such as estimation, hypothesis testing, and design considerations for phase I through phase III trials. Throughout these chapters, the authors draw on their extensive experience in trial design and provide actual examples to demonstrate their points. More advanced statistical issues, such as multiple-arm trials, noninferiority trials, Bayesian designs, meta-analysis, and adaptive designs, are discussed later in the book. In the past, there were few standards for writing a protocol, but through immense work by CTEP and other agencies, writing a protocol has been greatly simplified by the use of standardized templates. Although these templates have simplified the writing of a study, there is an art to it, and there are many obstacles that investigators need to consider. Grant et al. take you through the writing of an investigator-initiated study, and Leung highlights important issues in data collection; Kurkjian et al. describe the reporting of adverse events, and Colevas outlines the intricacies of toxicity reporting. Once the study is underway and nearing completion, Korn and Friedlin discuss the importance of interim analysis, and Niedzwiecki and Hollis give us an overview of how to interpret study results. There are many pitfalls that can be encountered in conducting and interpreting a clinical trial, as Green well illustrates; and Halabi discusses the importance of assessing prognostic factors in cancer studies. In the past decade, there has been increased emphasis on decreasing the time needed for a new drug to be approved. Thus, biomarkers, surrogate endpoints, novel imaging techniques, pharmacokinetic and pharmacodynamic assessments, genetics, and intensive quality-of-life monitoring have been incorporated into investigational studies to help move drug development along.
The authors of these chapters update the reader on the pros and cons of these approaches. Burdette-Radoux and Muss identify populations of patients who require special clinical trial considerations, and Samsa gives us a dose of reality by guiding us through the cost analysis of investigational studies. At times a single trial may not provide enough information to support a conclusive result, and Brunelli and colleagues review the advantages and limitations of performing a meta-analysis. The last several chapters are dedicated to some of the more practical issues that novice investigators encounter but for which they cannot always get a straight answer. These include chapters that describe the requirements of regulatory affairs in studies: how does the FDA review and approve new agents; how
should one interact with industry; how do we define the role of study personnel; how ought we to write an informed consent form; or how do the cooperative oncology groups operate? Developing an understanding of all these issues is imperative for successful clinical trialists. Finally, Harzstark and Small look into the future to give us an idea of where we should be in the next decade and how clinical trials will evolve further.
RESOURCES

Several books on clinical trials have focused on randomized phase III trials (9–12). The available books emphasize statistical or clinical principles and concepts, whereas in this book we present a balanced perspective on clinical trials. Our intention is to enhance statistical thinking and understanding among a wider professional audience. Unlike other books that focus only on randomized phase III clinical trials, we include topics that emerge earlier in the traditional paradigm, such as preclinical drug assessment (Chapter 4), design of phase I trials (Chapter 8), and phase II trials (Chapter 9). There are many resources dedicated to clinical trials, including web-based resources; the list is not exhaustive, but we present some valuable links in Table 1.1. In addition, the Society of Clinical Trials (www.sctweb.org) is an organization dedicated to the study, design, and analysis of clinical trials, with a peer-reviewed journal (Controlled Trials). There are other educational resources, including workshops offered to junior faculty members in academic centers, such as the ASCO workshops, with the
TABLE 1.1
Web-Based Resources
http://www.cancer.gov/
www.cochrane.org
www.clinicaltrials.gov
www.consort-statement.org
www.controlled-trials.com
http://grants.nih.gov/grants/guide/notice-files/not98-084.html
http://deainfo.nci.nih.gov/grantspolicies/datasafety.htm
http://www.emea.eu.int/pdfs/human/ewp/587203en.pdf
http://www.acrpnet.org/chapters/belg/whotdr_guidelines.doc
ONCOLOGY CLINICAL TRIALS
purpose of training junior faculty members in the United States and around the world.
SUMMARY
There is an ever-increasing need for educational resources, and this book is intended to serve as a roadmap for the next generation of clinical trialists in oncology. The objective is to enable the reader to understand the different stages involved in the design, conduct, and analysis of clinical trials. The book can also be used as a teaching aid for clinical fellows in training programs, complemented by lectures and discussion, and it may be of interest to public health students and workers, to contract research organizations, and to departments of medicine where people are involved with clinical trials. Our hope is that the reader will find this book valuable, especially its treatment of the practical issues we have encountered in conducting clinical trials, illustrated with real-life examples of clinical trial failures and successes. Rigorous clinical trials can address important questions relevant to a patient population and support valid inferences about the therapy being tested. Such studies should be designed starting with a hypothesis, an explicit definition of endpoints, appropriate identification and selection of the patient population, and a sample size large enough to provide high power to detect small to moderate clinical effect sizes. In addition, these studies should be monitored so that a trial can be terminated early when patients would benefit from a promising treatment or be spared a harmful regimen.
References
1. www.clinicaltrials.gov.
2. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 3rd ed. New York: Springer-Verlag; 1998.
3. Simon RS. Design and conduct of clinical trials. In: DeVita VT, Hellman S, Rosenberg SA, eds. Cancer: Principles and Practice of Oncology. Philadelphia, PA: J.B. Lippincott; 1993:418–440.
4. Green SB, Byar DP. Using observational data from registries to compare treatments: the fallacy of omnimetrics (with discussion). Stat Med. 1984;3:361–373.
5. Dilts DM, Sandler A, Cheng S, Crites J, Ferranti L, Wu A, Gray R, MacDonald J, Marinucci D, Comis R. Development of clinical trials in a cooperative group setting: the Eastern Cooperative Oncology Group. Clin Cancer Res. 2008;14:3427–3433.
6. Dilts DM, Sandler A, Cheng S, Crites J, Ferranti L, Wu A, Gray R, MacDonald J, Marinucci D, Comis R. Processes to activate phase III clinical trials in a cooperative oncology group: the case of Cancer and Leukemia Group B. J Clin Oncol. 2006;24:4553–4557.
7. Dilts DM, Sandler AB, Cheng SK, Crites JS, Ferranti LB, Wu AY, Finnigan S, Friedman S, Mooney M, Abrams J. Steps and time to process clinical trials at the Cancer Therapy Evaluation Program. J Clin Oncol. 2009;27:1761–1766.
8. Altman DG. Practical Statistics for Medical Research. 1st ed. London; New York: Chapman and Hall; 1991:1–9.
9. Everitt BS, Pickles A. Statistical Aspects of the Design and Analysis of Clinical Trials. 2nd ed. London: Imperial College Press; 2004.
10. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. 2nd ed. New York: Chapman and Hall; 2002.
11. Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. New York: John Wiley & Sons; 2005.
12. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. John Wiley & Sons; 2004.
2
Historical Perspectives of Oncology Clinical Trials
Ada H. Braun David M. Reese
ORIGINS OF ONCOLOGY: ANTIQUITY, GREECE, AND ROME
Cancer is older than human life, and in fact it is intrinsic to terrestrial biology. The first evidence of tumors was found in bones from dinosaurs of the Jurassic period that lived some 200 million years ago (1). All vertebrate and many invertebrate species can develop cancer (2). Early human evidence of neoplasia includes both primary tumors and metastatic lesions; this evidence spans over 5,000 years, from predynastic Egyptian mummies to early Christian times, and over multiple continents, from the Far East to South America (3–5). The Babylonian “Code of Hammurabi” (1750 BCE), Chinese folklore (“Rites of the Zhou Dynasty”; 1100–400 BCE), as well as medical documents from India (“Ramayana”; 500 BCE) and Egypt, attest to the early recognition of cancer. Perhaps most famously, the George Ebers and Edwin Smith papyri (Egypt, ca. 1550 and 1600 BCE) describe several tumor ailments and their treatment. The Ebers papyrus, which may be considered the oldest textbook of medicine, recommends operations for certain accessible tumors and outlines palliative treatment for inoperable disease, including topical applications (6). The Smith papyrus provides more prosaic case reports; portraying what is likely a tumor of the breast, the scribe annotates, “there is no treatment” (7).
Although these ancient documents describe what we now recognize as malignant tumors, for millennia there was no attempt to study cancer systematically, until Greek physicians founded what we consider Western medicine. These doctors regarded all diseases—cancer included—as having natural (as opposed to supernatural) causes. Hippocrates (ca. 460 to 377 BCE) is traditionally credited as the first to describe cancer as a biological process, a disease entity with both local and distant consequences. Based on observation of the growth patterns of directly visible tumors, such as breast cancers, he coined the term karkinoma, from the Greek word for crab; the term was later Latinized to the familiar carcinoma. According to legend, Galen of Pergamon (ca. 129 to 200 CE) thought the disease was “so called because it appears at length with turgid veins shooting out from it, so as to resemble the figure of a crab; or as others say, because like a crab, where it has once got, it is scarce possible to drive it away” (8). If Hippocratic physicians laid the foundation of empiric medicine by replacing supernatural concepts of disease with meticulous observation and logical inference, Galen has been recognized as the founder of experimental science. The last prominent physician of the Greco-Roman school, he combined results of animal dissections, experiments in physiology, and clinical observation to construct models of human physiology and disease, which he recorded and taught systematically (9).
Following the Hippocratic physicians, Galen surmised that cancer originated in an imbalance of the four humors: specifically, an excess of melan chole (black bile) over yellow bile, phlegm, and blood was thought to drive the formation of malignant tumors. Because excess black bile was the cause of cancer, efforts to remove black bile were the logical treatment. Bloodletting, purgatives, and emetics thus entered the armamentarium of physicians attempting to treat the disease. No one as yet, however, thought to systematically record the results of treatment in a group of patients, or to directly compare one cancer treatment (or no treatment at all) with another. These astute healers did understand that most of their therapies were ineffective, though, and one teaching summarized a view dating to the time of Hippocrates: superficial tumors could sometimes be treated with surgery, but deep-seated tumors should be left alone, as patients often died more quickly with treatment than without it (8). When he died, Galen left behind a formidable literary legacy comprising over 10,000 pages of authoritative treatises. These blended science with Greek philosophy, and they profoundly influenced medicine for 1,500 years, cancer medicine included. The eminent Canadian physician Sir William Osler (1849 to 1919) best described what followed: “fifteen centuries stopped thinking and slept, until awakened by the De Fabrica of Vesalius” (10). Cancer medicine slumbered with the rest of the profession.
THE DAWN OF CANCER SCIENCE: ADVANCES IN PATHOLOGY
In the sixteenth century, advances in anatomy heralded a new era of empiricism. Physicians such as Antonio Benivieni (1443 to 1502; Florence) pioneered the use of autopsy to understand the causes of death, correlating clinical conditions with postmortem findings. In 1543, based on hundreds of dissections, Andreas Vesalius of Brussels (1514 to 1564) published “De Humani Corporis Fabrica” (On the Fabric of the Human Body), the groundbreaking first complete depiction of human anatomy, lavishly illustrated with detailed drawings of the body. Within another century, Italy’s Giovanni Battista Morgagni (1682 to 1771) inaugurated the field of pathological anatomy with his masterpiece “De Sedibus et Causis Morborum per Anatomen Indagatis” (The Seats and Causes of Diseases Investigated by Anatomy). Thereafter, a succession of investigators used increasingly specialized technology to localize disease with ever-greater clarity. Marie François Xavier Bichat (1771 to 1802;
France) identified, with the naked eye, the tissues underlying recognizable organ systems, thus laying the groundwork for histology. With the introduction of improved microscopes in the 1800s, Rudolf Virchow (1821 to 1902; Germany) homed in on the newly discovered building block of life: the cell. Cell theory revolutionized the understanding of cancer, making possible for the first time the systematic study of the disease in the laboratory and the clinic. Virchow defined cancer as a disease of abnormal cells emanating from other cells through division. As he famously stated, “from every cell a cell” (omnis cellula e cellula) (11). Early in his career, Virchow described an abnormal proliferation of malignant white blood cells in a patient. Based on this case study, he coined the term “leukemia” (“white blood”; from Greek leukos, white, and aima, blood), shortly after Thomas Hodgkin (1798 to 1866; Great Britain) characterized the proliferation of malignant cells in lymph glands as “lymphoma” (12). In 1863, the German pathologist Wilhelm von Waldeyer-Hartz (1836 to 1921) further outlined the fundamentals of malignant transformation and carcinogenesis. He postulated that cancer cells originate from normal cells, multiply by cell division, and metastasize (spread to distant sites; from Greek methistanai, to place away) through lymph or blood (13). Observing a nonrandom pattern of metastatic growth in hundreds of autopsy records, Stephen Paget (1855 to 1926) subsequently proposed that the predilection of cancer cells to metastasize to certain organs was “not a matter of chance.” In 1889, the British surgeon planted the groundbreaking “seed and soil” hypothesis that prevails to this day: “when a plant [cancer] goes to seed, its seeds [the cancer cells] are carried in all directions; but they can live and grow only if they fall on congenial soil [a conducive organ microenvironment]” (14). Another hundred years passed before experimental evidence substantiated the theory.
Indeed, though nineteenth-century pathology provided an increasingly accurate description of cancer, it offered little pathobiologic insight. The transition of oncology from a largely descriptive art to an experimental science finally occurred around the turn of the twentieth century.
ONCOLOGY IN THE MODERN ERA: A VERY BRIEF OVERVIEW
The 20th century opens as the experimental era with the systematic study of tumors throughout the animal kingdom, and it . . . promises to widely separate many neoplastic
diseases formerly held to be closely related. It may thereby prove to be the era of successful therapeutics and prophylaxis.
—James Ewing, 1919
The notion that cancer is not a single entity but rather hundreds of biologically distinct illnesses really dates to the monumental textbook Neoplastic Diseases of James Ewing (1866 to 1943; United States), in which he classified tumors according to the tissue they arose from or resembled (15). Ewing, who became the first director of what is now Memorial Sloan-Kettering, tirelessly catalogued tumor cells according to their microscopic features, but he was more than a brilliant laboratory researcher. He recognized the substantial clinical implications his work could have, writing in the preface to the third edition of his text: “Up to a very recent time the practical physician or surgeon has been content to regard all fibromas, sarcomas, or cancers [carcinomas] as equivalent conditions . . . and on this theory to treat the members of each class alike. Upon this theory it was also legitimate to conceive of a universal causative agent of malignant tumors and thus to subordinate many very obvious differences which clinical experience has established in the origin and behavior of different related tumors” (16). Ewing’s insight that different tumor types might arise from distinct sources and might require specific treatment approaches represented a breakthrough in thinking about cancer. In essence, it envisioned targeted treatment and personalized medicine. Ewing’s achievement arose in an era of great excitement about the promise of scientific medicine. Claude Bernard (1813 to 1878; France) had propagated a stringent scientific method and, through intricate experiments in live animals (vivisection), achieved major advances in physiology (17).
The fields of radiology and radiation therapy were born with the fortuitous discovery of X-rays by Wilhelm Conrad Röntgen in 1895, and the discovery of natural radioactivity by Henri Becquerel and Pierre and Marie Curie (18). Surgery matured, spurred by technical improvements and, foremost, by innovations in aseptic techniques and anesthesiology (19). Nursing became a skilled profession and a key component in the fight against disease (20). Alongside and contributing to these advances, hospitals were transformed from charitable asylums for the sick to medical institutions (21). All of these developments together helped lay the scientific groundwork
for clinical research as we have come to practice it today.
THE ORIGINS OF ONCOLOGY CLINICAL TRIALS
Although there are references to what may be loosely considered clinical studies dating back at least to Biblical times, the first true medical trials depended on a specific breakthrough in medical thinking, namely the acceptance of quantitative methods as a fundamental component of clinical research. The notion that simple counting could be a useful tool in medical research arose, as with so many other things, with the ancient Greeks. Epidemiology (from the Greek epi demios, among the people) began with Hippocrates and other Greek physicians, who made rudimentary generalizations about infectious epidemics, such as their seasonal nature. Understanding was limited, however, because counts of specific diseases in defined populations were not collected. The birth of modern epidemiology can be traced to the haberdasher John Graunt (1620 to 1674; Great Britain), who among his varied pursuits studied patterns of death among residents of various London parishes, using the numbers and causes reported in the parish clerks’ weekly burial lists (22). By tracking outbreaks of fever and other common causes of death, Graunt demonstrated in stark terms the uses to which simple statistics could be put. Quantitative observation was first introduced to experimental medicine in the seventeenth century through the groundbreaking work of William Harvey (1578 to 1657), who described the circulation of blood on observational and mathematical grounds (23). One reason Harvey’s arguments carried such great weight—and were ultimately irresistible—was the simple calculations he made. Based on the anatomy of the heart (the volume of the left ventricle) and the normal heart rate, Harvey estimated that the average person pumped approximately 540 pounds of blood in an hour. By simple inference, Galen’s theory that the blood supply was replenished daily in the liver could not be true.
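The back-of-the-envelope calculation behind Harvey's inference can be reconstructed in a few lines. The per-beat output (2 ounces) and pulse rate (72 beats per minute) used here are illustrative assumptions that happen to reproduce the 540-pound figure, not values quoted from Harvey's own writings, which varied:

```python
# Reconstruction of Harvey's arithmetic against the Galenic model of
# daily blood production. All physiological figures are assumed for
# illustration, not taken from Harvey's texts.
OUNCES_PER_POUND = 16

stroke_volume_oz = 2       # blood ejected per heartbeat, in ounces (assumed)
beats_per_minute = 72      # resting pulse (assumed)

ounces_per_hour = stroke_volume_oz * beats_per_minute * 60
pounds_per_hour = ounces_per_hour / OUNCES_PER_POUND
pounds_per_day = pounds_per_hour * 24

print(pounds_per_hour)  # 540.0 pounds of blood pumped per hour
print(pounds_per_day)   # 12960.0 pounds per day
```

Even granting generous margins of error on each assumed figure, the daily total dwarfs any plausible rate of blood manufacture; that is the simple inference that overturned the Galenic model.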
What human could manufacture, in a 24-hour period, the staggering amount of blood required? A little bit of arithmetic put the lie to 1,500 years of dogma. The first use of simple statistics in a clinical-trial setting occurred in the eighteenth century. Anecdotal reports dating to the 1600s had suggested that citrus fruits could prevent scurvy, which was extraordinarily common among sailors undertaking long ocean voyages. Drawing on his personal observations, the
Scottish naval surgeon James Lind determined to test the hypothesis that citrus fruits were effective antiscorbutics. In 1747, on board the HMS Salisbury, he selected 12 patients with scurvy and applied to them, in pairs, six different treatments: oil of vitriol, vinegar, sea water, oranges and lemons, cider, and a combination of garlic, radish, balsam, and myrrh. Those receiving the fruit recovered within a week (24). Once the Navy adopted these findings and began issuing lemon juice to all its sailors, scurvy was essentially eliminated from the fleet. Despite successes such as Lind’s, it was not until well into the twentieth century that the randomized clinical trial—now enshrined as the gold standard by which medical therapies are assessed—was formally introduced. The landmark study that launched the modern era of clinical trials evaluated the effectiveness of streptomycin as an antituberculosis agent, and it owed its design and execution in large part to the efforts of a statistician who wanted to introduce physicians gently to the concepts of randomization and experimental design, as outlined by colleagues such as Fisher. Austin Bradford Hill, a professor at the London School of Hygiene and Tropical Medicine and the author of a landmark textbook of medical statistics, was the primary driver behind the study. At the time of the study, there was a limited amount of streptomycin available in Great Britain. This drug scarcity, coupled with the variable natural history of the disease, led Hill and his coinvestigators to believe that a randomized study in which half of the patients would not receive the experimental medication could be ethically justified. Anticipating contemporary studies, the trial had strict eligibility criteria, including bilateral, progressive lung infiltrates, bacteriologically documented disease, and age between 15 and 30 years. Patients were randomly assigned to bed rest (standard therapy) or streptomycin.
Standardized case report forms were developed and used, and radiologists who were blinded to treatment assignment assessed serial chest X-rays. When the data were examined, the results could not have been more clear-cut. Substantially more patients receiving streptomycin experienced radiographic improvement, and at the end of the 6-month observation period only 7% of those receiving the antibacterial had died, compared with 27% in the bed-rest arm of the study (25). The feasibility and practical utility of a randomized trial had been demonstrated beyond a shadow of a doubt. Additional evidence for the utility of a statistical approach in medicine came from the field of cancer itself, with the publication in the 1950s of the landmark studies correlating smoking with the development of lung cancer. As early as 1761, a linkage between the
development of cancer and exposure to an external agent (a carcinogen) was first postulated, when the London physician and polymath John Hill issued his pamphlet “Cautions Against the immoderate Use of Snuff: Founded on the known Qualities of the Tobacco Plant; and the Effects it must produce when this Way taken into the Body: and Enforced by Instances of Persons who have perished miserably of Diseases, occasioned, or rendered incurable by its Use.” Hill associated heavy snuff use with nasal tumors and, bucking the tide of medical opinion, recommended against its use (26). It was not until the middle of the twentieth century, however, with the development of sophisticated statistical techniques, that the causative relationship between tobacco and cancer became irrefutable. Commissioned by the British Medical Research Council in 1947, Austin Bradford Hill and Richard Doll analyzed potential causes for the dramatically rising mortality from lung cancer. Their comprehensive case-control study of over 2,400 patients concluded unequivocally that “smoking is a factor, and an important factor, in the production of carcinoma of the lung” (27). Ernst Wynder and Evarts Graham in the United States published a similar large survey of over twelve hundred patients the same year, again identifying “tobacco smoking as a possible etiologic factor in bronchiogenic carcinoma” (28). Finally, in the 1960s, the link between smoking and cancer was officially recognized. The streptomycin trial, along with the lung cancer epidemiologic studies, powerfully established the value of a statistical approach to medical research. How was this new thinking incorporated into the just-developing field of oncology? In the remainder of this chapter we briefly review the rise of clinical studies in cancer medicine, with a particular emphasis on the development of chemotherapy as illustrative of the wholesale adoption of controlled trials. The notion that chemicals might control cancer actually has an ancient history.
Hippocratic doctors treated superficial tumors with ointments containing toxic copper compounds. Later, in the first century CE, the physician and compounder Dioscorides, one of the patron saints of pharmacy, employed autumn crocus, whose active ingredient, colchicine, was later shown to possess mild antitumor effects. Arsenicals in particular enjoyed widespread use, mostly as topical applications, from ancient Egypt, through Galen and Falloppio, until the early nineteenth century (29). The first successful systemic cancer chemotherapy was reported by Heinrich Lissauer, who described remissions in two patients with leukemia treated with Fowler’s solution, a then common cure-all based on arsenic (30). In spite of these anecdotal reports, though, and given a profound lack of evidence to support the use of drugs
or chemicals, standard treatment for cancer in the early twentieth century remained either surgery or radiation therapy. The term and concept of modern chemotherapy—the use of chemicals to treat disease—were coined by Paul Ehrlich (1854 to 1915) in the early 1900s. Ehrlich introduced the use of laboratory animals to screen chemicals for their potency against diseases, leading to the development of arsenicals to treat syphilis and trypanosomiasis. He investigated aniline dyes and the first alkylating agents as potential drugs to treat cancer, and summarized his observations in what is regarded as the first textbook of chemotherapy (31). During the first four decades of the twentieth century, the development of adequate models for cancer drug screening took center stage in experimental oncology (32). A major breakthrough was achieved by George Clowes of Roswell Park Memorial Institute, who developed the first transplantable tumor systems in rodents, allowing for standardized testing of a large number of drugs (33). The initiation of clinical studies of modern chemotherapy can be traced to World Wars I and II. The use of mustard gas in World War I and an accidental release of sulfur mustards (Bari Harbor, Italy) in World War II were observed to cause severe lymphoid hypoplasia and myelosuppression in exposed soldiers (34, 35). In 1942, Alfred Gilman and Louis S. Goodman were commissioned by the U.S. State Department to examine the potential therapeutic use of toxic agents developed for chemical warfare (36). When they observed marked regression of lymphoid tumors in mice, they convinced their colleague, thoracic surgeon Gustav Lindskog, to treat a patient with non-Hodgkin’s lymphoma (NHL) with a closely related compound, nitrogen mustard. Significant, albeit temporary, tumor remission was observed. The investigators and colleagues went on to treat several dozen more patients, with variable success.
Although nitrogen mustard was clearly no magic bullet for hematologic malignancies, for the first time in history a systemic chemical agent had been shown, under controlled clinical conditions, to combat cancer cells. The principle was established that cancer cells may be more susceptible to certain toxins than are normal cells. In 1946, after wartime secrecy restrictions had been lifted, the clinical data were published, and the era of cancer chemotherapy had arrived (37, 38). In the two decades that followed, improved alkylating agents were developed (e.g., cyclophosphamide, chlorambucil) that became key components of leukemia and lymphoma treatment regimens. More chemotherapeutic approaches were to follow. Sidney Farber observed that folic acid, the vitamin whose deficiency causes megaloblastic anemia, stimulated proliferation
of acute lymphoblastic leukemia (ALL) cells in children. In collaboration with industry, Farber obtained newly synthesized antifolates (aminopterin, amethopterin [methotrexate]), which were the first drugs to induce remissions in children with ALL (39). Methotrexate displayed activity against a variety of other malignancies, including breast cancer, ovarian cancer, and head and neck cancer. Most remarkably, single-agent methotrexate became the first chemotherapy agent to cure a solid tumor, choriocarcinoma, a germ cell malignancy originating in the placenta. Methotrexate was also the first agent to demonstrate the benefit of adjuvant chemotherapy—preventing recurrence of osteosarcoma following surgery. Additional anticancer drugs entered clinical trials in the 1950s, including 6-mercaptopurine (6-MP), an inhibitor of purine metabolism; the vinca alkaloids; and 5-fluorouracil, an inhibitor of DNA synthesis (32). Natural products such as the taxanes (e.g., paclitaxel; 1964; from the bark of the Pacific yew tree) and the camptothecins (e.g., irinotecan; 1966; from a Chinese ornamental tree) were developed under the auspices of C. Gordon Zubrod at the NCI (40). Many more were to follow, including platinum compounds (e.g., cisplatin, carboplatin) and topoisomerase II inhibitors (e.g., anthracyclines and epipodophyllotoxins). In the development of all of these drugs, controlled clinical studies were essential to establish their effectiveness, and it can be argued that oncology has used the randomized study more systematically than any other field in medicine. It has been only 150 years since cancer was recognized as a disease of cells, and a mere six decades since the introduction of the randomized clinical trial. Today, multimodality treatment, often incorporating molecular markers or targeted therapy, has become standard treatment for many malignancies.
Our task for the future will be to retain the essential features of the randomized study, while developing new clinical trial methodologies that allow the most efficient investigation of novel therapeutics.
References
1. Greaves MF. Cancer: The Evolutionary Legacy. Oxford; New York: Oxford University Press; 2000.
2. Huxley J. Biological Aspects of Cancer. London: Allen & Unwin; 1958.
3. Urtega OB, Pack GT. On the antiquity of melanoma. Cancer 1966;19:607–10.
4. Strouhal E. Tumors in the remains of ancient Egyptians. Am J Phys Anthropol 1976;45:613–20.
5. Weiss L. Observations on the antiquity of cancer and metastasis. Cancer Metastasis Rev 2000;19(3–4):193–204.
6. Bryan CP. The Papyrus Ebers. New York: Appleton; 1931.
7. Breasted JH. The Edwin Smith Surgical Papyrus. Special ed. Chicago, Ill.: University of Chicago Press; 1930.
8. Kardinal CG, Yarbro JW. A conceptual history of cancer. Semin Oncol 1979;6(4):396–408.
9. Siegel RE. Galen’s System of Physiology and Medicine. Basel; New York: Karger; 1968.
10. Osler W, Camac CNB. Counsels and Ideals from the Writings of William Osler. Boston: Houghton Mifflin; 1906.
11. Rather LJ. Rudolf Virchow’s views on pathology, pathological anatomy, and cellular pathology. Arch Pathol 1966;82:197–204.
12. Hodgkin T, Lister JJ. Notice of some microscopic observations of the blood and animal tissues. Philos Mag 1827;2:130–8.
13. Triolo VA. Nineteenth century foundations of cancer research: advances in tumor pathology, nomenclature, and theories of oncogenesis. Cancer Res 1965;25:75–106.
14. Paget S. The distribution of secondary growths in cancer of the breast. Lancet 1889;133(3421):571–3.
15. Ewing J. Neoplastic Diseases: A Treatise on Tumors. Philadelphia and London: W.B. Saunders Company; 1919.
16. Ewing J, Raney RB. Neoplastic Diseases: A Treatise on Tumors. 3rd ed., rev. and enl., with 546 illustrations. Philadelphia; London: W.B. Saunders; 1928.
17. Bernard C. Introduction à l’étude de la médecine expérimentale. Paris: Baillière; 1865.
18. Hayter CR. The clinic as laboratory: the case of radiation therapy, 1896–1920. Bull Hist Med 1998;72(4):663–88.
19. Wangensteen OH, Wangensteen SD. The Rise of Surgery: From Empiric Craft to Scientific Discipline. Minneapolis: University of Minnesota Press; 1978.
20. Maggs C. A general history of nursing: 1800–1900. In: Bynum WF, Porter R, eds. Companion Encyclopedia of the History of Medicine. London: Routledge; 1993:1300–20.
21. Risley M. House of Healing: The Story of the Hospital. New York: Doubleday; 1961.
22. Rothman KJ. Lessons from John Graunt. Lancet 1996;347(8993):37–9.
23. Harvey W. An Anatomical Disputation Concerning the Movement of the Heart and Blood in Living Creatures. Oxford: Blackwell Scientific; 1976.
24. Porter R. The Greatest Benefit to Mankind: A Medical History of Humanity.
New York: Norton; 1997.
25. Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. Br Med J 1948;2:769–782.
26. Petrakis NL. Historic milestones in cancer epidemiology. Semin Oncol 1979;6:433–44.
27. Doll R, Hill AB. Smoking and carcinoma of the lung: preliminary report. Br Med J 1950;2:739–48.
28. Wynder EL, Graham EA. Tobacco smoking as a possible etiologic factor in bronchiogenic carcinoma: a study of 684 proved cases. JAMA 1950;143:329–36.
29. Burchenal JH. The historical development of cancer chemotherapy. Semin Oncol 1977;4:135–46.
30. Lissauer H. Zwei Fälle von Leucaemie. Berl Klin Wochenschr 1865;2:403–5.
31. Ehrlich P. Beiträge zur experimentellen Pathologie und Chemotherapie. Leipzig: Akademischer Verlag; 1909.
32. DeVita VT Jr, Chu E. A history of cancer chemotherapy. Cancer Res 2008;68:8643–53.
33. Clowes GHA. A study of the influence exerted by a variety of physical and chemical forces on the virulence of carcinoma in mice. Br Med J 1906:1548–54.
34. Krumbhaar EB, Krumbhaar HD. The blood and bone marrow in yellow cross gas poisoning. J Med Res 1919;40:497–506.
35. Einhorn J. Nitrogen mustard: the origin of chemotherapy for cancer. Int J Radiat Oncol Biol Phys 1985;11:1375–8.
36. Gilman A. The initial clinical trial of nitrogen mustard. Am J Surg 1963;105:574–8.
37. Goodman LS, Wintrobe MM, Dameshek W, Goodman MJ, Gilman A, McLennan MT. Nitrogen mustard therapy. Use of methyl-bis(beta-chloroethyl)amine hydrochloride and tris(beta-chloroethyl)amine hydrochloride for Hodgkin’s disease, lymphosarcoma, leukemia and certain allied and miscellaneous disorders. JAMA 1946;132:126–32.
38. Gilman A, Philips FS. The biological actions and therapeutic applications of the B-chloroethyl amines and sulfides. Science 1946;103:409–36.
39. Farber S, Diamond LK, Mercer RD, Sylvester RF, Wolff JA. Temporary remissions in acute leukemia in children produced by folic acid antagonist, 4-aminopteroyl-glutamic acid (aminopterin). N Engl J Med 1948;238:787–93.
40. Chabner BA, Roberts TG. Chemotherapy and the war on cancer. Nat Rev Cancer 2005;5:65–72.
3
Ethical Principles Guiding Clinical Research
Sandra L. Alfano
HISTORICAL PERSPECTIVES
ETHICAL PRINCIPLES
Research involving humans is recognized as essential to the advancement of the practice of medicine. Particularly in oncology, the need for new and innovative treatments remains high, necessitating robust, sound scientific exploration of new treatments. With this fundamental need comes a corollary need: to value and protect the humans who participate in this research, and to pursue sound, ethical research which incorporates that protection. Over the past century, and even into this new millennium, there have been numerous examples of researchers compromising ethical principles in their pursuit of new knowledge. Striking examples, such as the World War II Nazi medical experiments on detainees, the U.S. Public Health Service’s Tuskegee Syphilis Experiment, and the death of Jesse Gelsinger in a gene transfer study have been summarized and analyzed elsewhere (1, 2). As a result of these numerous events, governments and professional organizations have developed codes of conduct or guiding documents to remind us of moral obligations and factors that must be considered when involving humans in research experiments. In particular, for biomedical research, three guidance documents are often referred to in providing a frame of reference for researchers, regulators, and research participants alike.
The Nuremberg Code (3) emerged from the Trials of War Criminals before the Nuremberg Military Tribunals, articulated in the 1947 verdict of the "Doctors' Trial" and published in 1949, in response to worldwide outrage at the use of World War II concentration camp prisoners in cruel human experiments. This simple code reflects a belief that voluntary consent of the human subject is absolutely essential. The fundamental concept of voluntariness was articulated as necessary because detainees had been forced to take part in the experiments, and often put in grave danger, and this was viewed as entirely unacceptable. From this basic concept flows the need for informed decision making, which requires that the volunteer be competent and of sound mind. This requirement for informed consent obtained directly from the subject or volunteer is the foundation of the Nuremberg Code. In 1964, the World Medical Association issued the first Declaration of Helsinki (4), intended as a broader document encompassing the obligations of biomedical researchers in the world community. Especially for international clinical trials, this code provides guidance for the ethical conduct of medical research in a variety of clinical settings. The Declaration is regularly revisited and updated; the latest complete revision was adopted in 2008.
In 1974, in response to national outrage in the United States over revelations of the Tuskegee Syphilis Experiment, the U.S. government enacted the National Research Act, which established the Institutional Review Board (IRB) system for regulating research in the United States. IRBs were thus charged with responsibility for protecting the rights and welfare of human subjects participating in research studies, and for ensuring that research is conducted in accordance with accepted ethical standards. The Act also established the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. This national commission met over several years and ultimately issued, in 1979, its report on the ethical principles and guidelines for the protection of human subjects of research, commonly known as the Belmont Report (5). The Belmont Report contains the ethical principles upon which the U.S. federal regulations for the protection of human subjects are based: respect for persons, beneficence, and justice. Each of these fundamental principles is explored in depth in the following sections of this chapter, as they are the underpinnings of our work with human subjects of research. The Belmont Report served as guidance for the federal regulations subsequently promulgated for the protection of human subjects. These regulations, codified in the Code of Federal Regulations (CFR) at 45 CFR 46, "Protection of Human Subjects," govern the review of research and are known as the Common Rule (6). The reader should note that there is controversy over how we should refer to the people who enroll in our studies; Table 3.1 provides an overview of this controversy.
RESPECT FOR PERSONS

The guiding ethical principle of respect for persons sets out the belief that individuals should be treated as autonomous agents and that they have the right to self-determination. This belief, that subjects have the right to choose what will or will not happen to them, entails the concepts of informed consent and voluntariness. A corollary principle is that those with diminished autonomy should be protected. This introduces the concept of vulnerable subjects (explored below), and reminds us that the vulnerability of a given population or person can change over time, depending on the context and situation. The principle is demonstrated through the process of informed consent, requiring attention to capacity, information, comprehension, and voluntariness. There should be thoughtful planning of the consent process, with the intent of incorporating four elements: informing (conveying information), assessing comprehension, assessing autonomy (capacity to make one's own decisions), and obtaining consent (agreement to be a subject). The essence of consent, and of the resultant consent form that is used, should be information sharing. Researchers should keep in mind that the intent of a consent form is not to serve as a legal document, but rather to inform the participant. Thus, efforts should be made to avoid legalese, contract-based terms that focus on legal issues and may diminish comprehension. The consent process should be conducted as a collaborative dialogue, a negotiation, a give and take, intended to facilitate understanding. This of course involves discussion as an educational interchange (not simply
TABLE 3.1
How Should We Refer to Those Who Enroll in Clinical Trials? Harkening back to the National Commission and attitudes toward research in the 1970s, the Belmont Report and the resultant regulations refer to human subjects. In contrast, perhaps out of a desire to be more politically correct, some now advocate referring to those enrolled as participants or volunteers. While some research entails negotiation and collaboration between researchers and those who participate, classic biomedical clinical trials largely do not. There are relative power relationships, with researchers designing and driving the research protocols and enrollees agreeing to follow directions. Especially in oncology clinical trials, the person enrolled shares neither in development of the protocol nor in research activities in a participatory sense; rather, these persons are the subjects of the research techniques, and the outcomes they endure are the measures of the research. So it is argued that, if we are honest, we should refer to these persons without crediting them with a more participatory role than they are actually allotted, hence referring to them as human subjects. There is, however, general agreement that those who enroll in clinical trials should not be referred to as patients, in an endeavor to remind all team members that the researcher-subject relationship is quite different from the physician-patient relationship. It is also best to remember that the individual agreeing to enroll in a study must demonstrate a willingness to assume the role; it seems best to be clear about what that role is.
TABLE 3.2
How to Conduct the Consent Discussion.
· Ensure autonomy. Begin with an invitation to participate, and ensure there is no coercion or undue influence (remove any impediments).
· Allow ample time for a considered response. Whenever possible, solicit consent in advance of commencing research activities.
· The consent form is not a confidential document; encourage the prospective subject to take it home and share it with family and providers.
· Consider breaking up the material into multiple sessions.
· Verify understanding and comprehension, and pay attention to verbal and nonverbal cues.
· Use witnesses or consent monitors, if necessary.
· Consider use of a quiz.
· Provide a copy of the signed consent form for the subject to keep; the document may be viewed as a reference.
· Consent forms should be written at a 6th to 8th grade reading level and in the second person (first-person language is more contract-oriented).
handing the person the form to read). Efforts should be made to avoid using jargon and medical acronyms which may not be understood by a lay person. There is not universal agreement about the amount of information that should be shared. Too much information may overwhelm the subject and obscure meaningful information. Too little may leave researchers open to criticism for failure to disclose. Perhaps a compromise is to follow the reasonable person standard. The reasonable person is a hypothetical, rational, reasonably intelligent individual who is intended to represent a sort of average citizen. When writing a consent form, it might be helpful to ask, “What would a reasonable person want to know?” The process of consent should be described in the protocol, including a description of where the session will take place, how much time is allotted for consideration, and who will conduct the session. Table 3.2 provides some practical tips on conducting the consent discussion and additional information about the consent document itself, while Table 3.3 emphasizes the ongoing nature of consent. The required elements of consent are detailed in Table 3.4. Obtaining truly informed consent can be challenging, especially when dealing with special
populations, such as children, non-English speaking individuals, and those with decisional impairment. For children, we largely rely on permission granted by the parent(s), but should also demonstrate respect for the child by soliciting their assent in most cases. For older children, or for long-term studies, a plan should also be developed to re-negotiate their personal consent upon reaching the age of majority. Non-English speaking individuals present a challenge, and researchers are reminded that there is a need for initial consent, as well as ongoing communication in a language understood by the subject. This applies to both verbal and written communication, hence leading to the requirement to provide a translated version of the consent document. For verbal translation, it is recommended to avoid using family members who themselves may not understand very well, or who may filter the information to the subject. For the decisionally impaired, generally surrogate consent is solicited from a legally authorized representative (LAR). However, because this situation involves someone other than the participant deciding about enrollment, the decision to allow surrogate consent must be based on degree of risk of harm and probability of direct benefit.
TABLE 3.3
Ongoing Nature of Consent, and Obligation to Share New Information.
· The consent process continues throughout study participation.
· An informed subject is better able to comply with and complete the study.
· Share new information as it becomes available, and renegotiate consent with existing subjects (use a consent addendum to direct attention to whatever new information is being shared).
· Ask, "Do you want to continue to participate?"
· After completion of the study, inform subjects of the results.
TABLE 3.4
Required Elements of Consent.
· Explain that the proposal is research. Describe the purpose, the expected duration of participation, and the required procedures in lay terms (differentiating between standard of care and research).
· Describe reasonably foreseeable risks/discomforts.
· Describe potential benefits (compensation is not a benefit).
· Disclose alternatives, if any.
· Promise confidentiality/describe limits (who will have access to study records).
· Explain provisions in case of injury, if applicable.
· Identify whom to contact with questions.
· Explain voluntary participation and right to withdraw (no waiver of rights, no effect on relationships).
· Address economic considerations, number of subjects to be enrolled, etc.
BENEFICENCE The second fundamental ethical principle of beneficence tells us that persons are treated in an ethical manner not only by respecting their decisions and protecting them from harm, but also by making efforts to secure their well-being. This Belmont principle generally encompasses two rules: (1) do not harm, and (2) maximize possible benefits/minimize possible harms. This principle translates into a focus on the risk/benefit relationship that is encompassed in the clinical trial. Researchers and their IRBs must make decisions about the currently unknown: when is it justifiable to seek certain benefits despite the risks involved, versus when are the potential benefits insufficient because of the risks? In developing a clinical trial protocol, the researcher should continually ask, “Are the risks presented justified?” This must be addressed both in the initial analysis as part of the development and approval of the proposed protocol, as well as through ongoing monitoring of risks and benefits throughout the study (via a data and safety monitoring plan). An acceptable risk/benefit relationship must exist for a protocol to be approved, and in order for the research to be allowed to continue. Examining the risk side of the relationship requires understanding fundamentally that research by its very nature involves risk; all must accept that subjects may be exposed to risk and may be harmed. The principle reminds us that we have an obligation to minimize the probability of harm, to maximize the potential benefits and to never knowingly cause (permanent) injury. Investigators must identify risks and objectively estimate magnitude and likelihood, both in the protocol for IRB review, as well as in the consent form for potential subjects. The risk/benefit analysis should be presented to prospective subjects in the consent form. A difficult issue to address is the level of
detail required in the consent form when presenting risks. What should be disclosed? One view is that all possible risks, anything that has been seen in trials to date, should be listed in the spirit of full disclosure. Another view is that emphasis should be placed on the most common, most important, most serious risks, so that subjects will be careful in their decision making. A possible compromise is to list at least what a reasonable person would find important to make a decision (see earlier discussion of the Reasonable Person Theory.) Safety of human subjects is of paramount importance; researchers address this through development of appropriate research design, establishment of inclusion/exclusion criteria, and feedback from subjects throughout the study. Despite best efforts, however, it must be acknowledged that not all potential risks are known. Every research protocol should have a data and safety monitoring plan (DSMP) for how data will be reviewed, how safety will be monitored, and how reporting and stopping rules will be accomplished. The plan should be commensurate with risks, size, and complexity of the protocol. Researchers should provide an explicit statement of risk along with the rationale (supported by previous work done). There should be an adverse event grading and attribution scheme that is developed prior to study onset. There must be a plan for reporting unanticipated serious adverse events to appropriate persons/bodies, within an established (prompt) time frame. And there should be an adequate plan for regular safety and data review and reporting. In some instances, a Data and Safety Monitoring Board (7) (DSMB) (also referred to as a Data Monitoring Committee [DMC]) may be established, primarily to provide a broad context for safety monitoring. A DSMB usually looks at global data, including all adverse event reports either completely unblinded or categorized by treatment arm. 
As such, the DSMB is able to determine whether a clear effect exists in one
TABLE 3.5
Designing a Sound Scientific and Ethical Research Study.
· Start with a highly competent research team.
· Use a good design, plan for data and safety monitoring, and select the least susceptible subjects.
· Identify opportunities for risk exposure and procedures for minimizing risks.
· Utilize research procedures that have the least likelihood of harm.
· Ensure adequate monitoring so that adverse events are quickly identified, managed, and reported.
· Ensure privacy and confidentiality are protected.
· Maximize benefits, including direct, indirect, and societal benefits.
· Remember that the goal of all research is to produce generalizable knowledge.
arm of the study versus the other(s). The DSMB is thus able to apply stopping rules, either because of emerging toxicity or because of futility concerns. There are no universal requirements for DSMBs, but they are required by the National Institutes of Health (NIH) for all phase III trials. They may be appropriate for phase I or phase II trials if there are multiple sites, the study is blinded, there are high risks, or vulnerable populations are involved. They may also be required by the IRB if a potential for conflict of interest exists. The benefit assessment is equally challenging, and must be reasonable in relation to the phase of the trial, and what is known to date about effects of the intervention. There may be potential direct benefits to the enrolled subjects, or there may be future benefits to society. There may also be indirect benefits that accrue from participation in research in general. While all types of benefits are legitimate, and may weigh in the risk/benefit assessment, researchers must be careful to avoid overstating potential benefits. Table 3.5 provides considerations in the design of a research study, to emphasize sound scientific design in ethical research.
JUSTICE The Belmont Report tells us, “An injustice occurs when some benefit to which a person is entitled is denied without good reason or when some burden is imposed unduly. . . .” This sets up an ethical obligation: the fair sharing of burdens and benefits, with the corresponding requirement to ensure equitable selection of research subjects. This is manifest in fairness in inclusion and exclusion criteria, asking the questions, “Does the research involve individuals who are unlikely to benefit from the results of the research?” and “Who is likely to benefit? What connection do they have to the research subjects?” It is important to note that when the Belmont Report was written, there was a national
attitude of protectionism, which focused on the potential risks or burdens of research, and the need to protect subjects from the unfair burden of research. In the late 1980s, there began a movement that focused more on fairness in access to the potential benefits as a justice issue. There was attention given to expanded access to clinical trials and earlier access to investigational agents, which continues today. Researchers need to be attuned to both sides of this issue, ensuring appropriate inclusion to distribute possible benefits, while also avoiding targeting one group to bear the risks that will offer benefits to others. A critical aspect of the principle of justice focuses on development of inclusion and exclusion criteria for the clinical trial. These criteria embody the attributes necessary to accomplish the purpose of the research. Well defined criteria will increase the likelihood of producing reliable and reproducible results, decrease the likelihood of harm, and guard against exploitation of vulnerable populations. The concept of vulnerability deserves some consideration. Federal regulations define such subjects as: [V]ulnerable to coercion or undue influence, such as children, prisoners, pregnant women, mentally disabled persons, or economically or educationally disadvantaged persons . . . [and require that] . . . additional safeguards have been included in the study to protect the rights and welfare of these subjects. Protections that must be in place for the aforementioned groups are specified in special subparts of the federal regulations. But it is important to recognize that vulnerability extends beyond these defined groups, and really reflects a condition in which there is a substantial inability to protect one’s own interests. This condition thus interferes with autonomy or decision-making capacity, and may involve personal circumstances which expose subjects
TABLE 3.6
IRB Approval Considerations, in the Context of Ethical Principles.
· Is the risk/benefit relationship reasonable?
· Are risks minimized?
· Is selection of subjects equitable?
· Is appropriate informed consent planned?
· Will the data collected be adequately monitored?
· Are there adequate provisions to protect privacy and maintain confidentiality of data?
· Are there additional safeguards, if needed (children, prisoners, decisionally impaired, etc.)?
to intimidation or exploitation. Examples of potentially vulnerable populations include children, prisoners, pregnant women, the critically ill, the decisionally impaired (beyond those with cognitive deficits, consider those with brain metastases or those who have just received a devastating diagnosis), homeless persons in need of money, and some clinic populations. Researchers should consider implementing additional safeguards for these populations. For example, the researcher or the IRB may place limits on the level of risk to which such subjects may be exposed, or may require or allow surrogate consent, use of a consent monitor or subject advocate, or quizzes to assess comprehension. Some would also advocate introducing a delay in the consenting process to allow adequate time for decision making, and for some populations it may be useful to consider an independent evaluator or incorporation of a DSMB. Recruitment of subjects is an important activity that starts the subject selection process, and thus must be sound and ethical. The IRB must review and approve recruitment methodology and content, since this activity is considered the beginning of the consent process. Methodologies might include advertisements, internet postings, registries, targeted letters, phone calls, and so forth. Efforts must be in place to ensure that recruitment activities are accurate and truthful, do not overemphasize benefits or payments, and do not underestimate possible risks.
IRB APPROVAL ISSUES The IRB is responsible for protecting the rights and welfare of human subjects participating in research studies and ensuring that research is conducted in accordance with accepted ethical standards. Table 3.6 and Table 3.7 review approval considerations for research protocols, and focus on evaluation of the protocol application and the consent form. For initial IRB review, there will be scrutiny of the risk profile, the plan
for data and safety monitoring, and the consent process. For continuing IRB review, the focus will be on whether there is any new information that might alter the risk/benefit ratio, and whether unanticipated problems have occurred. Accrual will also be examined, to be certain the trial is proceeding according to plan and will allow project completion and generation of reliable results.

SPECIAL TOPICS FOR CONSIDERATION

Ethical Issues with Phase I Oncology Trials

In biomedical research, the early work of translating laboratory (preclinical) research into the clinical arena in humans is referred to as phase I research. It is important to recognize that the major objective of a phase I study is to characterize the investigational agent's toxicity profile, and to determine a dose and schedule appropriate for phase II testing.
TABLE 3.7
How Are the Principles Applied?
· Careful review of the protocol, especially the
  · Research hypothesis, scientific rationale, and study design
  · Inclusion/exclusion criteria
  · DSMP and stopping rules
  · Risks/benefits
  · Consent process
  · Confidentiality provisions
  · In-case-of-injury section
· Careful review of the consent form, ensuring that the
  · Purpose and research procedures are well described
  · Risks and anticipated benefits are reasonable
  · Confidentiality and privacy are addressed
  · Alternative treatments are explained
  · Voluntariness is stressed
While traditional phase I studies use healthy volunteers, phase I oncology trials typically enroll patients with cancer who have exhausted standard therapy. This approach, which is deemed scientifically appropriate and is intended to spare healthy volunteers the toxicity exposure, may nonetheless be ethically suspect. Concerns center on design (designed to characterize toxicity), benefit (little to no benefit to participants), and risks (unknown risks, often therefore felt to be potentially high). Indeed, older data estimate that the benefit, as measured by response rate, is only about 1.5% to 5%, while the risk of toxicity is substantial, with an actual mortality rate of approximately 0.5% (8). The relatively low clinical benefit, coupled with a small but definite risk of death and serious but unquantified adverse effects, leads to concern that the risk/benefit balance is too heavily weighted on the risk side. These concerns are compounded by the need for a substantial time commitment from participants, often at the end of life. And there are concerns that informed consent may be given under the cloud of the therapeutic misconception: the misunderstanding that participating in research is the same as receiving individualized treatment from a physician. Research subjects fail to appreciate that the aim of research is to obtain scientific knowledge, and that any benefit that may accrue is a by-product of the research. The question is, if cancer patients really knew and understood the intentions of a phase I trial, how could they possibly agree to participate? There are concerns with deficient disclosure, exaggeration of benefits, and minimization of risks. Critics argue that these people either are not given accurate information, or fail to understand the information they are provided.
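The benefit and risk figures debated in this section are simple event proportions (event count divided by the number of subjects enrolled). As a minimal, illustrative sketch, assuming nothing beyond the Roberts et al. counts quoted later in this chapter:

```python
# Illustrative only: the phase I benefit/risk figures discussed in this
# chapter are plain proportions. The counts below are the Roberts et al.
# (ASCO, 1991-2002) data cited in the text.

def rate(events: int, enrolled: int) -> float:
    """Return the event rate as a percentage of enrolled subjects."""
    return 100.0 * events / enrolled

n = 6474                  # patients enrolled across the reviewed trials
responses = 243           # objective responses
fatal_toxicities = 35     # deaths classified as fatal toxicity
serious_toxicities = 670  # nonfatal serious grade 3/4 toxic events

print(f"response rate:         {rate(responses, n):.1f}%")          # 3.8%
print(f"fatal toxicity rate:   {rate(fatal_toxicities, n):.2f}%")   # 0.54%
print(f"serious toxicity rate: {rate(serious_toxicities, n):.1f}%") # 10.3%
```

These computed proportions match the chapter's quoted figures, and they echo the older estimates of roughly 1.5% to 5% benefit against approximately 0.5% mortality.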
Most patients have deficient understanding of the objectives of phase I research, as cancer patients legitimately hope for stabilization, improvement, or even cure. Being potentially vulnerable subjects, their thinking may be clouded, and some say they may be unable to make their own decisions. In addition to concerns about adequate consent, there are ethical concerns about the risk/benefit balance. Is there risk? Certainly, but hopefully it is minimized to the greatest extent possible. Is there benefit? Maybe, but certainly it is minimal due to study design. So, how do we assess the risk/benefit ratio? What standard is used to calculate the answer? Who gets to decide? Several authors have turned to more recent data to try to answer these questions. Roberts et al. (9) examined American Society of Clinical Oncology (ASCO) data from 1991 to 2002, and found results
supportive of the older data mentioned above: there were 243 objective responses among 6,474 patients (a 3.8% response rate), and 137 deaths from any cause, 35 of which were classified as fatal toxicity (0.54%), along with 670 nonfatal serious grade 3 or 4 toxic events (an overall serious toxicity rate of 10.3%). Adding a different perspective, however, was the review by Horstmann and colleagues (10), which examined Cancer Therapy Evaluation Program (CTEP) data from the National Cancer Institute (NCI) from 1991 to 2002. They reported a 10.6% response rate (7.5% partial, 3.1% complete), and also noted that 34.1% of participants had stable disease or a less-than-partial response. This response was accompanied by 58 deaths among 11,935 participants (0.49%) that were at least possibly treatment related, of which 18 were definitely related and 7 probably related (0.21% fatal toxicity); in a subset of studies 14.3% of participants had grade 4 toxic effects, and overall 5,251 grade 4 toxic events were reported among the 11,935 participants (44%). So certainly there are risks, such as death due to the agent being tested (fatal toxicity), grade 4 serious adverse events, and a substantial time commitment at the end of life. But these significant risks might be considered offset by the better response proportions noted in the Horstmann review. In addition, there truly are several different types of benefit, as described by Glannon (11): direct benefit, a direct physiologic effect of the intervention; collateral (indirect) benefit, inclusional benefit from participating in the research; and aspirational benefit, benefit to society and future patients from the results of the study. Glannon describes rationality and decision making, contrasting the therapeutic misconception (a belief in a direct benefit without much, if any, consideration of risk) against rational therapeutic optimism (weighing low probable benefit against risk when one is facing death). Agrawal et al.
(12) described four areas of the decision-making process in phase I oncology trials: how subjects perceive their options and alternatives, what pressures they feel, how they understand the purpose and risks, and how they assess benefits. They interviewed 163 subjects who had enrolled in a phase I oncology trial and found that the majority were well aware of alternatives but largely did not consider them. Subjects did not feel much pressure to participate from researchers or family, but 75% felt pressure because their cancer was growing. They reported that the research purpose of killing cancer cells was most important in their decision making, and that even a 10% chance of death would not dissuade them from participation. These authors went on to conclude that these phase I participants might be viewed as therapeutic optimists: they hoped to benefit although they recognized that others would not. So it seems that while
there may be ethical concerns with phase I oncology trials, limited empirical data support the continued practice of enrolling end-stage oncology patients rather than healthy volunteers.

Ethical Issues with Tissue and Data Banking

Biomedical research increasingly recognizes the value of the samples and data that are collected as part of clinical trials. This has led to a push in recent years to bank samples in various repositories to allow continued future research use. Several ethical concerns arise when considering repositories, touching all three ethical principles:
· Respect for persons concerns involve the adequacy of informed consent, ensuring autonomy, ensuring the right to withdraw, and the privacy and confidentiality of the information. Recommendations are that the purpose of future research be specified in the consent process to the extent possible. Donors should be told who will have access to their information, and what identifiers will be associated with the sample. Where human genetic research is anticipated, the consent form must describe the possible consequences. Consents should specify the conditions under which subjects may withdraw their participation, and whether withdrawal will involve destruction of the samples or simple anonymization.
· Beneficence concerns arise because there is generally no direct benefit for the donor, only indirect or societal benefit, thus necessitating attention to minimizing risk. The main risks involve a potential breach of confidentiality, along with a perceived fear of discrimination and a possible effect on access to or retention of benefits or entitlements (health or life insurance, employability, etc.). Also feared are possible stigmatization and the possibility of altered family relationships. These risks may be minimized by coding or de-identifying samples and by using secure storage and computerized systems that employ secure servers or encryption.
· Justice concerns are centered primarily on ownership issues (13). In the limited cases to date, it seems the courts say that subjects forfeit ownership upon donation (signing informed consent) and voluntary withdrawal from the research (repository) does not equate to directing use or transferring ownership. It is, however, recognized that proprietary rights belong to the subject if the cells are still within the subject. The ownership and justice issues are not resolved yet and may evolve over time. Guidance for managing all of these issues has been developed by the NIH (14)
and the Office for Human Research Protections (OHRP) (15). Conflict of Interest Increasing national scrutiny is being given to those situations in which financial or other personal considerations may compromise, or appear to compromise, the investigator’s professional judgment in conducting or reporting research. It is thought that financial interests held by those conducting research may compromise or appear to compromise the fulfillment of ethical obligations regarding the well-being of the research subjects. Obviously, other issues beyond financial matters may affect conduct, such as seeking prestige, promotion, publication, or prizes. But it is typically the financial interests which garner the most attention and scrutiny, and which may undermine the credibility of the research. When conflicts of interest do arise, they must be recognized, disclosed, and either eliminated or properly managed. Disclosure is a key concept. Financial interests determined to be a conflict of interest may be managed by eliminating them or mitigating their impact. A variety of methods (or combinations) may be effective to minimize risks to subjects. Consider, “Would the rights and welfare of human subjects be better protected by any (or a combination) of the following”: Reduction or elimination of the financial interest; disclosure of the financial interest to the prospective subject; separation of responsibilities; additional oversight or monitoring of the research; modification of the roles of particular research personnel; or change in location of specific research related activities? Unsettled issues remain, and may be explored through the following questions: Is it enough to inform human subjects of the investigator’s financial or potential financial interests? Will disclosing the information to a research volunteer affect their decision to participate? Will it make the process any safer for them? 
Can financial conflicts be managed in a way that doesn’t adversely affect patient safety or influence the objectivity of the research conclusions? It is critical to recognize that conflicts of interest reflect a situation, not a behavior, nor an indictment. Current approaches involve disclosure, followed by elimination, or mitigation, and management.
3 ETHICAL PRINCIPLES GUIDING CLINICAL RESEARCH
FUTURE ISSUES

The IRB system described in this chapter has been in place for decades, and some call for a major overhaul of the system. Some advocate a system of regional or national review in lieu of the local institutional review that predominates now. The National Cancer Institute (NCI) has developed the Central IRB (CIRB); information is available at its Web site, http://www.ncicirb.org. Institutions wishing to use this mechanism must enter into an agreement with the NCI. While there may be criticism of the current IRB system, another perspective calls for heightened scrutiny of conflicts and enhancement of existing protections for human subjects. The tension between too much and too little oversight is likely to continue to grow in a regulated industry such as clinical research. Another future issue will be the tailoring of protocols to small-molecule targets or the genetics of receptors, which may raise new ethical issues that clinical researchers will need to grapple with in developing appropriate research designs. Finally, there is already strong advocacy for registration of clinical trials in a public forum, such as the NIH Web site (http://clinicaltrials.gov). In coming years, this registration will evolve into a requirement for the posting of results, in an effort to better disseminate the results of research.
References

1. Levine RJ. Ethics and regulation of clinical research. 2nd ed. Baltimore: Urban and Schwarzenberg; 1986.
2. Dunn CM, Chadwick GL. Protecting study volunteers in research. 3rd ed. Boston: Thomson Centerwatch; 2004.
3. Nuremberg Code. Trials of War Criminals before the Nuremberg Military Tribunals under Control Council Law No. 10, Vol. 2, pp. 181–182. Washington, D.C.: U.S. Government Printing Office, 1949. (Accessed August 15, 2008, at http://www.hhs.gov/ohrp/references/nurcode.htm.)
4. World Medical Association. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. Adopted by the 18th WMA General Assembly, Helsinki, Finland, June 1964, and as revised by the 52nd WMA General Assembly, Edinburgh, Scotland, October 2000. (Accessed June 16, 2008, at http://www.wma.net/e/policy/b3.htm.)
5. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research, April 18, 1979. (Accessed August 15, 2008, at http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.htm.)
6. Code of Federal Regulations. Title 45 Public Welfare, Department of Health and Human Services; Part 46, Protection of Human Subjects, June 23, 2005. (Accessed August 15, 2008, at http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.)
7. Slutsky AS, Lavery JV. Data and safety monitoring boards. N Engl J Med. 2004;350:1143–1147.
8. Decoster G, Stein G, Holdener EE. Responses and toxic deaths in phase I clinical trials. Ann Oncol. 1990;1:175–181.
9. Roberts TG, Goulart BH, Squitieri L, et al. Trends in the risks and benefits to patients with cancer participating in phase 1 clinical trials. JAMA. 2004;292:2130–2140.
10. Horstmann E, McCabe MS, Grochow L, et al. Risks and benefits of phase 1 oncology trials, 1991–2002. N Engl J Med. 2005;352:895–904.
11. Glannon W. Phase I oncology trials: why the therapeutic misconception will not go away. J Med Ethics. 2006;32:252–255.
12. Agrawal M, Grady C, Fairclough DL, Meropol NJ, Maynard K, Emanuel EJ. Patients' decision-making process regarding participation in phase I oncology research. J Clin Oncol. 2006;24:4479–4484.
13. Charo RA. Body of research: ownership and use of human tissue. N Engl J Med. 2006;355:1517–1519.
14. Research Involving Private Information or Biological Specimens. (Accessed June 16, 2008, at http://grants.nih.gov/grants/policy/hs/PrivateInfoOrBioSpecimensDecisionChart.pdf.)
15. Issues to Consider in the Research Use of Stored Data or Tissues, November 7, 1997. (Accessed June 16, 2008, at http://www.hhs.gov/ohrp/humansubjects/guidance/reposit.htm.)
4
Preclinical Drug Assessment
Cindy H. Chau and William Douglas Figg
The oncology drug development process spans from the discovery and screening phase to preclinical and clinical testing (Fig. 4.1). Preclinical drug assessment remains one of the most important aspects of a successful drug development program. Preclinical research must not only show that a drug candidate possesses therapeutic benefit but also establish that it will not expose humans to unreasonable risks when used in limited, early-stage clinical testing. Well-designed pharmacology, toxicology, and pharmacokinetic studies are important to support safe use conditions for human trials of oncology drugs. An investigational new drug (IND) application (which includes information on animal pharmacology and toxicology studies, manufacturing information, and clinical protocols and investigator information) is filed with the Food and Drug Administration (FDA) before an experimental drug may proceed to human studies. The regulatory aspects of the drug development process, including the details of the IND application, are discussed in a separate chapter. The focus of this chapter is preclinical testing, including a discussion of the technological advances and preclinical models used to determine the pharmacological profile of a drug and to evaluate its acute and short-term toxicity in at least two species of animals. Recent advances in drug discovery programs have identified a large number of drug targets utilizing sophisticated methods such as genomic/proteomic
analyses, high-throughput compound screening, and structure-based drug design. Developing appropriate preclinical models that are predictive of human malignancies will enable the selection of potential drug candidates for further clinical investigation. One of the major challenges is that agents that looked promising in the preclinical phase of development have ultimately failed in pivotal phase 3 clinical trials. In this regard there is a critical need for better approaches to preclinical drug assessment so that only the most promising agents are advanced for further clinical evaluation. Preclinical models should, in theory, define whether a particular potential therapy has activity against tumors with the appropriate drug target, and whether activity might be expected in patients at tolerated doses. The interpretation of data from preclinical models is often perceived as a bottleneck in drug development, and it is hoped that the selection of appropriate animal models with reproducible activity can portend a successful clinical development pathway. Existing models that have been used in the development of traditional cytotoxic drugs will need to be reevaluated and refined for the newer molecularly targeted drugs. Indeed, the emergence of targeted therapy has resulted in a shift in preclinical drug assessment strategies in an attempt to bridge the gap between preclinical models and clinical efficacy. This chapter will describe recent
FIGURE 4.1 Stages of the drug discovery and development process.

Phase of Development: Primary Objectives
Discovery: target identification and validation; screening; lead discovery and optimization
Preclinical Testing: laboratory and animal studies; assess safety and biological activity
Clinical Trials: determine safety and efficacy in phase 0-3 studies
Post-marketing Surveillance: monitor for safety and side effects in phase 4 studies
(Preclinical drug assessment spans the Discovery and Preclinical Testing phases; clinical drug assessment spans the Clinical Trials and Post-marketing Surveillance phases.)
technological advances in in vitro and in vivo models used in cancer drug development that enable drug assessment at the preclinical phase and evaluate their strengths and weaknesses. The chapter will address how preclinical drug assessment can be best implemented to make the drug development process more efficient and accurate.
UNDERSTANDING TECHNOLOGICAL ADVANCES IN CANCER DRUG DISCOVERY AND DEVELOPMENT

The availability of complete genome sequence information and, most recently, the first report of the cancer genome (1) have provided an abundance of new opportunities for discovering novel cancer-related gene expression changes and/or mutations. This pioneering work sets the stage for the use of a more comprehensive, genome-wide approach to unravel the genetic basis of cancer and is the foundation for developing more effective therapies for cancer treatment. These advances have become the basis in preclinical research for a variety of high-throughput screens to identify potential drug targets, for lead optimization, and to determine the extent to which lead compounds affect particular molecular pathways. Perhaps the roots of much of the progress and promise in cancer drug discovery programs lie in the advent of gene expression microarray analyses and related technologies such as genomics and proteomics (2). Indeed, the integration of various technologies proves pivotal not only to target identification, characterization, and validation, but also to lead
optimization. In fact, the platform applied in target discovery can also be further developed and used in analytical validation, as in the case of biomarker discovery and validation platforms. Target measurements can be assessed at different molecular and biological levels with different technologies. Selection of an appropriate assay largely depends on the feasibility of preclinical data interpretation and the limitations of the respective technology. A genomics approach encompasses various methods of gene expression analysis, such as microarrays, which have become the standard technology for target identification and validation. Reverse transcription-polymerase chain reaction is a very sensitive, reproducible technology that is often used to validate microarray-generated data. Comparative genomic hybridization can be used to detect chromosomal alterations associated with certain cancers. Proteomics involves global protein profiling to provide information about protein abundance, location, modification, and protein-protein interactions, as well as to determine the functional relationships among these proteins and how protein complexes are altered in cancer cells or in response to therapy. As such, proteomics provides researchers with valuable information much more rapidly than traditional methods (e.g., cloning and sequencing of genes) for establishing functional relationships among proteins (3). While proteomics is a discovery technology, immunoassays are routinely used for protein assessments because of their straightforward clinical application and translation into potential diagnostic assays. The multiplexing of protein assays can increase throughput for the simultaneous analysis of several proteins;
however, it is limited by the need to standardize assay conditions, the loss of sensitivity relative to single assays, and the quality control of each analyte in the complete multiplex panel (4). Metabonomics (or metabolomics) is the profiling of endogenous metabolites in biofluids or tissue to characterize the metabolic phenotype. The analytical platforms used are based on nuclear magnetic resonance spectroscopy and on liquid chromatography combined with mass spectrometry. It is principally used in drug discovery, although by definition it is the ultimate end-point measurement of biological events. Yet the technology is limited by the lack of comprehensive metabolite databases and by limited throughput, both of which affect data analysis and interpretation. The integration of these technologies gives rise to a field that blends molecular biology and computer science, bioinformatics, in which linking expression data derived from genomic/proteomic approaches to biological pathways can provide a comprehensive understanding of the disease biology and further validate the molecular target (5). Progress in both genomics and proteomics research has created the need for bioinformatics research to develop increasingly sophisticated analytical software, powerful statistical methods, and user interfaces for database management and experimental data mining. Screening of drug candidates may rely on improved and accelerated methods for determining structure-activity relationships that involve both traditional and modified high-throughput screening of very large combinatorial libraries of compounds as well as in silico molecular modeling (rational drug design). To further identify or validate therapeutic targets, and to assess drug efficacy and toxicity in in vitro or in vivo systems, researchers may employ animal models with precisely defined genetic backgrounds.
Gene knockdowns or knockouts in a wide variety of organisms can be achieved using RNA interference technology. A knockout enables scientists to examine experimental animals lacking the gene of interest under various experimental conditions, and to reveal what would happen if an agent against that target were completely effective. The advent of new technologies (genomics, proteomics, etc.), combinatorial chemistry, and high-throughput screening for the identification of potential lead compounds has markedly expanded the cancer drug pipeline. Progress in understanding the genetic and molecular basis of cancer has intensified efforts to identify more selective and targeted anticancer compounds, hence altering the preclinical models and methods used in evaluating these drugs.
PRECLINICAL EVALUATION PROCESS

Proper preclinical evaluation of drug candidates can improve predictive value, lessen the time and cost of launching new products, and accelerate the drug development process. The preclinical phase of drug development ranges from lead candidate selection, to establishing proof of concept and toxicity testing, to the selection of the first human doses. The preclinical evaluation process includes establishing safety and toxicity endpoints, the selection of relevant species, pharmacological characterization, and appropriate analysis and interpretation of preclinical data. Current preclinical safety assessments vary among cytotoxics, small molecules, and biologics. Preclinical efficacy testing occurs in various in vivo models of cancer to evaluate the anticancer activity of drug candidates. These animal models include the traditional syngeneic and human tumor xenograft models, as well as orthotopic and transgenic tumor models. The application of imaging technologies makes all of these model systems more quantitative, particularly enhancing the efficiency of orthotopic and transgenic models. Additional functional imaging studies provide an integrated and quantitative correlation of drug efficacy with mechanism of action.
PRECLINICAL MODELS IN CANCER DRUG DISCOVERY AND DEVELOPMENT

Preclinical screening of anticancer drugs involves testing in two different types of systems: in vitro systems (biochemical screens, cell-based assays, or tissue culture) and in vivo animal models. The National Cancer Institute (NCI) has been involved in the discovery and development of many anticancer agents over the years. To support the preclinical development of novel therapeutic modalities for cancer, NCI established the Developmental Therapeutics Program (DTP) to provide in vitro and in vivo screening services to select and advance active agents in preclinical models toward clinical evaluation. The next sections describe various preclinical models used to advance drug candidates from preclinical to clinical testing.

The NCI Human Tumor Cell Line (60-Cell) Screen

DTP initiated an in vitro screen for potential anticancer drugs utilizing a panel of 60 human tumor cell lines derived from various tissue types, representing
leukemia, melanoma, and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney (6). Natural products collected in an NCI repository are also a major source of chemical entities screened. The aim of the screen is to aid in the selection of a lead compound, or in the further development of a particular class of synthetic or natural compounds, that demonstrates selective growth inhibition or killing of particular tumor cell lines. For a given compound, a dose-response data set is produced that gives a biological response pattern, which can be used in pattern-recognition algorithms either to assign a putative mechanism of action or to determine that the response pattern is unique. Compounds with similar mechanisms of action tend to have similar patterns of growth inhibition in the 60-cell-line screen (7). The pattern of response of the cell lines as a group can be used to rank a compound according to the likelihood of sharing common mechanisms. The COMPARE computer algorithm quantifies this pattern (8), and COMPARE searches the database of screened agents to compile a list of the compounds that are most similar (9). This approach has been used to identify novel tubulin-interacting compounds and topoisomerase poisons (10, 11). Moving beyond growth inhibition and cell killing to characterizing mechanisms of action through the expression of molecular targets in the 60 cell lines (8), NCI has developed collaborations with the cancer research community to establish a Molecular Targets database. Following characterization of various molecular targets in the 60 cell lines, it may also be possible to use the COMPARE algorithm to select compounds most likely to interact with a specific molecular target. Data available in the database include the mutation status of genes important in cancer, quantitation of protein and RNA levels within cells, enzyme activity, and microarray data measuring the baseline expression of thousands of genes (7).
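The ranking idea behind a COMPARE-style analysis can be illustrated with a short sketch. This is not the actual NCI implementation: the compound names and data are hypothetical, a plain Pearson correlation is used as the similarity measure, and the "panel" here has only 5 cell lines rather than roughly 60.

```python
# Illustrative COMPARE-style ranking (hypothetical data and names).
# Each compound is represented by a "fingerprint": one growth-inhibition
# value (e.g., -log10 GI50) per cell line in the screening panel.

from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length response patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def compare_rank(seed, database):
    """Rank database compounds by similarity of their response pattern to
    the seed compound's pattern; a higher correlation suggests a greater
    likelihood of a shared mechanism of action."""
    scores = [(name, pearson(seed, fp)) for name, fp in database.items()]
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Hypothetical 5-cell-line fingerprints.
seed = [6.2, 4.8, 7.1, 5.0, 6.5]
database = {
    "tubulin_binder_A": [6.0, 4.9, 7.3, 5.1, 6.4],  # similar pattern
    "topo_poison_B":    [4.1, 6.8, 4.4, 6.9, 4.0],  # dissimilar pattern
}
ranking = compare_rank(seed, database)
print(ranking[0][0])  # the most mechanistically similar compound
```

A high correlation between two compounds' growth-inhibition fingerprints suggests, but does not prove, a shared mechanism of action; as noted in the text, any correlation generated this way must be verified experimentally.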
While the DTP databases are valuable resources for the discovery and screening process, it is important to emphasize that any correlations generated with COMPARE analysis, or between compounds and Molecular Targets entries, need to be verified experimentally. Additionally, a potential weakness of the human tumor cell lines used in the discovery and characterization of new therapeutic drugs is that they may have lost important properties originally possessed in vivo. In fact, differences in biological properties were demonstrated in a comparison of human tumor cell lines with primary cultures of tumor material taken directly at surgery from cancer patients (12). Therefore, it is especially important to identify cell lines that preserve potential targets, and hence research using primary cultures and early-passage cell lines may prove useful in drug discovery.
Nonetheless, NCI is also reclassifying the cells in the panel according to the types of genetic defects the cells carry. This would enable drugs that address specific defects or targets to be identified and, theoretically, matched to a patient's tumor cell makeup, which should prove useful and important in the current era of personalized medicine.

Hollow Fiber Technology

Once compounds are identified as possessing some evidence of antiproliferative activity in in vitro assays, further evaluation in in vivo models is the next step. However, the cost and time of running the labor-intensive conventional xenograft models with empirical dosing strategies for all possible lead compounds would greatly reduce the rate at which compounds could be evaluated. A short-term in vivo assay, the hollow fiber model, was developed by NCI because in vitro screening output exceeded the available capacity for traditional xenograft model testing. The assay uses semipermeable hollow fibers implanted in animals. The fibers allow tumor cells to grow in contact with each other, and more than one tumor can be implanted into a single animal, providing greater efficiency than would be obtained through a single in vivo experiment (13). A standard panel of 12 tumor cell lines is used for routine hollow-fiber screening of compounds with in vitro activity, and alternate lines can be used for specialized testing on a nonroutine basis. The premise of this technique is that advancing potential anticancer agents identified in an in vitro screen to preclinical development requires a demonstration of in vivo efficacy in one or more animal models (14). Hollow fiber screens can indeed identify compounds that go on to show in vivo activity in traditional xenograft models, and the results appear to correlate well with clinical results (15). The hollow fiber assay is not intended to replace detailed biological models such as transgenic or knockout models.
Rather, it is used as an initial triage point to prioritize compounds that should then be studied in detailed in vivo models for further pharmacological and mechanistic investigation.
Human Tumor Xenografts

Human tumor xenografts grown subcutaneously (s.c.) in nude mice or in mice with severe combined immunodeficiency, representing all major tumor types, have played a significant role in preclinical anticancer drug development testing in vivo. Their use has been validated for
cytotoxics as predictive indicators of probable clinical activity, playing a pivotal role in late preclinical agent optimization and guiding the selection of candidates for phase I trials (15). In parallel with efficacy determinations, the xenograft model is useful in determining an agent's pharmacokinetic and pharmacodynamic markers of response for subsequent clinical application, in that it provides a renewable and readily accessible source of target human tumor cells. Limitations of this model system include the time and expense (relative to the hollow fiber assay and in vitro testing), ethical issues around animal experimentation, a general lack of metastatic spread from primary s.c. implanted xenografts (making it a poor model for studying antimetastatic strategies), and instances where the model is an inappropriate predictor of clinical outcome (16).

Orthotopic Model Systems

Orthotopic models of cancer were developed in an attempt to address a major drawback of s.c. tumor xenografts: they reproduce neither the primary site of the common human cancers nor the common sites of metastasis. Considerable effort has been made to develop more clinically relevant models by orthotopic transplantation of tumors of a variety of types into the appropriate anatomical site in rodents; often these tumors metastasize in a manner similar to the same tumor type in human cancer. It is now clear that the process of metastasis is more efficient in orthotopically implanted tumors and mimics human metastasis (17). A number of clinically relevant targets will be better represented by orthotopic model systems that mimic the morphology, microenvironment, and growth and metastatic patterns of human cancer. Limitations of orthotopic models include the technical aspects of the procedures, which are more difficult and time-consuming, and hence more expensive, than conventional s.c. models.
In addition, the endpoints for determining the therapeutic effects are more complex than the normal tumor measurements in s.c. models (18). Although imaging studies have indicated the potential for monitoring tumor growth noninvasively, it is still unclear whether the use of orthotopic versus s.c. tumors results in a better prediction of clinical response. For example, the matrix metalloproteinase inhibitor Batimastat was shown to reduce tumor progression in an orthotopic model of colon cancer (19), but this compound has subsequently failed clinical testing; thus, orthotopic tumors may overestimate potential clinical efficacy.
Spontaneous and Genetically Engineered Animal Models

Alternatives to tumor transplantation models are animals that naturally develop cancers with features relevant to human disease. These include mice that are genetically engineered to develop cancer and companion (pet) animals that naturally develop cancers. A detailed discussion of the strengths and weaknesses of these models and their appropriate use in integrated drug development programs is given in Hansen and Khanna (20). Briefly, genetically engineered mice are generally immunocompetent, and the tumors they develop are genetically murine and localized in the usual sites. Limitations of these models include the breeding requirements, which make them costly, and the fact that tumors usually develop late in the animal's lifespan, so the models are slow to develop. Few tumor types are available, making it difficult to obtain enough animals to establish reliable statistics. Moreover, very few of these models have been validated as representative of human disease.

IMAGING TECHNOLOGY

Advances in novel imaging approaches have profound implications for drug development, especially imaging methods suitable for rodents, as they offer opportunities for anticancer efficacy models. The application of imaging makes these in vivo model systems more quantitative, particularly enhancing the efficiency of orthotopic and transgenic models. Molecular and functional imaging technologies are used to assess cell proliferation and apoptosis (e.g., 18F-fluoro-L-thymidine and 99mTc-annexin imaging), cellular metabolism (e.g., 18F-fluorodeoxyglucose positron emission tomography), and angiogenesis and vascular dynamics (e.g., dynamic contrast-enhanced computed tomography and magnetic resonance imaging). In addition to the assessment of anticancer activity, functional imaging studies provide an integrated and quantitative correlation of drug efficacy with mechanism of action.
Bioluminescence offers an opportunity to develop rodent models for efficacy evaluation that are more sensitive, more specific, and of shorter duration than those traditionally used, with sensitive endpoints provided by the luciferase read-out (21, 22). These include models using orthotopic implant sites that were previously difficult to monitor for tumor growth. Transgenic mice bearing a luciferase reporter mechanism can be used to monitor the tumor microenvironment and to signal when transforming events occur. The limitation of using bioluminescence as an endpoint in efficacy studies is the requirement for tumor cell lines that
express luciferase, although some commercial sources of luciferase-expressing tumor cell lines are slowly becoming available (22).
TOXICOLOGY EVALUATION

The Toxicology and Pharmacology Branch of the NCI performs pharmacological and toxicological evaluations of new oncology agents. Once a compound of interest is identified, animal models are critical for assessing preclinical toxicology. For all small molecules, the FDA requires that preclinical toxicology studies be conducted in two species, a rodent and a nonrodent, with determination of the maximum tolerated dose and drug behavior in the animal (pharmacokinetics) in both species (23). This requirement for safety/toxicity data results in a variety of studies (pharmacokinetics, pharmacodynamics, range-finding toxicity, IND-enabling toxicity studies) usually being conducted in rats and dogs for most new chemical entities (NCEs). Mice, while favored for preclinical efficacy studies, are not typically used as the rodent species for toxicology studies because they tend to be poorer predictors of human toxicity and their small size precludes serial blood sampling (24, 25). For both traditional cytotoxic and molecular target-based NCEs, there is a need to develop sensitive methodology to determine pharmacokinetics in various species, including plasma protein binding, as well as to determine whether metabolism is important and to identify the metabolic pathways. If possible, appropriate biomarkers are selected to assess target modulation, and sensitive methodology is developed to determine the impact of drug treatment on targets in tumors and selected normal tissues. Finally, there is the need to determine the maximum tolerated doses and dose-limiting toxicities in single-dose studies in both a rodent and a nonrodent, using abbreviated study designs as a prelude to repeated-dose range-finding studies or definitive IND-enabling studies (26).
In the current era of molecular target-based therapies, methodology is developed to determine whether modulating the tumor molecular target is also responsible for toxicity, by correlating plasma drug levels and/or biomarkers with safety and toxicity across species.
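As a toy illustration of the kind of exposure-effect relationship discussed in this section, the sketch below links a one-compartment, first-order-elimination pharmacokinetic model to a sigmoidal Emax model of target modulation. All parameter values (dose, volume of distribution, half-life, EC50) are hypothetical; a real program would fit such parameters to measured plasma concentrations, not assume them.

```python
# Minimal, purely illustrative PK/PD sketch (hypothetical parameters):
# one-compartment PK with first-order elimination, linked to a sigmoidal
# Emax model for target inhibition (e.g., percent inhibition in a biopsy).

import math

def concentration(dose_mg, vd_l, ke_per_h, t_h):
    """Plasma concentration (mg/L) at time t after an IV bolus."""
    return (dose_mg / vd_l) * math.exp(-ke_per_h * t_h)

def effect(conc, emax=100.0, ec50=2.0, hill=1.0):
    """Sigmoidal Emax model: percent target inhibition at a given conc."""
    return emax * conc**hill / (ec50**hill + conc**hill)

# Hypothetical agent: 100 mg IV bolus, Vd = 20 L, half-life ~4 h.
ke = math.log(2) / 4.0
for t in (0, 4, 8, 12):
    c = concentration(100, 20, ke, t)
    print(f"t={t:>2} h  conc={c:5.2f} mg/L  inhibition={effect(c):5.1f}%")
```

Correlating a predicted exposure-effect curve like this with observed toxicity in each species is one way the cross-species comparisons described above can be made quantitative.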
NEW APPROACHES TO CANCER DRUG DEVELOPMENT: PRECLINICAL MODELING OF THE PHASE 0 TRIAL

An effect on the particular molecular target becomes a constant signal, after which pharmacologic, scheduling, and toxicologic studies follow, allowing the incorporation of target or molecular endpoints in early clinical trials to follow logically from the preclinical experience. More safety testing would be required for those compounds that proceed to phase I clinical trials, using the biological, pharmacological, and toxicological properties to define the optimal dose and schedule for human studies. The question that remains to be addressed is how to optimize the preclinical/clinical interface to ensure smoother transitions as the drug candidate moves from the preclinical to the clinical phase of development (Fig. 4.1). The increasing number of molecularly targeted agents has called for better preclinical models to facilitate the development of biomarkers that may make predictive correlations with early clinical endpoints. Optimal evaluation of these molecularly targeted drugs requires the integration of pharmacodynamic (PD) assays into early-phase trials. The incorporation of the exploratory phase 0, or target-development, clinical trial design, which focuses on extensive compound characterization and target assay development (including molecular imaging studies) in a limited number of patients, could expedite the drug development process for these targeted agents. For the purposes of this chapter's discussion of preclinical drug assessment, our interest lies in the preclinical modeling of the phase 0 trial. Basic standards of a phase 0 trial include: (a) validating targets or biomarkers in preclinical models and then in human tissue prior to initiating the clinical trial; (b) defining standard operating procedures for handling tissues and biospecimens prior to initiating the clinical trial; (c) demonstrating drug target or biomarker effect in preclinical models; and (d) determining the relationship between the pharmacodynamics and the pharmacokinetics (27). Essential to this process is a pharmacodynamic assay that has been validated for analytical performance and proven to be therapeutically relevant in preclinical studies. An example is the qualification of a PD assay of poly(ADP-ribose) polymerase (PARP) in tumor biopsies of mouse xenografts, which facilitated the design of a phase 0 trial of ABT-888, a PARP inhibitor, and serves as a model for developing proof-of-principle clinical trials of molecularly targeted drugs (28).

CONCLUSIONS

Recent advances in cancer drug discovery have resulted in increasingly rapid identification of therapeutic targets, and improvements in validating these targets through refined in vitro systems and more sophisticated in vivo models of cancer provide an important foundation for developing anticancer agents with the
4 PRECLINICAL DRUG ASSESSMENT
potential to be highly specific, potent, and nontoxic. Adapting to advancements in novel technologies and cancer science requires modifying drug screens, developing new in vitro and in vivo models and exploring more effective toxicological evaluations. Thus, there will always be a critical need to find better approaches to preclinical drug assessment so that only the most promising agents are advanced for further clinical evaluation. Caution should be taken when halting developmental programs prematurely due to lack of efficacy in available preclinical models, thereby discarding potentially useful agents in the process. The focus on how preclinical development can be improved to reduce the number of false positives and/or false negatives remains a challenge to oncology drug development. Despite the progress that has been made in every stage of cancer drug discovery and development, the success rate for oncology agents remains disappointing. There is a growing need to improve the efficiency in moving agents from the preclinical phase to the market with more efficient and rational translational approaches and close collaboration between laboratory and clinical scientists.
References
1. Ley TJ, Mardis ER, Ding L, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456:66–72.
2. Clarke PA, te Poele R, Wooster R, Workman P. Gene expression microarray analysis in cancer biology, pharmacology, and drug development: progress and potential. Biochem Pharmacol. 2001;62:1311–1336.
3. Petricoin EF, Zoon KC, Kohn EC, Barrett JC, Liotta LA. Clinical proteomics: translating benchside promise into bedside reality. Nat Rev Drug Discov. 2002;1:683–695.
4. Kingsmore SF. Multiplexed protein measurement: technologies and applications of protein and antibody arrays. Nat Rev Drug Discov. 2006;5:310–320.
5. Ilyin SE, Belkowski SM, Plata-Salaman CR. Biomarker discovery and validation: technologies and integrative approaches. Trends Biotechnol. 2004;22:411–416.
6. Monks A, Scudiero D, Skehan P, et al. Feasibility of a high-flux anticancer drug screen using a diverse panel of cultured human tumor cell lines. J Natl Cancer Inst. 1991;83:757–766.
7. Holbeck SL. Update on NCI in vitro drug screen utilities. Eur J Cancer. 2004;40:785–793.
8. Monks A, Scudiero DA, Johnson GS, Paull KD, Sausville EA. The NCI anti-cancer drug screen: a smart screen to identify effectors of novel targets. Anticancer Drug Des. 1997;12:533–541.
9. Paull KD, Shoemaker RH, Hodes L, et al. Display and analysis of patterns of differential activity of drugs against human tumor cell lines: development of mean graph and COMPARE algorithm. J Natl Cancer Inst. 1989;81:1088–1092.
10. Leteurtre F, Sackett DL, Madalengoitia J, et al. Azatoxin derivatives with potent and selective action on topoisomerase II. Biochem Pharmacol. 1995;49:1283–1290.
11. Solary E, Leteurtre F, Paull KD, Scudiero D, Hamel E, Pommier Y. Dual inhibition of topoisomerase II and tubulin polymerization by azatoxin, a novel cytotoxic agent. Biochem Pharmacol. 1993;45:2449–2456.
12. Baguley BC, Marshall ES. In vitro modelling of human tumour behaviour in drug discovery programmes. Eur J Cancer. 2004;40:794–801.
13. Decker S, Hollingshead M, Bonomi CA, Carter JP, Sausville EA. The hollow fibre model in cancer drug screening: the NCI experience. Eur J Cancer. 2004;40:821–826.
14. Hollingshead MG, Alley MC, Camalier RF, et al. In vivo cultivation of tumor cells in hollow fibers. Life Sci. 1995;57:131–141.
15. Johnson JI, Decker S, Zaharevitz D, et al. Relationships between drug activity in NCI preclinical in vitro and in vivo models and early clinical trials. Br J Cancer. 2001;84:1424–1431.
16. Kelland LR. Of mice and men: values and liabilities of the athymic nude mouse model in anticancer drug development. Eur J Cancer. 2004;40:827–836.
17. Killion JJ, Radinsky R, Fidler IJ. Orthotopic models are necessary to predict therapy of transplantable tumors in mice. Cancer Metastasis Rev. 1998;17:279–284.
18. Bibby MC. Orthotopic models of cancer for preclinical drug evaluation: advantages and disadvantages. Eur J Cancer. 2004;40:852–857.
19. Wang X, Fu X, Brown PD, Crimmin MJ, Hoffman RM. Matrix metalloproteinase inhibitor BB-94 (batimastat) inhibits human colon tumor growth and spread in a patient-like orthotopic model in nude mice. Cancer Res. 1994;54:4726–4728.
20. Hansen K, Khanna C. Spontaneous and genetically engineered animal models; use in preclinical cancer drug development. Eur J Cancer. 2004;40:858–880.
21. Contag CH, Spilman SD, Contag PR, et al. Visualizing gene expression in living mammals using a bioluminescent reporter. Photochem Photobiol. 1997;66:523–531.
22. Hollingshead MG, Bonomi CA, Borgel SD, et al. A potential role for imaging technology in anticancer efficacy evaluations. Eur J Cancer. 2004;40:890–898.
23. DeGeorge JJ, Ahn CH, Andrews PA, et al. Regulatory considerations for preclinical development of anticancer drugs. Cancer Chemother Pharmacol. 1998;41:173–185.
24. Grieshaber CK, Marsoni S. Relation of preclinical toxicology to findings in early clinical trials. Cancer Treat Rep. 1986;70:65–72.
25. Olson H, Betton G, Robinson D, et al. Concordance of the toxicity of pharmaceuticals in humans and in animals. Regul Toxicol Pharmacol. 2000;32:56–67.
26. Tomaszewski JE. Multi-species toxicology approaches for oncology drugs: the US perspective. Eur J Cancer. 2004;40:907–913.
27. Kummar S, Kinders R, Rubinstein L, et al. Compressing drug development timelines in oncology using phase '0' trials. Nat Rev Cancer. 2007;7:131–139.
28. Kinders RJ, Hollingshead M, Khin S, et al. Preclinical modeling of a phase 0 clinical trial: qualification of a pharmacodynamic assay of poly (ADP-ribose) polymerase in tumor biopsies of mouse xenografts. Clin Cancer Res. 2008;14:6877–6885.
5
Formulating the Question and Objectives
Lauren C. Harshman Sandy Srinivas James Thomas Symanowski Nicholas J. Vogelzang
There are no bad anticancer agents, only bad clinical trial designs.
—DD Von Hoff, MD

Formulating a relevant question and designing a clinical trial can be daunting to the young investigator. Most interesting questions arise from the ability to identify deficiencies in treatment, whether in the therapies for a particular type of patient, such as colorectal cancer patients with an activating mutation in the KRAS (v-Ki-ras2, Kirsten rat sarcoma viral oncogene homolog) gene, or in the line of treatment, such as the treatment of refractory pancreatic cancer, where randomized controlled trials have been notably lacking (1, 2). The ability to identify unmet needs is enhanced by experience, which equates to time in the field, a trait most young investigators aspire to but do not yet possess. This chapter will delineate some of the key steps in formulating a clear, succinct question with answerable objectives. Common criticisms of clinical trials are failure to define a transparent question with clear objectives or to develop a strategy with achievable outcomes. A central theme throughout this chapter will be the elimination of ambiguity. Unambiguous descriptions of objectives and outcomes will produce more interpretable and reproducible results that can be readily applied to clinical practice. The goals of clinical trials
are manifold, but generally include answering a question that will change clinical practice in terms of therapy, diagnosis, or the utility of a prognostic or predictive surrogate marker of response. Without a thoughtful trial design, the primary question may not be properly answered, and the result could be erroneous acceptance that a therapy is successful (a type I error) or failure to show that a good therapy works (failing to reject the null hypothesis, a type II error). The terms question, objectives, and endpoints or outcomes can become muddled in the young investigator's mind during the development of a clinical trial. To define the question, first determine the nature of the answer sought. Does the study aim to evaluate a drug's efficacy, or is the goal to assess the utility of a surrogate biomarker or imaging study? The objectives, in turn, will delineate what needs to be accomplished in order to answer the question. The endpoints or outcomes can be deduced from this approach, and might more understandably be described as "how the objectives will be achieved or measured." The fundamental components of formulating a good clinical question, answering it, and changing practice (i.e., making it relevant) include:

· Thorough background knowledge of the disease, therapy, and/or biomarkers (or correlative study endpoints) that pertain to the primary question
· Clearly defined objectives
· Early consultation with a biostatistician
· Statistically sound trial design
· A suitable and accessible patient population to study
· A rational drug or combination, dose, surrogate markers, or imaging studies
· Appropriate endpoints that will measure the objectives
· Proper statistical analysis
· Effective communication and dissemination of the results
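The type I and type II errors defined above can be made concrete with a quick simulation. The sketch below is illustrative only: the response rates, sample size, and decision threshold are invented for the example. It repeatedly runs a hypothetical single-arm trial and estimates how often a fixed decision rule wrongly declares an inactive drug active (a type I error) or misses a truly active one (a type II error).

```python
import random

random.seed(0)

def run_trial(true_response_rate, n_patients=40, successes_needed=10):
    """Simulate one single-arm trial: declare the drug 'active' if at
    least `successes_needed` of `n_patients` respond."""
    responses = sum(random.random() < true_response_rate for _ in range(n_patients))
    return responses >= successes_needed

N_SIM = 20_000
# Type I error: the drug is truly inactive (response rate at the
# uninteresting null level), yet the rule declares it active.
type1 = sum(run_trial(0.15) for _ in range(N_SIM)) / N_SIM
# Type II error: the drug is truly active (response rate at the
# hoped-for level), yet the rule declares it inactive.
type2 = sum(not run_trial(0.35) for _ in range(N_SIM)) / N_SIM

print(f"estimated type I error:  {type1:.3f}")
print(f"estimated type II error: {type2:.3f}")  # power = 1 - type II
```

Changing `n_patients` or `successes_needed` shifts the two error rates in opposite directions; balancing that trade-off is precisely the calculation a biostatistician performs, exactly rather than by simulation, when sizing a trial.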
THE IMPORTANCE OF BACKGROUND KNOWLEDGE

When evaluating a new therapy or determining the proper use of a new molecular marker or imaging study, a comprehensive review of the background on the disease or patient population is critical. Review prior investigations and studies that have determined the current standard treatment. Evaluate past unsuccessful trials and assess the reasons for failure. This systematic appraisal will highlight deficiencies in existing treatments, biomarkers, or imaging modalities for the disease. By thoroughly understanding the mechanism of action of a new drug or the etiological pathway that triggers the disease, you may identify the right intervention to investigate. In addition to confirming that your question is relevant, this review of the background of the disease, intervention, biomarkers, or other secondary or correlative study endpoints may elicit additional questions that require resolution. Further, the appropriate dosing or process to study can be ascertained by discussion with the company owning the rights to the drug or process, speaking with other investigators who have worked with the agent or process, evaluation of the preclinical data, and review of the results of phase I studies. Finally, in the case of rare diseases, review of the background data, or lack thereof, may highlight a potential niche for the young investigator, or offer the opportunity for a translational project when no prior clinical studies have been performed.
CLEARLY DEFINE THE OBJECTIVES

The objectives assert the goals of the study. Objectives should be clearly defined such that they can be investigated by a quantitative assessment of appropriate outcomes (3). Gebski and colleagues devised a checklist for objectives in clinical trials (3, 4):

· Are the intervention and control (e.g., usual care) described in detail?
· Has the target patient population been specified?
· Has the degree of benefit from the intervention on a particular outcome, and the time frame, been specified?
· Have any secondary outcomes been prespecified in similar detail?

Avoid Ambiguity

Ambiguous objectives should be avoided as they can lead to skepticism about trial results. Vague objectives may cause reviewers to wonder if definitions were created post hoc and adjusted to fit the data (3). The need for clear definitions is especially important in the era of targeted agents, where controversy has arisen regarding what constitutes appropriate objectives and endpoints and how best to assess them. Speaking with respect to second-line therapy for hormone-refractory prostate cancer, Dahut asked, "What is a meaningful measure of clinical benefit?" (W.L. Dahut, American Society of Clinical Oncology Annual Meeting 2008). He asserted that it is tumor growth that causes morbidity and mortality, not the lack of tumor response. As such, perhaps oncologists should stop therapy only for objective progression, not for a lack of tumor response. This concept is somewhat difficult to reconcile with the historic goal of "shrinking the tumor" or "making it disappear." While such goals have been validated as surrogate endpoints for many types of cancers, such as leukemia, lymphoma, and germ cell tumors, and have been formalized in the accepted Response Evaluation Criteria in Solid Tumors (RECIST) method of determining response (RECIST defines therapeutic success in terms of either partial or complete responses) (5), the vast majority of solid tumors do not disappear when treated with chemotherapy. Rather, radiologically and clinically they appear to stabilize or slow down. This dilemma of how to appropriately characterize clinical efficacy is further exemplified by the use of surrogate biomarkers (e.g., PSA, LDH, CA19-9) and the cytostatic targeted agents.
With both surrogate biomarkers and the cytostatic agents, using RECIST to assess response has come under question. RECIST does not include biomarker response, but it is generally agreed that normalization of, or decreases in, these markers can aid in identifying therapeutic efficacy. In terms of the cytostatic agents, which tend to stabilize disease rather than dramatically decrease tumor size, the concept of incorporating clinical benefit (CB) as a measure of response has arisen. Here, clinical benefit incorporates stable disease (SD) in addition to partial response (PR) and complete response (CR) (CB = SD + PR + CR) as a measure of therapeutic success.
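The clinical benefit definition above (CB = SD + PR + CR) is straightforward to operationalize. A minimal sketch, using assumed category labels rather than the full RECIST specification, shows how it differs from the classic objective response rate for a cytostatic agent:

```python
# RECIST-style best-response categories (labels assumed for illustration):
# CR = complete response, PR = partial response,
# SD = stable disease, PD = progressive disease.

def response_rate(best_responses):
    """Classic objective response rate: PR + CR only."""
    hits = sum(r in ("CR", "PR") for r in best_responses)
    return hits / len(best_responses)

def clinical_benefit_rate(best_responses):
    """Clinical benefit rate: adds stable disease (CB = SD + PR + CR)."""
    hits = sum(r in ("CR", "PR", "SD") for r in best_responses)
    return hits / len(best_responses)

# Hypothetical trial of a cytostatic agent: little shrinkage, much stabilization.
cohort = ["SD"] * 14 + ["PR"] * 2 + ["PD"] * 4

print(f"objective response rate: {response_rate(cohort):.0%}")        # 10%
print(f"clinical benefit rate:   {clinical_benefit_rate(cohort):.0%}")  # 80%
```

The invented cohort makes the point numerically: an agent that "fails" by response rate alone may still confer substantial clinical benefit once stable disease is counted.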
Proponents of including CB as a measure of response propose that SD is an acceptable and achievable goal when managing incurable diseases that are refractory to cytotoxic agents. In summary, in formulating an unambiguous question, be certain to evaluate carefully the role of endpoints with biomarkers versus endpoints with clinical or radiological examinations, and clearly define what constitutes clinical efficacy. This topic will be discussed in more detail in Chapters 6 and 7.

Primary versus Secondary Objectives

A study often has one primary objective and several secondary objectives. On rare occasions there can be co-primary endpoints. The primary objective is the focus of the study and directs the answer to the question or hypothesis, whereas the secondary objectives are often derivative or ancillary. In order for a study to be clinically meaningful and lead to a change in clinical practice, it must achieve its primary objective. Secondary objectives are certainly important; they can enhance the robustness of the results observed for the primary objective and can lead to additional studies, but they are not of the same level of importance as answering the primary question. Safety assessment is a notable exception to this general rule (3). Even when safety is listed as a secondary objective, it is of equal importance to the primary objective in human subjects. In addition, be cautious not to overload the study with too many secondary objectives and methods of assessment, as this may dilute the impact of the study and confuse the Institutional Review Board (IRB), the Food and Drug Administration (FDA), and your eventual reviewers and readers. Determination of the primary and secondary objectives also depends on the type of trial planned. Some common objectives distinguish phase I trials from phase II or phase III trials.
In a phase I study of a new agent, the primary objective might be to define the maximum tolerated dose (MTD) or the optimal biological dose. In the United States, the MTD is generally the dose below that which induced dose-limiting toxicity at an unacceptably high rate, and it is the dose that will be investigated in the phase II study. Secondary objectives of a phase I trial commonly include describing the toxicities and pharmacokinetics of the agent. In phase II trials, assessing whether a drug or intervention has antitumor activity or can stabilize disease, in the form of response or progression-free survival, is commonly the primary objective, whereas safety or prolongation of overall survival might be rational secondary objectives. The goal of phase III trials is often to determine which therapy or intervention is more clinically effective. The investigational arm is usually compared to a control arm in the form of the current standard of care intervention or, if none exists, best supportive care or placebo. When assessing whether the new therapy is more effective, the primary objective can be statistically designed to determine whether the new therapy is superior, equivalent, or noninferior. Noninferiority designs can be utilized if it is hypothesized that the experimental intervention has similar primary clinical efficacy (usually in the form of overall survival) and additionally has secondary advantages such as a better safety profile or more convenient drug administration. This primary objective is frequently measured using endpoints of prolongation of overall survival or progression-free survival. In phase II and phase III trials, the study will be powered to prove the primary objective, so it is especially important to distinguish between the primary and secondary objectives.

Choose Feasible Objectives

Understand whether the objectives of the study are feasible. For example, evaluating improvement in survival using a new therapy in locally advanced prostate cancer is rarely practical, as patients may live 5 to 10 years after diagnosis. In these cases, a primary objective other than survival should be pursued. For example, a molecular endpoint using tissue from the resected specimen or a change in biomarkers, such as prostate-specific antigen (PSA) or circulating tumor cells, may be more reasonable. Also, a clear understanding of the number of patients who are potentially eligible to enroll in the trial from your site or cooperative group will be necessary. Seek assistance from the tumor registrar and the clinical trials office at your institution in developing accurate assessments of the number of patients likely to be eligible and willing to enter the trial.
Only by taking the time to adequately assess the available patient resources for the trial will your question be answered effectively.

Choose Pharmaceutical Company- and FDA-Pleasing Objectives

It is important to understand what the FDA and the pharmaceutical industry consider important when evaluating a therapy or intervention for approval. In some disease states, especially when there is a surfeit of agents for one indication, demonstrating an improvement in response rate is insufficient; instead, proof that the agent prolongs overall survival is required. In disease states where it is not feasible to demonstrate an improvement in overall survival, proof that the agent prolongs progression-free survival may be the basis of approval, provided that the improvement is recognized as
a clinical benefit to patients. In other rarer, more treatment-resistant diseases, such as renal cell carcinoma (RCC), where no therapy existed that consistently improved overall survival, two of the three drugs approved since 2005 were approved based on prolongation of progression-free survival (6–8). The recent demonstration of improvements in overall survival with temsirolimus and sunitinib (granted that the survival benefit for sunitinib had a p value of 0.051) portends that the standard for future FDA approvals for RCC therapies will be higher (7, 9).
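Returning briefly to the phase I objective discussed above, defining the MTD, that objective is often operationalized with a rule-based dose-escalation scheme. Below is a minimal sketch of the classic "3+3" convention (one common approach among several; the dose levels and toxicity counts are invented, and real protocols specify many more details):

```python
def three_plus_three(dlt_counts):
    """Walk a classic 3+3 escalation given observed dose-limiting
    toxicities (DLTs) per cohort of 3 patients.

    dlt_counts[i] = (DLTs in first cohort, DLTs in expansion cohort)
    at dose level i. Returns the index of the declared MTD, or None
    if even the lowest dose is too toxic.
    """
    mtd = None
    for i, (first, expansion) in enumerate(dlt_counts):
        if first == 0:
            mtd = i          # 0/3 with DLT: tolerable so far, escalate
            continue
        if first == 1 and first + expansion <= 1:
            mtd = i          # 1/6 with DLT: still tolerable, escalate
            continue
        break                # >= 2 DLTs: too toxic; MTD is the level below

    return mtd

# Hypothetical agent with four dose levels (mg/m^2) -- numbers invented.
levels = [10, 20, 40, 80]
# 0/3 at 10; 0/3 at 20; 1/3 then 0/3 more at 40; 2/3 at 80 -> stop.
observed = [(0, 0), (0, 0), (1, 0), (2, 0)]

mtd_index = three_plus_three(observed)
print(f"declared MTD: {levels[mtd_index]} mg/m^2")  # 40 mg/m^2
```

The rule is deliberately conservative: it asks only "escalate or stop?" at each level, which is why phase I trials using it list the MTD, toxicities, and pharmacokinetics as objectives rather than efficacy.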
TRIAL DESIGN AND BIOSTATISTICAL INPUT

Perhaps of paramount importance in designing a good clinical trial is early consultation with a biostatistician. It is essential to clearly communicate the hypothesis and the background information supporting your assumptions, objectives, and outcome measures to your biostatistical colleague. A good biostatistician will verify the assumptions by evaluating pre-existing data and, in the context of phase II and phase III trials, calculate the sample size required to achieve the primary objective or endpoint. Once an appropriate sample size has been determined, confirm whether your institution has adequate patient numbers to successfully accrue to the study. If not, a change in the trial design or in the primary objective upon which the study is powered may be necessary. Collaboration with other institutions may permit larger sample sizes and enhance the generalizability of your study. In the case of randomized trials, the biostatistician will ensure that adequate baseline factors are incorporated into the randomization scheme in order to minimize bias and variability as they pertain to treatment group comparisons. The statistician will provide an a priori plan for the statistical analysis of the objectives and trial. It is critical to ensure that the trial design is statistically sound and, when applicable, adequately powered to answer your question and objectives. As discussed above, phase III trials often appraise the efficacy of a new agent by assessing whether it is equivalent, superior, or noninferior to the standard of care therapy. These three different primary objectives of equivalency, superiority, or noninferiority require different statistical analyses and sample sizes. A second example of a primary objective necessitating different trial designs is whether the purpose is to evaluate response to an agent or to assess prolongation of progression-free survival.
In this case, a randomized design should be considered for assessment of a benefit in progression-free survival, whereas a single-arm
design may be sufficient for investigating a primary objective of response. In the latter case, a Simon two-stage design may be utilized, as it exposes fewer patients to the investigational agent and halts the trial if a prespecified number of patients do not meet response criteria. A third example that highlights the importance of choosing an appropriate trial design involves efficacy studies of agents such as histone deacetylase inhibitors (HDACi; e.g., vorinostat) or DNA methyltransferase I inhibitors (DNMTi; e.g., decitabine). It may not be of interest to show antitumor activity or disease stabilization with these agents as monotherapies, but rather to determine whether they act synergistically with standard of care agents or radiation. Trial design will be discussed more thoroughly later in this book.
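The Simon two-stage design mentioned above has operating characteristics that can be computed exactly from the binomial distribution. The sketch below evaluates a hypothetical design (the parameters r1/n1 and r/n are invented for illustration, not taken from Simon's published tables): stage 1 enrolls n1 patients and stops for futility if r1 or fewer respond; otherwise accrual continues to n total, and the drug is declared promising if more than r respond.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responses among n patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_characteristics(r1, n1, r, n, p):
    """Return (prob. of stopping after stage 1, prob. of declaring the
    drug promising) for a two-stage design at true response rate p."""
    pet = sum(binom_pmf(k, n1, p) for k in range(r1 + 1))
    declare = 0.0
    for k1 in range(r1 + 1, n1 + 1):          # stage-1 counts that continue
        tail = sum(binom_pmf(k2, n - n1, p)   # need total responses > r
                   for k2 in range(max(0, r - k1 + 1), n - n1 + 1))
        declare += binom_pmf(k1, n1, p) * tail
    return pet, declare

# Hypothetical design: stop if <=1/10 respond; promising if >4/29 respond.
r1, n1, r, n = 1, 10, 4, 29
pet0, alpha = simon_characteristics(r1, n1, r, n, p=0.10)  # null rate
_, power = simon_characteristics(r1, n1, r, n, p=0.30)     # hoped-for rate

print(f"P(early stop | p=0.10) = {pet0:.2f}")
print(f"type I error (p=0.10)  = {alpha:.3f}")
print(f"power (p=0.30)         = {power:.2f}")
```

The appeal of the design is visible in the first number: under the null response rate, most trials stop after only n1 patients, sparing the rest exposure to an inactive agent.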
CHOOSING THE RIGHT PATIENT POPULATION

When crafting the question and objectives, a target patient population should be clearly defined. Which patients with what disease will your drug or intervention treat? What line of therapy are you evaluating (e.g., treatment-naïve versus refractory)? Understand the population available in your clinic or institution and what questions can be adequately answered in that milieu. While it may be interesting to study a third-line treatment for colorectal cancer with a novel targeted agent in combination with standard chemotherapy, such as irinotecan and 5-fluorouracil, if the standard first- and second-line treatments at your center include irinotecan, this is not an ideal population to target. In some instances, the appropriate patient population may arise after identification of a signal of efficacy in a particular tumor type during a phase I study in which several different types of solid tumors were treated with the same drug. The novel drug should then be assessed further in a phase II study in a larger population of patients with the responsive disease. Other times, the right patient population may be identified translationally, whereby a known chromosomal mutation exists in a certain disease and a novel agent that targets this mutation is available for human investigation. The reverse (bedside to bench) example could be to identify a rare disease in which there are few treatment options and investigate a rational agent based on the pathophysiology of the disease. Once the appropriate patient population is identified and the sample size for accrual is reasonable from the standpoint of your clinic, institution, multi-institution collaboration, or cooperative group, the next step is to establish eligibility criteria for the study with
the goal of optimizing the number of patients who can be accrued while maintaining generalizability. This goal requires a fine balance. If criteria are too relaxed, patients with complicated medical issues or unusual disease pathology may confound the study results, whereas overly restrictive criteria may yield a homogeneous population with limited generalizability.
CHOOSING APPROPRIATE ENDPOINTS

Clearly defining a question and objectives requires forward thinking about achievable endpoints or outcome measures. These measures are often referred to as hard and surrogate endpoints. Examples of hard or objective endpoints include response rate as defined by RECIST criteria, progression-free survival, or overall survival. Surrogate endpoints tend to be more tangible and easily attainable measures than direct assessment of response or survival for a disease that may have a long latency to achievable hard endpoints. Examples of surrogate endpoints include blood or tissue biomarkers and imaging correlates. Surrogate endpoints are especially useful in chronic diseases such as prostate cancer, where primary endpoints using survival may take 10 or more years to reach, or when evaluating cytostatic targeted therapies that may not be best assessed by traditional criteria. As mentioned above, it is recognized that antiangiogenic agents such as the VEGF antibodies and tyrosine kinase inhibitors evoke less tumor regression and more stabilization of disease. Endpoints that require traditional anatomic-based assessment, such as RECIST, might label these agents failures in terms of the percentage of PR and CR, when in fact they may elicit a significant clinical benefit in terms of disease stabilization and progression-free survival. Other novel and potentially more useful methods of efficacy assessment for these cytostatic agents, including metabolic- or perfusion-based criteria, are the subject of current investigation, especially with the antiangiogenic agents (10). When defining objectives and how they will be achieved using certain endpoints, keep in mind the individual limitations of the outcome measures involved.
Overall survival may be confounded by treatment with multiple subsequent agents, whereas time to progression (TTP) or response rate can obscure the fact that although the intervention is effective in the short term, it may not prolong survival (11). The latter may be acceptable in some instances where no generalizable agent exists that prolongs survival. For example, gemcitabine was approved for platinum-sensitive, recurrent ovarian cancer by prolonging progression-free survival despite the fact that overall survival was not
extended, because extending the progression-free survival interval was viewed as a clinical benefit in this patient population (12). Endpoints will be more thoroughly discussed in later chapters but are germane to this chapter as they are critical to the big picture of forming your question and objectives.
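The trade-off described above between long-latency endpoints such as overall survival and earlier endpoints such as progression-free survival is ultimately a question of events: time-to-event comparisons are powered by the number of events observed, not the number of patients enrolled. A minimal sketch using Schoenfeld's approximation for the number of events required by a two-arm log-rank comparison with 1:1 randomization (the target hazard ratios and error rates below are illustrative choices, not values from this chapter):

```python
from math import ceil, log
from statistics import NormalDist

def events_needed(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld approximation: events required for a two-sided
    log-rank test with 1:1 randomization to detect `hazard_ratio`."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided type I error
    z_beta = z(power)            # 1 - type II error
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

# A modest effect needs a few hundred events; a subtle one needs far more.
print(events_needed(0.67))  # larger treatment effect, fewer events
print(events_needed(0.85))  # smaller treatment effect, many more events
```

Because a PFS event (progression or death) occurs much sooner than a death, the same number of events accumulates years earlier, which is the statistical rationale behind approvals such as the gemcitabine example above.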
APPRECIATE THE END GOAL: FDA APPROVAL

Understand the FDA approval process a priori when devising your question, objectives, and trial design. The FDA has specific regulations for initiating clinical trials that assess new agents or interventions in human subjects and generally requires that the primary objective be proven in order to approve the intervention. Often the FDA will agree with the importance of the question but have reservations about the design and whether it will achieve the desired objectives. The Prostate Cancer Clinical Trials Working Group 2 recommendations are a good example of consensus guidelines that lack teeth because the FDA has not affirmed the validity of their endpoints, for instance, PSA (13).
SUMMARY

In conclusion, the transparency of the question and objectives, in combination with achievable outcomes and a statistically sound trial design, is essential to accomplishing a superior clinical trial, whether it be molecular, radiological, or therapeutic. Unambiguous descriptions of your objectives and outcomes will produce more interpretable results that can be more easily applied to clinical practice.
References
1. Boeck S, Heinemann V. Second-line therapy in gemcitabine-pretreated patients with advanced pancreatic cancer. J Clin Oncol. 2008;26:1178–1179.
2. Kulke MH, Blaszkowsky LS, Ryan DP, et al. Capecitabine plus erlotinib in gemcitabine-refractory advanced pancreatic cancer. J Clin Oncol. 2007;25:4787–4792.
3. Gebski V, Marschner I, Keech AC. Specifying objectives and outcomes for clinical trials. Med J Aust. 2002;176:491–492.
4. Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA. 2001;285:1987–1991.
5. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors. European Organization for Research and Treatment of Cancer, National Cancer Institute of the United States, National Cancer Institute of Canada. J Natl Cancer Inst. 2000;92:205–216.
6. Escudier B, Eisen T, Stadler WM, et al. Sorafenib in advanced clear-cell renal-cell carcinoma. N Engl J Med. 2007;356:125–134.
7. Hudes G, Carducci M, Tomczak P, et al. Temsirolimus, interferon alfa, or both for advanced renal-cell carcinoma. N Engl J Med. 2007;356:2271–2281.
8. Motzer RJ, Hutson TE, Tomczak P, et al. Sunitinib versus interferon alfa in metastatic renal-cell carcinoma. N Engl J Med. 2007;356:115–124.
9. Figlin RA, Hutson TE, Tomczak P, et al. Overall survival with sunitinib versus interferon (IFN)-alfa as first-line treatment of metastatic renal cell carcinoma (mRCC). J Clin Oncol. 2008;26(May 20 suppl):abstr 5024.
10. Jaffe CC. Response assessment in clinical trials: implications for sarcoma clinical trial design. Oncologist. 2008;13(suppl 2):S14–S18.
11. Schiller JH. Clinical trial design issues in the era of targeted therapies. Clin Cancer Res. 2004;10:4281s–4282s.
12. Pfisterer J, Plante M, Vergote I, et al. Gemcitabine plus carboplatin compared with carboplatin in patients with platinum-sensitive recurrent ovarian cancer: an intergroup trial of the AGO-OVAR, the NCIC CTG, and the EORTC GCG. J Clin Oncol. 2006;24:4699–4707.
13. Scher HI, Halabi S, Tannock I, et al. Design and end points of clinical trials for patients with progressive prostate cancer and castrate levels of testosterone: recommendations of the Prostate Cancer Clinical Trials Working Group. J Clin Oncol. 2008;26:1148–1159.
6
Choice of Endpoints in Cancer Clinical Trials
Wenting Wu Daniel Sargent
A critical element to the success of any clinical trial is the choice of the appropriate endpoint(s). In the context of a clinical trial, an endpoint is defined as a characteristic of a patient that is assessed by a protocol specified mechanism. Examples of endpoints commonly assessed in cancer clinical trials include adverse events, measures of tumor growth or shrinkage, or time-related endpoints such as overall survival. The choice of the endpoint for any given trial depends primarily on the trial’s objectives. In this chapter we will first present general considerations related to the choice of an endpoint for a trial, and then focus on specific endpoints that may be appropriate for various trial types, specifically phase I, phase II, or phase III trials.
GENERAL CONSIDERATIONS

Each clinical trial should specify a single, primary endpoint. This endpoint is the patient characteristic that will most directly capture whether the therapy being tested is having the desired (or undesired, in the case of an adverse event) effect on the patient. The primary endpoint also typically determines the trial's sample size, through the process of specifying what effect on the primary endpoint the trial should be able to detect to a prespecified degree of accuracy. The primary endpoint must be appropriately justified, and typically should be a well-defined and commonly accepted endpoint in that cancer type and/or study phase. Once a primary endpoint has been determined, multiple additional endpoints, known as secondary endpoints, are typically specified. Secondary endpoints should be variables of interest to the study investigators, such that their examination will enhance the utility of the trial in addressing clinically or biologically relevant questions. Formally, statistical testing for treatment-related effects on these secondary endpoints should occur only if there is a statistically significant effect on the primary endpoint; in practice, these endpoints are typically examined regardless of the primary endpoint results. When considering possible endpoints for a clinical trial, an important distinction exists between clinical and surrogate endpoints. A clinical endpoint is one with direct clinical and patient relevance, such as patient quality of life or survival. In most cases, if possible and feasible, a clinical endpoint should be preferred for a protocol. However, in many cases the optimal clinical endpoint may take a very long time to assess (such as patient survival), require a costly procedure (such as imaging), or be too invasive to be practical or ethical (such as requiring a patient biopsy). In such cases, a surrogate endpoint is often considered. A surrogate endpoint is defined as an endpoint obtained more quickly, at lower cost, or less invasively than the true clinical endpoint of interest (1). The practice of
validating, or confirming the accuracy, of a surrogate endpoint is challenging, and there are many examples of endpoints that were considered promising surrogates only to be subsequently shown to have a poor or even negative association with the true endpoint of interest (2). The topic of surrogate endpoints is addressed in depth in a later chapter (3); we do not consider this further here.
ENDPOINTS FOR PHASE I TRIALS

Historically, cancer therapies have been designed to act as cytotoxic, or cell-killing, agents. The fundamental assumption regarding the dose-related activity of such agents is that there exists a monotone nondecreasing dose-response curve: as the dose increases, tumor shrinkage will also increase, which should translate into increasing clinical benefit. Under this assumption, both the toxicity and the clinical benefit of the agent under study will increase with increasing dose, and an appropriate goal of a phase I trial is to find the highest dose with acceptable toxicity. Because a monotone nondecreasing dose-response curve has been observed for most cytotoxic therapies, toxicity has historically been used as the primary endpoint to identify the dose that has the greatest chance to be effective in subsequent testing. In this context, the typical goal of a phase I clinical trial has been to determine the maximum tolerated dose (MTD), traditionally defined as the highest dose level at which at most one of six patients experiences unacceptable or dose-limiting toxicity (DLT). More generally, the highest dose that can be administered with an acceptable toxicity profile is referred to as the MTD. The use of toxicity as an endpoint, and thus the MTD as the primary goal of a phase I trial, has considerable appeal: this endpoint is clinically relevant, straightforward to observe, and easy to explain, and it has a clear intuitive rationale and considerable historical precedent. However, particularly with the newer agents discussed below, the use of toxicity as the primary endpoint poses increasing challenges in modern clinical trials. A first challenge is that it may require a long time and many patients to reach the MTD, with many patients treated at suboptimal doses. Goldberg et al.
(4) conducted a phase I study to determine the MTD and DLT of CPT-11 in the regimen of CPT-11/5-FU/LV for patients with metastatic or locally advanced cancer, where CPT-11 was administered on day 1 and 5-FU/LV was administered on days 2 to 5 of a 21-day cycle. Fifty-six patients were accrued during a period of 38 months and 13 dose
levels were studied. A different design coupled with a different endpoint might have significantly shortened the study duration. A more significant obstacle to the use of toxicity as a phase I endpoint is posed by noncytotoxic therapies. In cancer therapeutics, many novel targets are currently being pursued. At the level of the cell, targets in cytoplasmic and cell-surface signaling molecules, processes of cell-cycle control, and mutations or loss of expression of suppressor genes have all yielded multiple new agents for evaluation. At the extracellular level, there are large numbers of novel drugs designed to limit tumor penetration and metastasis, as well as agents that directly inhibit angiogenesis. Vaccines to overcome immune tolerance of cancer are also being developed. These cytostatic or targeted therapies, many of which appear to be nontoxic at doses that achieve concentrations with the desired biologic effects, have become more common as the biology of cancer has become better understood. In these cases, a dose-escalation trial incorporating a biologic endpoint specific to the agent, in addition to toxicity, may be appropriate (5, 6). In considering a dose-escalation scheme for a trial with a nontoxicity endpoint, it is important to reduce the assumptions made about the shape of the dose-outcome curve. In theory, the dose-response relationship could be monotone nondecreasing, quadratic, or increasing to a plateau. The assumption that increasing dose always leads to increasing efficacy, while appropriate for cytotoxic agents, is therefore not reasonable for such an agent, and one may want to consider a dose-escalation design based not only on the occurrence of toxicity but also on other endpoints. In a phase I trial with such an agent, an appropriate goal is to estimate the biologically optimal dose (BOD); that is, the dose that has maximal efficacy with acceptable toxicity (7).
As such, the trial must incorporate both toxicity and efficacy as endpoints while estimating the BOD. Several endpoints that could serve as efficacy endpoints to supplement toxicity for trials of cytostatic/targeted therapies are discussed below. Any such endpoint should have a reproducible assay available and, optimally, should be shown to correlate with a clinical endpoint, such as tumor response, in humans or, at a minimum, in animal models. An ideal efficacy endpoint in this setting would be one that measures the effect of the agent on its molecular target. In practice, however, several challenges exist for such an endpoint. These include that a reliable assay for measurement of the drug effect must be available, that the relevant tissue in which to measure target inhibition must be readily available, that serial tumor sampling is usually invasive
and associated with sampling error, and that at this early stage of a drug's development it may be difficult to define the appropriate measure of achieved target effect for a specific drug. The issue of ready availability of tissue may be addressed by restricting enrollment to patients with accessible disease for assessment of the drug effect on the tumor, but this may severely restrict the number of patients eligible for the trial. Other possible measurements of an agent's activity, and thus potential endpoints, include pharmacokinetic analysis, which would be appropriate if sufficient preclinical data exist demonstrating a convincing pharmacokinetic-pharmacodynamic relationship. More specifically, an endpoint assessing whether the minimum effective blood concentration of the agent has been attained could be considered. Again, such an endpoint would require preclinical data demonstrating that the target blood or serum level correlates with clinically relevant efficacy. To date, few published phase I trials have used nontoxicity endpoints such as those we have described, for both biologic and practical reasons. First, at this point in an agent's development, it may be difficult to define the desired target biological effect. Even if the target is known and an effect level established, it may be difficult to define and validate an appropriate measurement for that endpoint. Practical difficulties in measuring target levels once they have been defined include the lack of reliable, real-time assays and the difficulty of obtaining the required tumor specimens. The real-time nature of the assay is critical: if dosing decisions are to be made based on the endpoint, turnaround must be rapid, and batch processing is not likely to be acceptable. Finally, statistical trial designs for identifying a dose that maximizes the biologic response would likely require more patients than are typically studied in phase I trials.
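The toxicity-driven paradigm described above can be made concrete with a small sketch of the classical 3+3 escalation rule, one common operationalization of the MTD definition given earlier (highest dose with at most one of six patients experiencing DLT). The function name and data layout below are illustrative, not taken from any specific protocol.

```python
def three_plus_three(dlt_counts):
    """Determine the MTD from completed 3+3 cohorts (illustrative sketch).

    dlt_counts: list of (patients_treated, patients_with_dlt) per dose
    level, ordered from the lowest dose to the highest dose tested.
    A level is considered acceptable with 0/3 or at most 1/6 DLTs;
    otherwise escalation stops and the MTD is the last acceptable level.
    Returns the MTD's index, or None if even the lowest dose was too toxic.
    """
    mtd = None
    for level, (treated, dlts) in enumerate(dlt_counts):
        if (treated == 3 and dlts == 0) or (treated == 6 and dlts <= 1):
            mtd = level  # dose acceptable so far; escalation continued
        else:
            break        # e.g., 2 or more DLTs: stop escalation here
    return mtd

# Example: 0/3 DLTs at level 0, 1/6 at level 1, 2/3 at level 2
# -> the MTD is level 1.
print(three_plus_three([(3, 0), (6, 1), (3, 2)]))  # 1
```

The sketch also makes plain the inefficiency noted in the text: each dose level consumes a cohort of three to six patients, so reaching the MTD across many levels can take many patients and a long accrual period.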
ENDPOINTS FOR PHASE II AND III TRIALS

Phase II and phase III clinical trials, as opposed to phase I trials, are designed to obtain a preliminary (phase II) or definitive (phase III) determination of a new agent's efficacy. As such, the endpoints for these trials tend to be clinical in nature, designed to directly assess the impact of a therapy on patient-relevant phenomena. The four endpoints most commonly used in phase II and phase III oncology trials are tumor response rate; progression-free survival (PFS, in the advanced disease setting) or disease-free survival (DFS, in the adjuvant setting); overall survival (OS); and quality of life (QOL). Historically, response rate has been the most common endpoint for phase II trials, and overall
survival for phase III trials. However, in the last 5 to 10 years, both PFS/DFS and OS have been used increasingly as endpoints in the phase II setting. Thus, in this section we discuss these endpoints as relevant for both phase II and phase III trials.

Response Rate

The response rate in a trial is defined as the proportion of responders (complete or partial) among all eligible patients. Since the establishment of the Response Evaluation Criteria in Solid Tumors (RECIST) in 2000 (8), this standard has become widely accepted as the preferred method to assess tumor shrinkage. Under RECIST, all measurable lesions up to a maximum of 10 are identified as target lesions, and baseline measurements for these lesions are recorded. During treatment, for target lesions, complete response (CR) is defined as the disappearance of all target lesions, and partial response (PR) as a decrease of at least 30% in the sum of the longest diameters (LD) of the target lesions, taking as reference the baseline sum LD. Progressive disease (PD) is defined as at least a 20% increase in the sum of the LD of the target lesions, taking as reference the smallest sum LD recorded since treatment started, or the appearance of one or more new lesions. Stable disease (SD) is defined as neither sufficient shrinkage to qualify for PR nor sufficient increase to qualify for PD, taking as reference the smallest sum LD since treatment started. Lesions that are not measurable per RECIST (typically because of poorly defined dimensions) are classified as nontarget lesions. For nontarget lesions, CR is defined as the disappearance of all nontarget lesions; incomplete response/stable disease (SD) as the persistence of one or more nontarget lesions and/or maintenance of tumor marker levels above the normal limits; and PD as the appearance of one or more new lesions and/or unequivocal progression of existing nontarget lesions.
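The target-lesion rules just described reduce to two percentage thresholds, which can be expressed as a short function. This is an illustrative sketch (the function and argument names are invented), not a substitute for full RECIST assessment, which also covers nontarget and new lesions and requires confirmation of responses.

```python
def classify_target_lesions(baseline_sum_ld, current_sum_ld, smallest_sum_ld):
    """Classify target-lesion response per the RECIST rules in the text.

    baseline_sum_ld : sum of longest diameters (LD) at baseline
    current_sum_ld  : sum of LDs at the current assessment
    smallest_sum_ld : smallest sum of LDs recorded since treatment started
    """
    if current_sum_ld == 0:
        return "CR"  # disappearance of all target lesions
    # PD: at least a 20% increase over the smallest sum LD on study
    if current_sum_ld >= 1.20 * smallest_sum_ld:
        return "PD"
    # PR: at least a 30% decrease from the baseline sum LD
    if current_sum_ld <= 0.70 * baseline_sum_ld:
        return "PR"
    return "SD"  # neither PR-level shrinkage nor PD-level growth

print(classify_target_lesions(100, 65, 65))  # PR
print(classify_target_lesions(100, 90, 70))  # PD (90 >= 1.2 * 70)
```

Note that PD is checked before PR: a tumor that shrank 30% from baseline but has since regrown 20% from its smallest recorded sum is progressive disease, not a response.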
Under RECIST, the best overall response at the patient level is the best response recorded from the start of treatment until disease progression/recurrence, evaluated as per Table 6.1. To be assigned a status of PR or CR, changes in tumor measurements must be confirmed by a repeat assessment performed no less than 4 weeks after the criteria for response are first met. Response rate has been the major primary endpoint for phase II trials over the last 40 years. Its use as a primary endpoint has substantial biological plausibility: because tumors rarely shrink by themselves, a tumor response can be considered a clear signal of activity of a new therapy. In addition, in most solid tumors, response occurs quickly after the
initiation of therapy, most often within 3 months. As such, tumor response provides an endpoint that can be assessed rapidly, allowing a timely determination of whether an agent is sufficiently promising to warrant phase III testing.

TABLE 6.1
Patient Overall Response Based on Target Lesions, Nontarget Lesions, and New Lesions.

TARGET LESIONS   NONTARGET LESIONS        NEW LESIONS   OVERALL RESPONSE
CR               CR                       No            CR
CR               Incomplete response/SD   No            PR
PR               Non-PD                   No            PR
SD               Non-PD                   No            SD
PD               Any                      Yes or No     PD
Any              PD                       Yes or No     PD
Any              Any                      Yes           PD

In the last 5 to 10 years, the appropriateness of tumor response as a trial endpoint has been challenged (9, 10, 11). The RECIST criteria were designed primarily to assess cytotoxic agents. For drugs that may be active in slowing the disease process without consistently achieving tumor shrinkage, such as the epidermal growth factor receptor tyrosine kinase inhibitors (gefitinib and erlotinib) or the multiple new agents targeting the vascular endothelial growth factor pathway (bevacizumab and sorafenib), the tumor response endpoint does not count durable modest regressions or prolonged disease stability as activity, which we now know is an effect of those agents. For example, consider a randomized trial of bevacizumab, an anti-vascular endothelial growth factor antibody, in the setting of metastatic renal cancer (12). This randomized, double-blind, phase II trial compared placebo with bevacizumab at doses of 3 and 10 mg/kg of body weight, with time to disease progression and response rate as primary endpoints. In this trial, only 4 of 116 patients demonstrated a tumor response (all PRs), all of them in the high-dose arm. However, time to progression in the high-dose group was significantly superior to that in the placebo group, with a median TTP of 4.8 versus 2.5 months (p<0.001 by the log-rank test). The clinical benefit of bevacizumab in this trial was modest, only a few months' extension of time to progression, but the likelihood is high that this difference reflected true biologic activity, which might have been missed had response rate been the only endpoint considered.
In addition, the assessment of response rate may be complicated by factors other than the treatment intervention. For instance, in central nervous system
(CNS) tumors, there are several factors that increase the potential for false-positive and false-negative responses. The blood-brain barrier and the brain's slow debris-clearing mechanisms make brain tumors a special case. The use of steroids in CNS tumors may improve patient symptoms, maintain clinical improvement for extended periods even at low or reduced doses, and decrease the size of some malignant gliomas as assessed by computed tomography (CT) scan. Because these benefits mimic treatment effects, a neuro-oncologist's assessment of a tumor's response to concurrent therapies could be a false positive. Further, distinguishing tumor shrinkage or growth from surgical changes and radiation effects is challenging. Delayed radiation effects may mimic tumor progression, and this pseudoprogression can confound the interpretation of clinical trial results. These effects may subside spontaneously over time without intervention, but at present there is no demonstrated technology that can discern late radiation changes from tumor progression. Radiation effects may thus lead the investigator to conclude that the intervention is either ineffective or effective when in reality the patient may not have experienced any impact of the intervention whatsoever. Other factors mimicking response, and thus contributing to false-positive responses, include spontaneous resolution of postoperative or hemorrhage-related CT enhancement, spontaneous resolution of CT-enhancing radiation effects, and apparent CT improvement due to changes in scanning technique (13, 14). Finally, imaging technique may also affect response rate as an endpoint. If patients undergo different types of imaging on different scans, such as CT at baseline but magnetic resonance imaging (MRI) at follow-up, interpretation of results may be difficult. Interobserver variability in scan interpretation can also be substantial (15–17).
Despite these shortcomings, many of which have been known for years, response rate has been
considered the primary endpoint of phase II trials, primarily because tumor shrinkage was believed to be linked to the drug's biological activity, and drugs that induced tumor response were felt to be those most likely to lead to improved survival or decreased symptoms. Indeed, response rate continues to be used, with some success, as a primary endpoint for some phase II trials of targeted agents (18). However, recent research in many diseases has shown that an improved response rate may not translate into a survival improvement, and a significant association (or correlation) between response and progression (or survival) does not imply that response is a viable endpoint for chemotherapy trials in the current era. For example, in colorectal cancer, the meta-analysis conducted by Buyse et al. (10) showed that "an increase in tumor response rate translates into an increase in overall survival for patients with advanced colorectal cancer, but in the context of individual trials, knowledge that a treatment has benefits on tumor response does not allow accurate prediction of the ultimate benefit of survival." Sekine et al. (19) noted that response rate was significantly associated with median survival time in phase II trials evaluating new cytotoxic agents in non-small-cell lung cancer, but they were concerned that "the apparently positive correlation between tumor response and the increased survival is in part a reflection of the fact that patients who enjoy greater survival have greater opportunity to respond," and they recommended "the use of other parameters or endpoints that directly indicate patient survival." At the present time, the use of response rate as a phase II endpoint is trial specific. For a trial of an agent thought to be cytotoxic, where a rapid initial assessment of activity is needed, tumor response may be an appropriate primary endpoint.
However, as discussed in the next section, progression-free survival allows a possibly more complete assessment of an agent's potential to improve patient outcome in a phase III trial. With the possible exception of highly symptomatic tumors, where tumor shrinkage produces meaningful symptomatic benefit, response rate is not appropriate as a primary endpoint for phase III trials, because of (a) inherent measurement issues, (b) unclear clinical relevance, and (c) poor prediction of ultimate survival benefit.

Progression-Free Survival/Disease-Free Survival

Progression-free and disease-free survival (PFS/DFS) are time-related endpoints that in principle measure the time from initiation of therapy to disease worsening. In a patient with existing disease (for example, with
unresectable metastases), the event of interest for this endpoint is growth or progression of disease; thus the term PFS is used to refer to the time to disease worsening in the advanced disease setting. In a patient with a complete surgical resection (as judged by standard surgical and pathological parameters), the event of interest is recurrence of disease; thus in the adjuvant setting the term is DFS, measuring the time from surgery (or start of chemotherapy, or randomization, depending on the study) to disease recurrence. PFS/DFS, and the OS endpoint discussed below, are known as time-to-event endpoints. In a typical clinical trial, there is an accrual period plus an additional follow-up period prior to analysis of the data. At the time of the final analysis, some patients will have had the event of interest (a progression, for example), while some patients will remain event free. For those who remain event free, the total observation time will vary, depending on when in the accrual period the patient was registered to the trial. The actual event time for these patients is unknown, but it is known that they were event free from registration until the date of their last known contact. For this reason, the Kaplan-Meier method (product-limit estimate) (20) is the most common method used to estimate time-to-event endpoints in clinical trials. For both PFS and DFS, a death from any cause is considered an event, to remove any possible bias due to unknown disease status at death. Currently, PFS is increasingly being used as the primary endpoint for both phase II and phase III trials in advanced disease, and DFS is increasingly being used as a primary endpoint in adjuvant phase III trials. Here we discuss advantages of and challenges with these endpoints. Because PFS requires a consistent definition of what constitutes disease progression, this endpoint, like response rate, does depend on imaging techniques.
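The product-limit computation mentioned above is straightforward to sketch for right-censored data. The minimal pure-Python version below is illustrative rather than a validated implementation; it follows the usual convention that, at tied times, events are counted before censorings.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times  : observed time (event or censoring) for each patient
    events : 1 if the event (e.g., progression or death) was observed,
             0 if the patient was censored at that time
    Returns a list of (event_time, survival_probability) steps.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t)             # all leaving risk set at t
        if d > 0:
            surv *= 1.0 - d / n_at_risk  # multiply by conditional survival
            steps.append((t, surv))
        n_at_risk -= c
        i += c
    return steps

# four patients: event at t=1, censored at t=2, events at t=2 and t=3
print(kaplan_meier([1, 2, 2, 3], [1, 0, 1, 1]))
```

Censored patients contribute to the risk set up to their last known contact but never trigger a downward step, which is exactly how the method uses the partial information described in the text.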
However, the imaging subtleties for PFS are much less problematic than for the response endpoint, which requires agreement on definitions of what constitutes clinically important tumor shrinkage as well as what constitutes tumor progression. A more important advantage of PFS and DFS is that these endpoints are not affected by therapies administered subsequent to the therapy under study. In disease settings in which clearly beneficial therapies may be administered after the interventional therapy, PFS/DFS provide an assessment that can isolate the effect of the specific intervention being tested. In contrast, overall survival is affected both by the therapy in the specific trial and by any therapies the patient may subsequently receive (21). The clinical relevance of PFS/DFS as an endpoint has been the subject of debate. Indeed, detection of
progression or recurrence is typically based on radiologic measurements, as opposed to an increase in patient symptoms, and thus may not be considered directly relevant to patient well-being. However, based on current knowledge, progression or recurrence is clearly a signal of increasing disease burden at the patient level, and it also results in substantial additional, most likely toxic, therapy. Based on these factors, the Food and Drug Administration (FDA) has recently approved multiple agents (e.g., bevacizumab in breast cancer [22], sorafenib in renal cancer [23], and gemcitabine in second-line ovarian cancer [24]) based on an improvement in PFS in spite of no improvement in OS. In the setting of adjuvant colon cancer, the Oncologic Drugs Advisory Committee (ODAC) of the FDA voted unanimously that DFS was an endpoint of clinical relevance regardless of its surrogacy with OS (25). As might be expected, PFS/DFS as an endpoint presents multiple challenges. First, patients with advanced disease are often not followed for progression after being taken off treatment because of toxicity or refusal. As such, the specific rules for censoring patients for PFS and DFS when a new treatment is started prior to the PFS or DFS event must be specified in the protocol. Even in the absence of a new therapy, some patients stop treatment and/or follow-up because of symptomatic worsening, with no radiologic confirmation of progression; the method of analysis for such clinical progressions must also be prospectively specified. Additional items that must be considered when using a PFS or DFS endpoint include the necessity that the assessment frequency be identical between treatment arms (26) and the possible need for confirmation of tumor progression by an independent radiology review (27). Such an independent review might increase confidence in a PFS endpoint, but it may also complicate the statistical analysis because of informative censoring (28).
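As a sketch of how such prespecified rules turn patient records into analyzable data, the function below derives a (time, event) pair for PFS under one common convention: the event is the earlier of documented progression or death from any cause, and patients with neither are censored at their last event-free assessment. Both the names and the convention itself are illustrative; a real protocol must spell out its own censoring rules (for example, for patients who start a new therapy before progression).

```python
def pfs_outcome(prog_time, death_time, last_assessment):
    """Derive a (time, event_indicator) pair for PFS (illustrative only).

    prog_time       : documented progression time, or None
    death_time      : death time, or None
    last_assessment : time of the last progression-free disease assessment
    """
    observed = [t for t in (prog_time, death_time) if t is not None]
    if observed:
        return (min(observed), 1)  # progression or death, whichever first
    return (last_assessment, 0)    # censored: event free at last contact

# three hypothetical patients (times in months)
print(pfs_outcome(6.0, 9.0, 5.5))    # progressed at 6, then died at 9
print(pfs_outcome(None, 4.0, 3.0))   # died without documented progression
print(pfs_outcome(None, None, 12.0)) # alive and progression free
```

Counting death without documented progression as an event reflects the rule stated earlier that, for both PFS and DFS, death from any cause is an event, removing bias from unknown disease status at death.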
These issues can have major consequences; see Saltz et al. (29) for an example in which seemingly subtle differences in censoring definitions or scan timings had a major impact on trial conclusions. These and other examples demonstrate the need for careful sensitivity analyses when PFS or DFS is used as a primary endpoint. Finally, the clinical relevance of minor PFS differences remains questionable.

Overall Survival

Overall survival (OS) is defined as the time from study registration to death from any cause. OS has historically been the primary outcome for most phase III trials and some phase II trials. The virtues of OS as a clinical trial endpoint are clear: OS is simple to measure, unambiguous, and of unquestionable
clinical relevance. It is the endpoint least susceptible to investigator bias. It is also noninvasive and cost-effective to collect in a trial database (requiring no special scans), and it can often be determined through searches of public databases even if a patient does not return to the trial institution. In spite of these many advantages, modern phase III clinical trials are increasingly moving away from OS as a primary endpoint. Because mortality occurs after a relatively long time in many disease settings, statistically significant differences in OS require large numbers of patients and several years to detect reliably. For example, in a recently reported trial in which women with metastatic breast cancer were randomized to receive bevacizumab or not in addition to first-line paclitaxel, the survival data for the trial population were not mature until nearly 8 years after the trial opened, a full 2 years after the initial results, based on a PFS endpoint, were made public (22). For the reasons discussed previously regarding the advantages of PFS/DFS, OS may also be insensitive to a true effect of a new therapy. For many new agents, patients assigned to the control arm are either allowed to cross over to the investigational agent or receive the investigational agent off-study upon disease progression or recurrence. Such crossover dilutes the difference between arms in the effect of an agent on OS, since patients in both arms will receive the new agent at some point. Although a prohibition of such crossover is scientifically appealing, ethical considerations often preclude it, particularly if earlier trials have found the agent to be promising or the drug is approved in later-line settings. For example, consider the case of two trials of similar drugs, cetuximab (31) and panitumumab (30), in last-line colorectal cancer. In both trials, the agent under study provided a highly significant improvement in patient PFS.
In the trial that allowed patients to cross over from best supportive care to active drug after disease progression, 75% did cross over, and no survival improvement was detected (30). In the other trial no crossover was allowed, and the PFS advantage translated into a significant survival advantage (31). OS is further challenged as an endpoint because a prolongation of OS due to a new agent may be more difficult to detect now than in the past, as a result of the beneficial effects of the many antineoplastic therapies now available. In breast cancer, for example, OS for patients with metastatic disease has increased substantially over the last two decades: in a retrospective population-based analysis of patients with metastatic breast cancer treated in British Columbia, OS improved from approximately 14.5 months for those diagnosed in 1991 to
approximately 22 months for women diagnosed in 2001 (32). As median survival continues to increase, the sample size necessary to conclusively identify the same absolute improvement in survival increases considerably, because the same 3-month absolute improvement yields a hazard ratio closer to unity as the control-arm survival increases. Additional challenges with OS as an endpoint arise from the long survival of some cancer patients. While OS is generally the most easily assessed endpoint, high uncertainty and/or bias can result when OS status is unknown for many patients. Patients on advanced disease studies may go elsewhere if they are not doing well; the survival estimate will thus be biased high if patients are lost to follow-up because they are beginning to fail. Patients on early stage trials who remain disease free for many years may not return to the clinic for appointments; in this case, the survival estimate may be biased low. If most patients are still alive at the time of analysis, survival estimates can be highly variable, or not even defined. Finally, if many patients die of causes other than the cancer under study, the interpretation of survival can be problematic, since the effect of treatment on the disease under study is of primary interest. Despite these limitations, because of the simplicity and clear relevance of the OS endpoint, OS should be the preferred endpoint for phase III trials when feasible. In addition, in phase II trials where death occurs quickly and imaging is problematic, OS may be an appropriate endpoint; OS has been used, for example, as an endpoint in recent phase II trials in advanced pancreatic cancer and glioblastoma (33, 34).

Quality of Life

Quality of life (QOL), typically measured through patient self-assessment, is an important outcome in the evaluation of the net benefit of antineoplastic therapies, particularly when the impact of treatment on survival is expected to be small.
QOL has mainly been included as a secondary endpoint in randomized phase III studies. QOL evaluations are not generally included in single-arm phase II trials, primarily because of their relatively small sample sizes and the absence of comparison groups. On the other hand, in a randomized phase II study that, if successful, will lead to a phase III trial in which QOL will be regarded as an important outcome, QOL assessment should be included during the phase II study to assess feasibility, to help choose the most appropriate instrument, and to obtain a preliminary estimate of effect size. Extensive documentation on the use of QOL as an endpoint in cancer trials has recently been published (35).
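Returning briefly to overall survival: the earlier point that the same 3-month absolute median improvement yields a hazard ratio closer to unity as the control-arm median lengthens can be checked numerically. The sketch below assumes exponential survival (hazard = ln 2 / median), a simplifying assumption made only for illustration; the control medians echo the British Columbia figures cited above.

```python
import math

def hazard_ratio_for_gain(control_median, absolute_gain):
    """Hazard ratio (experimental vs. control) under exponential survival,
    where hazard = ln(2)/median, for a fixed absolute median gain.
    Algebraically this reduces to control_median / (control_median + gain)."""
    control_hazard = math.log(2) / control_median
    experimental_hazard = math.log(2) / (control_median + absolute_gain)
    return experimental_hazard / control_hazard

# the same 3-month gain on top of a 14.5- vs. a 22-month control median
print(round(hazard_ratio_for_gain(14.5, 3.0), 3))  # 0.829
print(round(hazard_ratio_for_gain(22.0, 3.0), 3))  # 0.88, closer to unity
```

Because the hazard ratio moves toward 1 as the control median grows, detecting the same absolute gain requires more events and hence more patients, which is the sample-size pressure described in the text.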
NEW ENDPOINTS FOR EFFICACY
As the science of cancer research continues to rapidly expand, novel measurements may become increasingly relevant as endpoints for all phases of cancer trials. These endpoints include measures of the functional status of tumors by imaging (positron emission tomography [PET]), measures of circulating tumor cells in blood, tumor-specific biomarkers such as PSA in prostate cancer or CA-125 in ovarian cancer, and biological measures of target inhibition by a targeted therapy. While these endpoints are promising and offer the potential to speed the clinical development of new agents, many issues must be resolved before any of them can appropriately serve as a primary endpoint. These include practical issues such as feasibility, accuracy issues such as test/retest variability, and surrogacy issues related to the ability of the endpoint to predict clinically relevant phenomena. Careful assessment of these endpoints, using the principles outlined in this chapter, should clarify their optimal use as trial endpoints.
References
1. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989;8:431–440.
2. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125(7):605–613.
3. Buyse M. When to use surrogate endpoints. In: Oncology Clinical Trials: Successful Design, Conduct, and Analysis.
4. Goldberg RM, Kaufmann SH, Atherton P, et al. A phase I study of sequential irinotecan and 5-fluorouracil/leucovorin. Ann Oncol. 2002;13:1674–1680.
5. Gelmon KA, Eisenhauer EA, Harris AL, Ratain MJ, Workman P. Anticancer agents targeting signaling molecules and cancer cell environment: challenges for drug development? J Natl Cancer Inst. 1999;91(15):1281–1287.
6. Korn EL, Arbuck SG, Pluda JM, Simon R, Kaplan RS, Christian MC. Clinical trial designs for cytostatic agents: are new approaches needed? J Clin Oncol. 2001;19:265–272.
7. Zhang W, Sargent DJ, Mandrekar S. An adaptive dose-finding design incorporating both toxicity and efficacy. Stat Med. 2006;25(14):2365–2383.
8. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors. J Natl Cancer Inst. 2000;92(3):205–216.
9. Ratain MJ. Phase II studies of modern drugs directed against new targets: if you are fazed, too, then resist RECIST. J Clin Oncol. 2004;22:4442–4445.
10. Buyse M, Thirion P, Carlson RW, Burzykowski T, Molenberghs G, Piedbois P. Relation between tumour response to first-line chemotherapy and survival in advanced colorectal cancer: a meta-analysis. Lancet. 2000;356:373–378.
11. Dowlati A, Fu P. Is response rate relevant to the phase II trial design of targeted agents? J Clin Oncol. 2008;26:1204–1205.
12. Yang JC, Haworth L, Sherry RM, et al. A randomized trial of bevacizumab, an anti-vascular endothelial growth factor antibody, for metastatic renal cancer. N Engl J Med. 2003;349:427–434.
13. Macdonald DR, Cascino TL, Schold SC Jr, Cairncross JG. Response criteria for phase II studies of supratentorial malignant glioma. J Clin Oncol. 1990;8:1277–1280.
ONCOLOGY CLINICAL TRIALS
14. Rajan B, Ross G, Lim CC, et al. Survival in patients with recurrent gliomas as a measure of treatment efficacy: prognostic factors following nitrosourea chemotherapy. Eur J Cancer. 1994;30A:1809–1815.
15. Hopper KD, Kasales CJ, Van Slyke MA. Analysis of interobserver and intraobserver variability in CT tumor measurements. AJR. 1996;167:851–854.
16. Erasmus JJ, Gladish GW, Broemeling L, et al. Interobserver and intraobserver variability in measurement of non-small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol. 2003;21(13):2574–2582.
17. Schwartz LH, Ginsberg MS, DeCorato D, et al. Evaluation of tumor measurements in oncology: use of film-based and electronic techniques. J Clin Oncol. 2000;18(10):2179–2184.
18. El-Maraghi RH, Eisenhauer EA. Review of phase II trial designs used in studies of molecular targeted agents: outcomes and predictors of success in phase III. J Clin Oncol. 2008;26:1346–1354.
19. Sekine I, Kubota K, Nishiwaki Y, Sasaki Y, Tamura T, Saijo N. Response rate as an endpoint for evaluating new cytotoxic agents in phase II trials of non-small-cell lung cancer. Ann Oncol. 1998;9:1079–1084.
20. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481.
21. Sargent DJ, Hayes DF. Assessing the measure of a new drug: is survival the only thing that matters? J Clin Oncol. 2008;26(12):1922–1923.
22. Miller K, Wang M, Gralow J, et al. Paclitaxel plus bevacizumab versus paclitaxel alone for metastatic breast cancer. N Engl J Med. 2007;357:2666–2676.
23. Escudier B, Eisen T, Stadler WM, et al. N Engl J Med. 2007;357(2):203.
24. Pfisterer J, Plante M, Vergote I, et al. Gemcitabine plus carboplatin compared with carboplatin in patients with platinum-sensitive recurrent ovarian cancer: an intergroup trial of the AGO-OVAR, the NCIC CTG, and the EORTC GCG. J Clin Oncol. 2006;24(29):4699–4707.
25. http://www.fda.gov/ohrms/dockets/ac/04/transcripts/4037T2.htm.
26. Panageas KS, Ben-Porat L, Dickler MN, Chapman PB, Schrag D. When you look matters: the effect of assessment schedule on progression-free survival. J Natl Cancer Inst. 2007;99:428–432.
27. Food and Drug Administration. Guidance for industry: clinical trial endpoints for the approval of cancer drugs and biologics. (Accessed October 29, 2009, from http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM071590.pdf)
28. Dodd LE, Korn EL, Freidlin B, et al. Blinded independent central review of progression-free survival in phase III clinical trials: important design element or unnecessary expense? J Clin Oncol. 2008;26(22):3791–3796.
29. Saltz LB, Clarke S, Diaz-Rubio E, et al. Bevacizumab in combination with oxaliplatin-based chemotherapy as first-line therapy in metastatic colorectal cancer: a randomized phase III study. J Clin Oncol. 2008;26(12):2013–2019.
30. Van Cutsem E, Peeters M, Siena S, et al. Open-label phase III trial of panitumumab plus best supportive care compared with best supportive care alone in patients with chemotherapy-refractory metastatic colorectal cancer. J Clin Oncol. 2007;25:1658–1664.
31. Jonker DJ, O’Callaghan CJ, Karapetis CS, et al. Cetuximab for the treatment of colorectal cancer. N Engl J Med. 2007;357:2040–2048.
32. Chia SK, Speers CH, D’yachkova Y, et al. The impact of new chemotherapeutic and hormone agents on survival in a population-based cohort of women with metastatic breast cancer. Cancer. 2007;110(5):973–979.
33. Alberts SR, Foster NR, Morton RF, et al. PS-341 and gemcitabine in patients with metastatic pancreatic adenocarcinoma: a North Central Cancer Treatment Group (NCCTG) randomized phase II study. Ann Oncol. 2005;16(10):1654–1661.
34. Ballman KV, Buckner JC, Brown PD, et al. The relationship between six-month progression-free survival and 12-month overall survival end points for phase II trials in patients with glioblastoma multiforme. Neuro Oncol. 2007;9(1):29–38.
35. Sloan JA, Halyard MY, Frost MH, et al. The Mayo Clinic manuscript series relative to the discussion, dissemination, and operationalization of the Food and Drug Administration guidance on patient-reported outcomes. Value Health. 2007;10(suppl 2):S59–S63.
7
Design, Testing, and Estimation in Clinical Trials
Barry Kurt Moser
After a cancer clinical trial has been well designed, hypothesis testing and parameter estimation are the two most important statistical aspects of the study. Within this chapter several clinical trial designs and test approaches are discussed, including a one-stage binomial design to test the proportion of one population, Simon’s (1) two-stage optimal design to test the proportion of one population, a design to test the difference in means from two correlated normally distributed populations, and optimal designs to test the differences in time-to-event survival functions from two independent populations (2, 3). Commonly used estimation procedures are then highlighted, including: a Bayesian posterior distribution to estimate the mean of a binomial population; a Bayesian credible region on a population proportion (4); a maximum likelihood binomial estimate of a population proportion; a standard normal approximation of a 100(1 − α)% confidence interval on a population proportion; mean difference, variance, and correlation estimates from two correlated normally distributed populations (5); a confidence interval on the mean difference from two correlated normally distributed populations; a log-rank test for comparing the survival curves from two independent populations (6); and confidence intervals on a survival function.
DESIGN, HYPOTHESIS TESTS, AND ESTIMATION PROCEDURES
The single most important aspect of a clinical trial is the design stage. If the trial is well designed to address the questions of interest, the study has the best chance of producing interpretable and meaningful results. Note that the trial should be fully designed before any patients are registered to the trial, and thus before any analysis is performed on observed data. The results of a poorly designed trial are open to criticism due to such problems as bias, poorly defined hypotheses, sample size and power inadequacies, confounding, and other difficulties. A sound design promotes (a) well-chosen and well-worded objectives, (b) careful selection of endpoints, (c) hypotheses that directly address the objectives, (d) properly selected hypothesis testing and estimation procedures, (e) sample size calculations that provide sufficient power for the α-level hypothesis test decisions, and, when needed, (f) randomization schemes that help clearly answer the objectives. In the next segments of this chapter three phase II and two phase III cancer clinical studies are discussed. Appropriate trial designs are formulated, hypothesis tests are developed and illustrated, and estimation procedures are explained numerically.
EXAMPLE 7-1
A randomized phase II trial of eicosanoid pathway modulators and cytotoxic chemotherapy in advanced non-small-cell lung cancer (7).
Discussion
The primary objective of the trial was to evaluate the efficacy of carboplatin/gemcitabine with one or two modulators of eicosanoid metabolism (celecoxib, zileuton, or both) for the treatment of patients with advanced non-small-cell lung cancer. The primary endpoint was the proportion of patients alive without disease progression 9 months after treatment initiation (failure-free survival). Although patients in this trial were randomized to the three arms, the objective of the study was not to compare the three arms; rather, each treatment was examined separately to see if it merits inclusion in a larger phase III trial.
Design and Hypothesis Test
Since the three treatment arms are analyzed separately, the following discussion applies to any of them. If the failure-free survival rate at 9 months was less than 30%, there would be little interest in developing the treatment regimen further. However, if the failure-free survival rate at 9 months was 50% or greater, then future treatment development would be of interest. Formally, this comparison can be written as the hypotheses H0 : p ≤ p0 versus H1 : p ≥ p1, where p is the failure-free survival rate of the population at 9 months for any of the treatment arms, p0 = 0.30 and p1 = 0.50. The objective is to define a design (a test criterion and sample size) that produces a type I error rate less than or equal to α and a type II error rate less than or equal to β (α = β = 0.10 in this trial), where the type I error is the probability of incorrectly concluding that p = 0.50 when p = 0.30, and the type II error is the probability of incorrectly concluding that p = 0.30 when p = 0.50. The test criterion takes the form: if fewer than x* of the n patients achieve failure-free survival at 9 months, then the treatment regimen will not merit further study; but if x* or more of the n patients achieve failure-free survival at 9 months, then the treatment regimen will merit further study. Formally stated:

Reject H0 : p ≤ 0.30 in favor of H1 : p ≥ 0.50 if X ≥ x* (Eqn. 7-1),

where X is a random variable representing the number of failure-free survivors at 9 months in a sample of size n and x* is the critical cut-off point. Note that H0 is rejected in favor of H1, and the treatment regimen is worthy of further study, if the number of observed failure-free survivors X is large (at least the critical value x*). Formally, we seek the minimum value of n for which there exists a value of x* such that:

Type I error requirement: P(X ≥ x* | n, p = 0.3) ≤ α = 0.1, and
Type II error requirement: P(X < x* | n, p = 0.5) ≤ β = 0.1 (Eqn. 7-2).

Specifically, for increasing values of n the two probabilities in (Eqn. 7-2) are calculated over 0 ≤ x* ≤ n until the first (x*, n) pair meets both the type I and type II requirements. In this case, for n = 39, if X ≥ 16 (x* = 16 or more failure-free survivors out of 39 patients) then H0 is rejected in favor of H1 and the treatment regimen has sufficient merit to deserve further study. This test procedure produces type I and type II error rates equal to 0.094 and 0.099, respectively.
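The exhaustive search just described is easy to reproduce. The following is an illustrative sketch (our own function names, standard library only), not the software used for the trial:

```python
# Illustrative sketch of the exact one-stage design search (our function
# names; not the trial's software). Requires only the standard library.
from math import comb

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_sf(k, n, p):
    """Upper-tail probability P(X >= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def one_stage_design(p0, p1, alpha, beta, n_max=200):
    """Smallest n (with cutoff x*) meeting both requirements in (Eqn. 7-2)."""
    for n in range(1, n_max + 1):
        for x_star in range(n + 1):
            type1 = binom_sf(x_star, n, p0)      # P(X >= x* | p = p0)
            type2 = 1 - binom_sf(x_star, n, p1)  # P(X <  x* | p = p1)
            if type1 <= alpha and type2 <= beta:
                return n, x_star, type1, type2
    raise ValueError("no design found within n_max")

n, x_star, t1, t2 = one_stage_design(0.30, 0.50, 0.10, 0.10)
# Reproduces the chapter's design: n = 39, x* = 16, with exact error
# rates of approximately 0.094 and 0.099.
```

The double loop mirrors the search stated in the text: for each n, the smallest feasible cutoff x* is accepted as soon as both error requirements hold.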
Testing, Estimation, and Confidence Intervals
We examine the results of the trial for the third arm, where the regimen consists of carboplatin/gemcitabine with both modulators, celecoxib and zileuton. Although the design called for 39 patients, a total of 45 eligible patients were enrolled in the third arm. With 45 patients, the test criterion "reject H0 : p ≤ 0.30 in favor of H1 : p ≥ 0.50 if X ≥ 18" provides a type I error rate equal to 0.099 and a type II error rate equal to 0.068. After n = 45 eligible patients were enrolled and observed for 9 months, the number of failure-free survivors in the third arm was x = 11 (7). Since x = 11 < 18, the treatment regimen does not have sufficient merit to deserve further study. Using a Bayesian approach, a 100(1 − α)% credible region (9) on p can be constructed through the posterior distribution of the binomial proportion using a noninformative uniform prior distribution. Given the observed number of failure-free survivors x, the parameter p has a beta(x + 1, n − x + 1) posterior distribution, whose functional form is given by:

f(p | x, n) = [Γ(n + 2) / (Γ(x + 1)Γ(n − x + 1))] p^x (1 − p)^(n − x), for 0 ≤ x ≤ n, 0 ≤ p ≤ 1 (Eqn. 7-3).
The Bayesian estimator of p is provided by the posterior mean pˆ B = (x + 1) / (n + 2)
(Eqn. 7-4),
for n = 45 and x = 11, pˆB = 12/47. A graph of the beta(x + 1, n − x + 1) = beta(12, 35) distribution is provided in Figure 7.1. The 100(1 − α)% credible region is then the interval (pL, pU) such that:

1 − α = ∫ from pL to pU of f(p | n, x) dp (Eqn. 7-5),

where f(p | n, x) is given in (Eqn. 7-3). Using (Eqn. 7-5) for n = 45, x = 11, and α = 0.05, a 95% credible region on p is given by (pL, pU) = (0.133, 0.377), as presented in Figure 7.1. These lower and upper bounds of the credible region are constructed so that they are equidistant from the posterior mean pˆB and the area between the bounds under the posterior equals (1 − α). Alternatively, 100(1 − α)% credible region bounds can be constructed so that the area under the posterior to the right of the lower bound and to the left of the upper bound each equal α/2.

For large samples, another approach for constructing a 100(1 − α)% confidence interval on p is to use the normal approximation to the binomial distribution. For the binomial distribution, an unbiased estimate of p is given by:

pˆ = x/n (Eqn. 7-6).

A simple formula for the 100(1 − α)% confidence interval is then provided by:

pL = pˆ − z_{α/2} √(pˆ(1 − pˆ)/n), pU = pˆ + z_{α/2} √(pˆ(1 − pˆ)/n) (Eqn. 7-7),

where z_γ is the 100(1 − γ)th percentile of a standard normal distribution. Using the large-sample normal approximation from (Eqn. 7-6) and (Eqn. 7-7) with n = 45, pˆ = 11/45, and α = 0.05, a 95% confidence interval on p is given by:

pL = 11/45 − 1.96 √((11/45)[1 − (11/45)]/45) = 0.119, and pU = 11/45 + 1.96 √((11/45)[1 − (11/45)]/45) = 0.370 (Eqn. 7-8).

FIGURE 7.1
A credible region on the parameter p = the proportion of failure-free survivors from a beta(12, 35) distribution. [Figure: the beta(12, 35) posterior density of p, with the shaded area between the credible bounds equal to 0.95 and posterior mean 0.255.]
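The interval calculations in this example can be sketched as follows (an illustrative, standard-library-only implementation of our own; for integer shape parameters the beta CDF reduces to a binomial tail, so no incomplete-beta routine is required):

```python
# Illustrative, standard-library-only sketch (our own function names) of the
# Example 7-1 intervals. For integer shape parameters the beta CDF equals a
# binomial tail probability, avoiding any incomplete-beta dependency.
from math import comb, sqrt

def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for integer a, b, via the binomial-tail identity."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def credible_region(x, n, coverage=0.95, iters=60):
    """Bounds equidistant from the posterior mean of Beta(x+1, n-x+1), with
    the enclosed posterior area equal to `coverage` (Eqn. 7-5)."""
    a, b = x + 1, n - x + 1
    mean = a / (a + b)
    lo, hi = 0.0, min(mean, 1.0 - mean)  # bisect on the half-width d
    for _ in range(iters):
        d = (lo + hi) / 2
        area = beta_cdf(mean + d, a, b) - beta_cdf(mean - d, a, b)
        lo, hi = (d, hi) if area < coverage else (lo, d)
    return mean - d, mean + d

def wald_ci(x, n, z=1.96):
    """Large-sample normal-approximation interval (Eqn. 7-7)."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

pL, pU = credible_region(11, 45)  # ~(0.133, 0.377), as in the text
wL, wU = wald_ci(11, 45)          # ~(0.119, 0.370), as in (Eqn. 7-8)
```

The bisection exploits the fact that the enclosed area increases monotonically in the half-width d, so the equidistant bounds of (Eqn. 7-5) are found without any root-finding library.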
EXAMPLE 7-2
A nonrandomized phase II trial of gemtuzumab ozogamicin (Mylotarg) with high-dose cytarabine in patients with refractory or relapsed acute myeloid leukemia (AML).

Discussion
The primary objective of the trial was to determine the adequacy of the rate of complete remission with complete platelet recovery (CR) or without complete platelet recovery (CRp) when gemtuzumab ozogamicin was administered in two weekly doses following a standard high-dose cytarabine course in patients with refractory or relapsed AML. The primary endpoint was the CR + CRp rate.

Design and Hypothesis Test
If the complete remission rate with or without platelet recovery was less than 15%, there would be little interest in developing the treatment regimen further. However, if the complete remission rate with or without platelet recovery was 35% or greater, then future treatment development would be of interest. Formally, this comparison can be written as the hypotheses H0 : p ≤ p0 versus H1 : p ≥ p1, where p is the complete remission rate with or without platelet recovery, p0 = 0.15 and p1 = 0.35. In Example 7-1, a test criterion was developed for a similar hypothesis using one sample of size n; that is, all n patients had to be observed before an efficacy decision on the treatment could be made. However, historically, most drug regimens tested in a phase II setting fail to demonstrate sufficient efficacy to merit further study. For this reason, Simon (1) developed a two-stage phase II procedure whereby the trial can be terminated after the first stage if early results indicate that the treatment is not worthy of further study, thereby eliminating the need to administer an ineffective treatment regimen to all n patients before reaching a conclusion. The two-stage test criterion takes the form:
1. Observe n1 patients and terminate the trial if r1 or fewer responses occur;
2. If more than r1 of the first n1 patients achieve a response, observe n2 more patients (n = n1 + n2). If r or fewer responses occur out of the total sample of n patients, then the treatment regimen does not merit further study; but if more than r of the n patients achieve a response, then the treatment regimen merits further study.

The optimal design is chosen to minimize the expected sample size when the response rate equals p0, for specified values of p0, p1, α, β. The probability of terminating the trial early (PET) given a response rate p is:

PET = Σ (i = 0 to r1) C(n1, i) p^i (1 − p)^(n1 − i) (Eqn. 7-9).

Therefore, the expected sample size is

EN = n1·PET + (1 − PET)·n (Eqn. 7-10).

For a given test criterion, to calculate the expected sample size when the response rate equals p0, simply replace p by p0 in (Eqn. 7-9) to calculate PET and then calculate EN as a function of PET, n1, and n in (Eqn. 7-10). Simon’s optimal two-stage procedure cannot be derived in closed form, but must be calculated iteratively by computer. A Web site that provides a program to calculate the optimal design as a function of α, β, p0, p1 is http://linus.nci.nih.gov/brb/samplesize/otsd.html. For Example 7-2, with p0 = 0.15, p1 = 0.35, α = β = 0.1, the optimal Simon two-stage procedure is:

1. Observe n1 = 19 patients and terminate the trial if r1 = 3 or fewer responses occur;
2. If more than r1 = 3 of the first n1 = 19 patients achieve responses, observe n2 = 14 more patients (n = 19 + 14 = 33). If r = 7 or fewer responses occur out of the total sample of n = 33 patients, then the treatment regimen does not merit further study; but if more than r = 7 of the n = 33 patients achieve a response, then the treatment regimen merits further study.

With r1 = 3, n1 = 19, and p = p0 = 0.15, from (Eqn. 7-9) and (Eqn. 7-10) the probability of terminating the trial early and the expected sample size are PET = 0.684 and EN = 19 + (1 − 0.684)(14) = 23.4. The exact type I and type II error rates for this design are 0.096 and 0.096, respectively.

Simply for comparison purposes, a one-stage design to test H0 : p ≤ 0.15 versus H1 : p ≥ 0.35 with α = β = 0.1 requires a sample of n = 32: if 7 or fewer responses occur out of the total sample of n = 32 patients, then the treatment regimen does not merit further study, but if 8 or more of the n = 32 patients achieve a response, then the treatment regimen merits further study. The exact type I and type II error rates for this one-stage design are 0.096 and 0.082, respectively. Note that in this case the one-stage and two-stage designs require almost the same total sample size (32 and 33). The advantage of the one-stage design is that it provides a slightly smaller type II error (i.e., slightly higher power) than the two-stage design. However, if H0 is true, then the two-stage design is more efficient than the one-stage design: it stops early, correctly concluding that the treatment does not merit further study, with probability 0.684, resulting in an expected sample size of 23.4 patients versus the 32 patients required by the one-stage design.
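The operating characteristics quoted above (PET, EN, and the exact error rates) can be checked with a short script. This is an illustrative sketch with our own helper names, not Simon's program:

```python
# Sketch of the Simon two-stage operating characteristics for Example 7-2
# (r1 = 3, n1 = 19, r = 7, n = 33). Our helper names; standard library only.
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pet(r1, n1, p):
    """Probability of early termination (Eqn. 7-9): P(X1 <= r1)."""
    return sum(pmf(k, n1, p) for k in range(r1 + 1))

def expected_n(r1, n1, n, p):
    """Expected sample size (Eqn. 7-10): n1*PET + (1 - PET)*n."""
    q = pet(r1, n1, p)
    return n1 * q + (1 - q) * n

def reject_prob(r1, n1, r, n, p):
    """P(reject H0) = P(X1 > r1 and X1 + X2 > r) at response rate p."""
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        lo = max(r - x1 + 1, 0)  # need X2 >= r - x1 + 1 in the second stage
        tail = sum(pmf(x2, n2, p) for x2 in range(lo, n2 + 1))
        total += pmf(x1, n1, p) * tail
    return total

alpha = reject_prob(3, 19, 7, 33, 0.15)     # exact type I error, ~0.096
beta = 1 - reject_prob(3, 19, 7, 33, 0.35)  # exact type II error, ~0.096
# pet(3, 19, 0.15) ~ 0.684; expected_n(3, 19, 33, 0.15) ~ 23.4
```

Because the second stage is only reached when X1 > r1, the rejection probability sums first-stage outcomes against the matching second-stage binomial tail, exactly as the two-stage criterion states.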
Testing, Estimation, and Confidence Intervals
During the course of the two-stage trial, 6 of the first 19 eligible patients achieved complete remission with complete platelet recovery (CR) or complete remission without complete platelet recovery (CRp). Therefore, the trial continued, and 14 more eligible patients were enrolled. Of those next 14 patients, 4 achieved CR or CRp, for a total of 10 (= 6 + 4) responders out of 33 patients (8). Since more than r = 7 of the 33 patients responded, according to the original test criterion the treatment regimen merits further study. From (Eqn. 7-4), a Bayesian estimate of the proportion of CR + CRp patients in the population is pˆB = (x + 1)/(n + 2) = 11/35. Then from (Eqn. 7-5), for n = 33, x = 10, and α = 0.05, a 95% credible region on p is given by (pL, pU) = (0.146, 0.464). Using the large-sample normal approximation from (Eqn. 7-7) with pˆ = 10/33 and α = 0.05, a 95% confidence interval on p is given by (pL, pU) = (0.146, 0.460).

EXAMPLE 7-3
A phase II trial of thalidomide for patients with relapsed or refractory low-grade non-Hodgkin’s lymphoma.
Discussion
A secondary objective of the trial was to determine the effect of thalidomide on microvessel density. Antiangiogenesis effects were measured by the mean change in microvessel density between pre- and post-treatment time points.
Design and Hypothesis Test
Antiangiogenesis effects were estimated by the mean observed change in microvessel density between pre- and post-treatment (6 months) time points on n = 45 patients. Based on preliminary data, a standard deviation of σ = 17.5 microvessels per mm² was anticipated. Since the microvessel data were observed at pre and post times on each patient, it was assumed that the 45 pairs of pre and post measurements were equally correlated, with correlation denoted by ρ. The population pre- and post-treatment microvessel density means are denoted by μ1 and μ2. It was therefore of interest to perform a two-sided α-level test of the hypothesis that no detectable mean difference exists between the pre and post microvessel densities against the alternative that a detectable difference does exist. Formally, this comparison can be written as the hypotheses H0 : μ1 − μ2 = 0 versus H1 : μ1 − μ2 ≠ 0. For any values of ρ, σ, n, the two-sided α-level test criterion takes the form:

Reject H0 : μ1 − μ2 = 0 in favor of H1 : μ1 − μ2 ≠ 0 if |ȳ1 − ȳ2| ≥ z_{α/2} σˆ √(2(1 − ρˆ)/n) (Eqn. 7-11),

where ȳ1 and ȳ2 are the observed mean pre and post microvessel densities for the n patients. If it is of interest to provide a p-value rather than an α-level test criterion, the p-value of the two-sided test is given by:

p-value = 2P[ N(0, 1) > |ȳ1 − ȳ2| / (σˆ √(2(1 − ρˆ)/n)) ] (Eqn. 7-12),

or equivalently,
TABLE 7.1
Powers Associated with the Two-Sided α-Level Test Procedure (Eqn. 7-11) for Various Mean Detectable Differences δ and Correlations ρ, Given α = 0.05, n = 45, and σ = 17.5.

                        ABSOLUTE MEAN DETECTABLE DIFFERENCE δ
CORRELATION ρ        5        7        9        11
0                  0.273    0.475    0.684    0.858
0.1                0.298    0.516    0.730    0.989
0.5                0.483    0.765    0.932    >0.999
The sample size of n = 45 for this trial was set to accommodate the power calculations of the primary objective of the study, not the power calculations of the microvessel density analysis, which was a secondary objective. However, by solving (Eqn. 7-13) for n, it is possible to calculate the sample size required to attain a specified power for an α-level two-sided test of H0 : μ1 − μ2 = 0 versus H1 : μ1 − μ2 ≠ 0 for a specified correlation ρ. The resulting equation is:

n = 2σ²(1 − ρ)(z_{α/2} + z_β)² / δ² (Eqn. 7-14),

where the specified power equals 1 − β. From (Eqn. 7-14), for an α = 0.05 two-sided test of H0 : μ1 − μ2 = 0 versus H1 : μ1 − μ2 ≠ 0 with σ = 17.5, the sample size required to achieve a power of 0.80 is provided in Table 7.2 for various absolute mean detectable differences δ and correlations ρ.
p-value = P[ F(1, n − 1) > MS(Pre/Post) / MS(Patient×Pre/Post) ],

where MS(Pre/Post) and MS(Patient×Pre/Post) are the analysis of variance (ANOVA) mean square due to the pre/post effect and the mean square due to the interaction of patients with the pre/post effect, respectively. For any detectable absolute difference under the alternative hypothesis |μ1 − μ2| = δ > 0, the power of the (Eqn. 7-11) procedure is given by:

power = 1 − P[ N(0, 1) < z_{α/2} − δ / (σ √(2(1 − ρ)/n)) ] (Eqn. 7-13).

Table 7.1 provides powers for various ρ and δ given α = 0.05, n = 45, and σ = 17.5.
TABLE 7.2
Sample Sizes (n) Required to Attain a Power of 0.80 with the Two-Sided Hypothesis Test (Eqn. 7-12) for Various Mean Detectable Differences δ and Correlations ρ, Given α = 0.05 and σ = 17.5.

                        ABSOLUTE MEAN DETECTABLE DIFFERENCE δ
CORRELATION ρ        5        7        9        11
0                   192       98       59       38
0.1                 173       88       53       20
0.5                  96       49       30       12
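Representative entries of Tables 7.1 and 7.2 can be reproduced from (Eqn. 7-13) and (Eqn. 7-14); the sketch below (our helper names, standard library only) spot-checks a few cells:

```python
# Spot-check of Tables 7.1 and 7.2 from (Eqn. 7-13) and (Eqn. 7-14);
# our helper names, standard library only.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

Z_ALPHA_2 = 1.959964  # z_{0.025}
Z_BETA = 0.841621     # z_{0.20}, for power 0.80

def paired_power(delta, rho, n, sigma=17.5, z_a2=Z_ALPHA_2):
    """Power of the two-sided test (Eqn. 7-13)."""
    se = sigma * sqrt(2 * (1 - rho) / n)
    return 1 - phi(z_a2 - delta / se)

def paired_n(delta, rho, sigma=17.5, z_a2=Z_ALPHA_2, z_b=Z_BETA):
    """Number of pairs for power 1 - beta (Eqn. 7-14)."""
    return 2 * sigma**2 * (1 - rho) * (z_a2 + z_b)**2 / delta**2

# Table 7.1: paired_power(5, 0.0, 45) ~ 0.273; paired_power(9, 0.5, 45) ~ 0.932
# Table 7.2: paired_n(5, 0.0) ~ 192;           paired_n(9, 0.5) ~ 30
```

Note how strongly the within-patient correlation ρ pays off: moving from ρ = 0 to ρ = 0.5 halves the required number of pairs at every detectable difference δ.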
For a one-sided α-level test where the alternative hypothesis is either H1 : μ1 − μ2 > 0 or H1 : μ1 − μ2 < 0, equations (Eqn. 7-13) and (Eqn. 7-14) still hold provided α/2 is replaced by α and δ continues to represent the absolute difference |μ1 − μ2| = δ > 0 under the alternative hypothesis. Note that, under a normal distribution assumption, the test criterion (Eqn. 7-11), the p-value (Eqn. 7-12), the power function (Eqn. 7-13), and the sample size function (Eqn. 7-14) can also be applied to two independent samples by setting ρ = 0.

Testing, Estimation, and Confidence Intervals
Although this trial closed due to poor accrual, Table 7.3 provides simulated pre- and post-treatment microvessel density data for 45 patients. The observed mean difference between the pre- and post-treatment microvessel densities, averaged over the 45 patients, was ȳ1 − ȳ2 = 61.13 − 57.33 = 3.8. The ANOVA mean square for the patient effect with 44 degrees of freedom was 426.7, the mean square due to the pre/post time effect with 1 degree of freedom was 324.9, and the mean square due to the patient by pre/post time interaction with 44 degrees of freedom was 135.3. The mean square due to patients and the mean square due to the patient by pre/post time interaction are ANOVA estimates of σ²(1 + ρ) and σ²(1 − ρ), respectively. Setting σ²(1 + ρ) equal to 426.7 and σ²(1 − ρ) equal to 135.3 and solving for σ² and ρ, we obtain σˆ² = 281.3 (σˆ = 16.772) and ρˆ = 0.519. Therefore the p-value of the test of H0 : μ1 − μ2 = 0 versus H1 : μ1 − μ2 ≠ 0 equals:

p-value = 2P[ N(0, 1) > |3.8| / (16.772 √(2(1 − 0.519)/45)) ] = 2P[ N(0, 1) > 1.55 ] = 0.12 (Eqn. 7-15),

or equivalently,

p-value = P[ F(1, 44) > 324.9/135.3 ] = P[ F(1, 44) > 2.4 ] = 0.12.

An unbiased estimate of the mean difference μ1 − μ2 is provided by ȳ1 − ȳ2 = 3.8. Furthermore, a 100(1 − α)% confidence interval on μ1 − μ2 is given by:

ȳ1 − ȳ2 ± z_{α/2} σˆ √(2(1 − ρˆ)/n) (Eqn. 7-16).
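This analysis can be reproduced directly from the paired differences d_i = pre_i − post_i: the sample variance of the differences estimates 2σ²(1 − ρ), i.e., twice the interaction mean square, so the z statistic below equals the ANOVA-based one in (Eqn. 7-15). A standard-library-only sketch (our variable names) with the Table 7.3 data transcribed:

```python
# Sketch reproducing the Example 7-3 analysis from the paired differences
# d_i = pre_i - post_i (Table 7.3 data transcribed below). The sample variance
# of the differences estimates 2*sigma^2*(1 - rho) -- twice the interaction
# mean square -- matching the chapter's ANOVA-based z test. Standard library
# only; variable names are ours.
from math import erf, sqrt

pre = [64, 59, 52, 70, 81, 64, 28, 72, 29, 68, 67, 38, 110, 43, 57,
       47, 53, 60, 54, 56, 72, 83, 70, 67, 40, 27, 71, 70, 40, 89,
       78, 52, 71, 73, 55, 55, 57, 61, 63, 58, 49, 70, 56, 104, 48]
post = [36, 67, 72, 78, 64, 56, 42, 75, 56, 50, 66, 46, 68, 62, 39,
        13, 44, 47, 45, 99, 89, 73, 65, 64, 33, 43, 64, 57, 54, 57,
        75, 53, 64, 71, 46, 45, 47, 54, 56, 64, 53, 44, 40, 91, 53]

n = len(pre)
diffs = [a - b for a, b in zip(pre, post)]
mean_diff = sum(diffs) / n                    # ybar1 - ybar2 = 3.8
ss = sum((d - mean_diff) ** 2 for d in diffs)
var_d = ss / (n - 1)          # ~270.6 = 2 * MS(Patient x Pre/Post)
se = sqrt(var_d / n)          # standard error of the mean difference

z = mean_diff / se                                    # ~1.55
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))      # ~0.12
ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)   # ~(-1.01, 8.61)
```

Working with the differences avoids fitting the full two-way ANOVA while yielding numerically identical test results for this two-time-point design.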
For this example, using (Eqn. 7-16), 95% confidence bounds on μ1 − μ2 are given by (−1.01, 8.61).

EXAMPLE 7-4
A randomized phase III trial of gemcitabine plus bevacizumab versus gemcitabine plus placebo in patients with advanced pancreatic cancer.
Discussion
The primary objective of the trial was to determine if gemcitabine plus bevacizumab achieved superior overall survival compared to gemcitabine plus placebo in patients with advanced pancreatic cancer. As a secondary objective, the toxicity rates of the two treatment arms were compared.

Design and Hypothesis Test
To address the primary objective of comparing the overall survival of the two treatments, patients were randomized with equal probability to gemcitabine plus bevacizumab or gemcitabine plus placebo. The assumed total enrollment rate was 20 patients per month. For an α = 0.05 two-sided test, the trial was powered to distinguish a difference in the
TABLE 7.3
Pre- and Post-Treatment Microvessel Density Simulated Data on 45 Patients. Patient Pre Post
1 64 36
2 59 67
3 52 72
4 70 78
5 81 64
6 64 56
7 28 42
8 72 75
9 29 56
10 68 50
11 67 66
12 38 46
13 110 68
14 43 62
15 57 39
Patient Pre Post
16 47 13
17 53 44
18 60 47
19 54 45
20 56 99
21 72 89
22 83 73
23 70 65
24 67 64
25 40 33
26 27 43
27 71 64
28 70 57
29 40 54
30 89 57
Patient Pre Post
31 78 75
32 52 53
33 71 64
34 73 71
35 55 46
36 55 45
37 57 47
38 61 54
39 63 56
40 58 64
41 49 53
42 70 44
43 56 40
44 104 91
45 48 53
survival curves of the two treatments when the median survival of the gemcitabine plus placebo arm was 6 months and the median survival of the gemcitabine plus bevacizumab arm was 8.1 months, producing a 1.35 hazard ratio under an exponential event-time assumption. Using results from George and Desu (2) and Rubenstein et al. (3) (calculated by the computer program DSTPLAN, downloaded from http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=41), an enrollment of n = 528 patients was required, with a 26.4-month enrollment period (at 20 patients enrolled per month) and a 12-month follow-up period, to attain 90% power to detect a hazard ratio of 1.35 using a two-sided log-rank test. During the enrollment and follow-up period, 470 events were expected. The secondary objective of the trial was to test whether the toxicity rates of the two treatment arms were equal or differed by Δ or more, assuming the toxicity rate of the gemcitabine plus placebo treatment was 0.10. Formally, this comparison can be written as the hypotheses H0 : p1 − p2 = 0 versus H1 : p1 − p2 = Δ > 0, where p1, p2 are the population toxicity rates of the gemcitabine plus bevacizumab and gemcitabine plus placebo treatments, respectively. Since the sample sizes in each treatment arm are large, an approximate normal arcsine transformation procedure was used to test the hypothesis. Under this approximation, arcsin(√pˆ1) and arcsin(√pˆ2) are independent normal random variables with means arcsin(√p1), arcsin(√p2) and variances 1/(4n1), 1/(4n2), respectively, where n1, n2 are the sample sizes and pˆ1, pˆ2 are the observed toxicity rates for gemcitabine plus bevacizumab and gemcitabine plus placebo, respectively. The criterion to test this hypothesis is:

Reject H0 : p1 − p2 = 0 in favor of H1 : p1 − p2 = Δ > 0 if arcsin(√pˆ1) − arcsin(√pˆ2) ≥ z_α √(1/(4n1) + 1/(4n2)) (Eqn. 7-17).
If it is of interest to provide a p-value rather than an α-level test criterion, then the p-value of the one-sided test is given by:

p-value = P[ N(0,1) > (arcsin(√p̂1) − arcsin(√p̂2)) / √(1/(4n1) + 1/(4n2)) ] (Eqn. 7-18).
TABLE 7.4
Powers Associated with the One-Sided α-Level Test Procedure (Eqn. 7-17) for Various Combinations of p1 and Δ = p1 − p2, for α = 0.05 and n1 = n2 = n/2 = 264.

p1     p2     Power
0.18   0.10   0.848
0.20   0.10   0.947
0.22   0.10   0.985
0.28   0.20   0.696
0.30   0.20   0.846
0.32   0.20   0.935
0.38   0.30   0.617
0.40   0.30   0.779
0.42   0.30   0.892
0.48   0.40   0.583
0.50   0.40   0.748
0.52   0.40   0.870
For any values of p1, p2 under H1 : p1 − p2 = Δ > 0, the power of the (Eqn. 7-17) procedure is given by:

Power = 1 − P[ N(0,1) < zα − (arcsin(√p1) − arcsin(√p2)) / √(1/(4n1) + 1/(4n2)) ] (Eqn. 7-19).
From (Eqn. 7-19), Table 7.4 provides powers for combinations of p2 = 0.10, 0.20, 0.30, 0.40 and Δ = 0.08, 0.10, 0.12 for a one-sided α = 0.05 test with n1 = n2 = n/2 = 264. The sample size of n = 528 for this trial was set to accommodate the power calculations of the primary objective of the study, not to accommodate the power calculations of the toxicity rate comparison, which was a secondary objective. However, for equal allocation to each treatment (i.e., n1 = n2 = n/2), it is possible from (Eqn. 7-19) to calculate the sample size n required to attain a specified power for an α-level one-sided test of the hypothesis H0 : p1 − p2 = 0 versus H1 : p1 − p2 = Δ > 0. The resulting equation is:

n = (zα + zβ)² / [arcsin(√p1) − arcsin(√p2)]² (Eqn. 7-20),
where the specified power equals 1 − β. So from (Eqn. 7-20), for an α = 0.05 one-sided test of H0 : p1 − p2 = 0 versus H1 : p1 − p2 = Δ > 0, the sample size required to achieve a power of 0.80 is provided in Table 7.5 for combinations of p2 = 0.10, 0.20, 0.30, 0.40 and Δ = 0.08, 0.10, 0.12.
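As a numerical check on (Eqn. 7-19) and (Eqn. 7-20), the following sketch (standard-library Python; the function names are ours, not from the text) reproduces representative entries of Tables 7.4 and 7.5:

```python
from math import asin, sqrt, erf

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_arcsine(p1, p2, n1, n2, z_alpha=1.6449):
    """Power of the one-sided arcsine test (Eqn. 7-19)."""
    diff = asin(sqrt(p1)) - asin(sqrt(p2))
    se = sqrt(1.0 / (4 * n1) + 1.0 / (4 * n2))
    return 1.0 - phi(z_alpha - diff / se)

def total_n_arcsine(p1, p2, z_alpha=1.6449, z_beta=0.8416):
    """Total sample size n (both arms combined, equal allocation) from Eqn. 7-20."""
    diff = asin(sqrt(p1)) - asin(sqrt(p2))
    return (z_alpha + z_beta) ** 2 / diff ** 2

print(round(power_arcsine(0.18, 0.10, 264, 264), 3))  # 0.848, as in Table 7.4
print(round(total_n_arcsine(0.18, 0.10)))             # 456, as in Table 7.5
```

The remaining table entries follow by varying p1 and p2 in the same calls.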
ONCOLOGY CLINICAL TRIALS
TABLE 7.5
Sample Size Required to Attain a Power of 0.80 Associated with the One-Sided Hypothesis Test (Eqn. 7-17) for Various Values of p1 and p2, Given α = 0.05.

p1     p2     n
0.18   0.10   456
0.20   0.10   307
0.22   0.10   223
0.28   0.20   700
0.30   0.20   460
0.32   0.20   326
0.38   0.30   864
0.40   0.30   560
0.42   0.30   393
0.48   0.40   950
0.50   0.40   610
0.52   0.40   424
For a two-sided α-level test, where the alternative hypothesis is H1 : p1 − p2 ≠ 0, equations (Eqn. 7-19) and (Eqn. 7-20) still hold provided α is replaced by α/2 and, in (Eqn. 7-19), arcsin(√p1) − arcsin(√p2) is replaced by |arcsin(√p1) − arcsin(√p2)|.

Testing, Estimation, and Confidence Intervals

Testing and estimation of the primary endpoint, overall survival, will not be examined for this example, since the log-rank and Kaplan-Meier analysis procedures are examined in detail in Example 7-5. Instead, in this example we will use the secondary objective to examine the difference in grade 4 and 5 maximum hematologic toxicity rates between the two treatment arms. The number of patients with grade 4 and 5 hematologic toxic events (8) was x1 = 30 and x2 = 21 from samples of size n1 = 263 and n2 = 277 in the bevacizumab and placebo arms, respectively. Therefore, the observed toxicity rates for the bevacizumab and placebo arms were p̂1 = 30/263 = 0.1141 and p̂2 = 21/277 = 0.0758. From (Eqn. 7-18), the p-value to test the one-sided hypothesis H0 : p1 − p2 = 0 versus H1 : p1 − p2 = Δ > 0 is given by:

p-value = P[ N(0,1) > (arcsin(√0.1141) − arcsin(√0.0758)) / √(1/(4 × 263) + 1/(4 × 277)) ] = 0.0639 (Eqn. 7-21).
EXAMPLE 7-5

A randomized phase III trial of induction (daunorubicin/cytarabine) and consolidation (high-dose cytarabine) plus midostaurin or placebo in newly diagnosed FLT3-mutated AML patients.
Discussion

The primary objective of the trial was to determine if the addition of midostaurin to daunorubicin/cytarabine induction, high-dose cytarabine consolidation, and continuation therapy improves overall survival in mutant FLT3 AML patients. A secondary objective was to determine if the addition of midostaurin to daunorubicin/cytarabine induction, high-dose cytarabine consolidation, and continuation therapy improves disease-free survival (DFS) in mutant FLT3 AML patients.

Design and Hypothesis Test

To address the primary objective of comparing the overall survival of the two treatments in newly diagnosed FLT3-mutated AML patients, the patients were randomized with equal probability to the two treatments: (1) induction (daunorubicin/cytarabine) and consolidation (high-dose cytarabine) plus midostaurin, and (2) induction (daunorubicin/cytarabine) and consolidation (high-dose cytarabine) plus placebo. A formal one-sided hypothesis to compare whether the midostaurin arm improves overall survival is: H0 : λ = 1 versus H1 : λ > 1, where λ is the hazard ratio of the two survival curves. Assuming exponential event times, λ = mM/mP, where mM, mP are the median survival times of the midostaurin and placebo patient populations, respectively. The assumed total enrollment rate was 25 patients per month. For an α = 0.05 one-sided test, the trial was powered to distinguish a difference in the survival curves of the two treatments when the median survival of the daunorubicin/cytarabine plus midostaurin arm was 21 months and the median survival of the daunorubicin/cytarabine plus placebo arm was 15 months, producing a 1.4 hazard ratio under an exponential event-time assumption. Using results from George and Desu (2) and Rubenstein et al. (3), an enrollment of n = 514 patients was required, with a 20.5-month enrollment period and a 24-month follow-up period, to attain a 90% power to detect a hazard ratio of 1.4 using a one-sided log-rank test.
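As a rough cross-check on event targets for designs of this type, Schoenfeld's approximation for the number of deaths required by a log-rank test, d = 4(zα + zβ)²/(log λ)², can be sketched as follows. Note that this is not the George-Desu calculation used in the text, so treat the result as indicative only:

```python
from math import ceil, log

def schoenfeld_events(hazard_ratio, z_alpha=1.6449, z_beta=1.2816):
    """Deaths required by a log-rank test with 1:1 randomization
    (defaults: one-sided alpha = 0.05, power = 0.90), per Schoenfeld."""
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

print(schoenfeld_events(1.4))  # 303 deaths for 90% power at a hazard ratio of 1.4
```

The design's expected number of events comfortably exceeds this approximate requirement.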
During the enrollment and follow-up period, 374 survival events were expected. The secondary objective of the trial was to compare the DFS between patients in the midostaurin and placebo arms. Among patients who achieve complete remission after the induction phase, DFS time is the
period from complete remission to relapse or death, whichever comes first. Seventy-nine percent of the FLT3-mutated patients were expected to achieve complete remission in each treatment arm after the induction phase, producing 406 patients (= 0.79 × 514) with which to investigate DFS. Assuming a median DFS time of 11 months in the placebo arm provides an 87% power (2, 3) to detect an increase to 15.4 months in the midostaurin arm. During the post-induction enrollment period and the follow-up period, 335 DFS events are expected. The comparison of overall survival between the two treatment arms and the comparison of DFS between the two treatment arms can both be performed through the log-rank test. Therefore the log-rank test will be discussed relative to overall survival time; a comparable application can be made to the disease-free survival time. An excellent discussion of the log-rank test is provided by Collett (6). Suppose that there are r distinct death times occurring in either arm, and let tj represent the times when the deaths occurred for j = 1, 2, ..., r, with t1 < t2 < ... < tr. Further suppose that d1j and d2j deaths occurred at time tj in treatment arms 1 and 2, respectively, for a total of dj = d1j + d2j deaths at time tj. Under H0 : λ = 1 (i.e., if no difference exists between the two survival curves), the expected number of deaths in arm 1 at time tj equals e1j = n1j dj/nj, where n1j and n2j are the numbers of patients at risk (those who have not died) in treatment arms 1 and 2, respectively, just prior to time tj, with the total number of patients at risk nj = n1j + n2j. Then the α-level log-rank one-sided test criterion takes the form:
Reject H0 : λ = 1 in favor of H1 : λ > 1 if RL > zα (Eqn. 7-22),

where

RL = UL/√VL,

UL = Σ_{j=1}^{r} (d1j − e1j),

VL = Σ_{j=1}^{r} v1j, and

v1j = n1j n2j dj (nj − dj) / [nj² (nj − 1)].

The p-value for the one-sided test is given by:

p-value = P[N(0,1) > RL] (Eqn. 7-23).

The α-level log-rank two-sided test criterion to test the hypothesis H0 : λ = 1 versus H1 : λ ≠ 1 takes the form:

Reject H0 : λ = 1 in favor of H1 : λ ≠ 1 if RL² > χ²1,α(0) (Eqn. 7-24),

where χ²v,α(0) is the 100(1 − α) percentile of a central chi-square distribution with v degrees of freedom. The p-value for the two-sided test is given by:

p-value = P[χ²1(0) > RL²] (Eqn. 7-25),

where χ²v(0) is a central chi-square random variable with v degrees of freedom.

An estimate of the survivor curve Si(t) in the ith arm can be generated through a Kaplan-Meier curve for i = 1, 2. The Kaplan-Meier survival curve estimate Ŝi(t) for any time t in the interval tk to tk+1, k = 1, 2, ..., r, where tr+1 is defined to be ∞, is given by:

Ŝi(t) = Π_{j=1}^{k} [(nij − dij)/nij] (Eqn. 7-26),

for arms i = 1, 2. The standard error of the Kaplan-Meier estimate of the survival function is given by the Greenwood formula:

s.e.{Ŝi(t)} ≈ Ŝi(t) [ Σ_{j=1}^{k} dij/(nij(nij − dij)) ]^{1/2} (Eqn. 7-27),

for tk ≤ t < tk+1. Therefore, a 100(1 − α)% confidence interval on the survival curve Si(t) at time tk ≤ t < tk+1 for treatment i = 1, 2 is given by:

Ŝi(t) ± zα/2 s.e.{Ŝi(t)} (Eqn. 7-28).

However, the confidence intervals (Eqn. 7-28) are symmetric and can therefore extend outside the range (0, 1) for values of Ŝi(t) near 0 or 1. To correct this situation, an alternative confidence interval on Si(t), produced through a log(−log) transformation of Si(t), is given by:

[ Ŝi(t)^{exp[zα/2 s.e.{log[−log Ŝi(t)]}]}, Ŝi(t)^{exp[−zα/2 s.e.{log[−log Ŝi(t)]}]} ] (Eqn. 7-29),

where

var{log[−log Ŝi(t)]} = (1/[log Ŝi(t)]²) Σ_{j=1}^{k} dij/(nij(nij − dij)) (Eqn. 7-30),

and s.e.{log[−log Ŝi(t)]} = [var{log[−log Ŝi(t)]}]^{1/2}.
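The log-rank quantities UL, VL, and RL of (Eqn. 7-22) through (Eqn. 7-25) can be computed directly from the per-time death and at-risk counts; the sketch below uses a small, purely hypothetical dataset rather than the trial data:

```python
from math import sqrt

def logrank_statistic(counts):
    """One-sided log-rank statistic RL = UL / sqrt(VL).
    `counts` lists (d1j, n1j, d2j, n2j), one tuple per distinct death time tj."""
    UL = VL = 0.0
    for d1, n1, d2, n2 in counts:
        d, n = d1 + d2, n1 + n2
        e1 = n1 * d / n                                   # expected arm-1 deaths under H0
        UL += d1 - e1
        VL += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))  # v1j
    return UL / sqrt(VL)

# Hypothetical (d1j, n1j, d2j, n2j) at four death times:
counts = [(3, 20, 1, 20), (2, 17, 1, 19), (2, 15, 0, 18), (1, 13, 1, 18)]
RL = logrank_statistic(counts)
print(round(RL, 2))  # 1.77; compare with z_alpha per (Eqn. 7-22)
```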
Testing, Estimation, and Confidence Intervals

Although this trial had just opened to enrollment at the time of this writing, Table 7.6 provides representative death times tj and values of d1j, n1j, d2j, n2j, dj, nj, e1j, and v1j for a total enrollment of 514 patients. The times are grouped in tenths-of-a-year increments so that there is a manageable number of death times with which to demonstrate the numerical calculations.
From Table 7-6, UL = 38.2903 and VL = 86.5266. Therefore,

RL = 38.2903/√86.5266 = 4.12 (Eqn. 7-31),
and from (Eqn. 7-23) the p-value <0.0001. Since the follow-up time in this simulated data example is 2 years, all censored observations have times
TABLE 7.6
Example Deaths and Associated Times for the Two Treatment Arms.

tj*   d1j   n1j   d2j   n2j   dj   nj    e1j       d1j − e1j   v1j
0.1    15   257     7   257   22   514   11.0000    4.00000    5.27485
0.2    14   242     9   250   23   492   11.3130    2.68699    5.49091
0.3    16   228    12   241   28   469   13.6119    2.38806    6.59109
0.4    13   212    11   229   24   441   11.5374    1.46259    5.67791
0.5    14   199    11   218   25   417   11.9305    2.06954    5.87720
0.6    10   185     7   207   17   392    8.0230    1.97704    4.06325
0.7    10   175     5   200   15   375    7.0000    3.00000    3.59358
0.8     7   165     6   195   13   360    5.9583    1.04167    3.11955
0.9    12   158    10   189   22   347   10.0173    1.98271    5.12495
1.0     7   146     9   179   16   325    7.1877   −0.18769    3.77548
1.1     5   139    10   170   15   309    6.7476   −1.74757    3.54352
1.2     6   134     7   160   13   294    5.9252    0.07483    3.09252
1.3     7   128     5   153   12   281    5.4662    1.53381    2.85933
1.4     4   121     4   148    8   269    3.5985    0.40149    1.92814
1.5     8   117     2   144   10   261    4.4828    3.51724    2.38763
1.6     9   109     3   142   12   251    5.2112    3.78884    2.81843
1.7     6   100     5   139   11   239    4.6025    1.39749    2.56430
1.8     4    94     6   134   10   228    4.1228   −0.12281    2.32699
1.9     4    90     5   128    9   218    3.7156    0.28440    2.10121
2.0     7    86     1   123    8   209    3.2919    3.70813    1.87212
2.1     5    79     3   122    8   201    3.1443    1.85572    1.84167
2.2     2    71     3   109    5   180    1.9722    0.02778    1.16760
2.3     2    65     3   100    5   165    1.9697    0.03030    1.16464
2.4     4    59     2    88    6   147    2.4082    1.59184    1.39225
2.5     4    52     4    80    8   132    3.1515    0.84848    1.80795
2.6     4    46     0    75    4   121    1.5207    2.47934    0.91899
2.7     1    39     3    72    4   111    1.4054   −0.40541    0.88675
2.8     0    34     1    64    1    98    0.3469   −0.34694    0.22657
2.9     2    31     4    53    6    84    2.2143   −0.21429    1.31295
3.0     0    22     4    44    4    66    1.3333   −1.33333    0.84786
3.1     1    21     0    35    1    56    0.3750    0.62500    0.23438
3.2     0    18     0    30    0    48    0.0000    0.00000    0.00000
3.3     0    15     0    25    0    40    0.0000    0.00000    0.00000
3.4     0     9     0    18    0    27    0.0000    0.00000    0.00000
3.5     1     9     2    15    3    24    1.1250   −0.12500    0.64198
3.6     0     4     0     7    0    11    0.0000    0.00000    0.00000
3.7     0     1     0     1    0     2    0.0000    0.00000    0.00000

Total  204    −    164    −   368    −    −        38.2903    86.5266

*Years. Treatment 1 = placebo; treatment 2 = midostaurin.
greater than or equal to 2 years. Conversely, all patients who have event times less than 2 years are associated with deaths, not with censored events. This fact can be observed in Table 7-6 by noting that the number of deaths in either arm, d1j or d2j, at any time tj ≤ 2 equals the decrease in the number of patients at risk at the next time point tj+1, i.e., dij = ni,j − ni,j+1, or ni,j+1 = ni,j − dij, for i = 1, 2. For example, from Table 7-6 at time tj = 0.3 years, the number of deaths in treatment 1 is d1j = 16 and the number at risk is n1j = 228. Then, since all events in treatment 1 at time tj = 0.3 years were deaths (i.e., there were no censored observations), the number at risk at the next time tj+1 = 0.4 is n1,j+1 = n1,j − d1j = 228 − 16 = 212. For times greater than 2 years, some events are deaths and some events are censored. For example, at time tj = 2.7 years, the number of deaths in treatment 2 is d2j = 3 and the number at risk is n2j = 72. Then in treatment 2 at time tj+1 = 2.8 years, the number at risk is n2,j+1 = 64, indicating a decrease in the number of patients at risk of n2,j − n2,j+1 = 72 − 64 = 8, of which d2j = 3 are deaths and 5 are censored observations. The previous discussion is presented to explain how the difference in the number of patients at risk between times tj and tj+1 is influenced by the number of deaths and censored observations at time tj. Note that in actual trials not all patients are followed for the full follow-up period. Therefore, the event time is the period from the enrollment date to the last observed date. If the difference between the last observed date and the enrollment date is less than the follow-up period and no event has occurred, then these censored observations have event times that are smaller than the follow-up time. To develop a Kaplan-Meier survival function estimate and an estimate of the standard error of the Kaplan-Meier estimate, (Eqn. 7-26), (Eqn. 7-27), and the values in Table 7-6 are used jointly.
The required values tk ≤ t < tk+1, nij, dij, nj are reproduced in the first six columns of Table 7-7, followed by the Kaplan-Meier survival function estimate Ŝi(t) and the standard error s.e.{log[−log Ŝi(t)]} used in (Eqn. 7-29), for treatments i = 1, 2. For example, to estimate Ŝ1(t) for 0.3 ≤ t < 0.4 we apply (Eqn. 7-26) and obtain:

Ŝ1(t) = [(257 − 15)/257] × [(242 − 14)/242] × [(228 − 16)/228] = 0.825.
Calculated in a similar manner from Table 7-7, for the midostaurin arm in the interval 2.0 ≤ t < 2.1 the Kaplan-Meier estimated survival function is Ŝ2(t) = 0.475. Then, using the n2j, d2j values,

Σ_{j=1}^{k} d2j/[n2j(n2j − d2j)] = 0.00431 (Eqn. 7-32),

where the (Eqn. 7-32) sum is taken over all values of d2j, n2j up to and including those in the 2.0 ≤ t < 2.1 interval of Table 7.7. Therefore, using the Greenwood formula from (Eqn. 7-26) through (Eqn. 7-28), a 95% confidence interval on S2(t) in the interval 2.0 ≤ t < 2.1 is given by:

Ŝ2(t) ± z0.025 s.e.{Ŝ2(t)} = 0.475 ± 1.96 × 0.0312 = (0.414, 0.536) (Eqn. 7-33),

where

s.e.{Ŝ2(t)} ≈ 0.475 × (0.00431)^{1/2} = 0.0312 (Eqn. 7-34).

For comparison purposes, from (Eqn. 7-29) and (Eqn. 7-30), a 95% confidence interval on S2(t) in the interval 2.0 ≤ t < 2.1 is given by:

[ 0.475^{exp(1.96 √0.00778)}, 0.475^{exp(−1.96 √0.00778)} ] = (0.413, 0.535) (Eqn. 7-35),

where

var{log[−log Ŝ2(t)]} = (1/[log Ŝ2(t)]²) Σ_{j=1}^{k} d2j/[n2j(n2j − d2j)] = 0.00431/[log(0.475)]² = 0.00778 (Eqn. 7-36).
Note that the survival function estimate Ŝ2(t) = 0.475 is not near 0 or 1, and therefore the two confidence intervals from (Eqn. 7-33) and (Eqn. 7-35) are almost identical. Using the values of Table 7.7, Figure 7.2 provides the Kaplan-Meier estimated survival functions for both treatments. Finally, Figure 7.3 provides the Kaplan-Meier estimated survival function for the midostaurin treatment arm, with confidence bounds calculated using (Eqn. 7-29).
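The worked numbers above can be recomputed from the d2j, n2j columns of Table 7-6 (rows through tj = 2.0); the following sketch evaluates (Eqn. 7-26) through (Eqn. 7-30). Small last-digit differences from the text arise from rounding Ŝ2(t) to 0.475:

```python
from math import exp, log, sqrt

# (d2j, n2j) for the midostaurin arm, tj = 0.1, ..., 2.0 (Table 7-6)
d2 = [7, 9, 12, 11, 11, 7, 5, 6, 10, 9, 10, 7, 5, 4, 2, 3, 5, 6, 5, 1]
n2 = [257, 250, 241, 229, 218, 207, 200, 195, 189, 179, 170, 160, 153, 148,
      144, 142, 139, 134, 128, 123]

S = 1.0
greenwood = 0.0
for d, n in zip(d2, n2):
    S *= (n - d) / n                  # Kaplan-Meier product (Eqn. 7-26)
    greenwood += d / (n * (n - d))    # Greenwood sum (Eqn. 7-32)

se = S * sqrt(greenwood)              # Greenwood standard error (Eqn. 7-27)
ci_plain = (S - 1.96 * se, S + 1.96 * se)     # symmetric interval (Eqn. 7-28)

se_loglog = sqrt(greenwood / log(S) ** 2)     # (Eqn. 7-30)
ci_loglog = (S ** exp(1.96 * se_loglog),
             S ** exp(-1.96 * se_loglog))     # log(-log) interval (Eqn. 7-29)

print(round(S, 3), round(greenwood, 5))        # 0.475 0.00431
print(tuple(round(v, 3) for v in ci_plain))    # (0.414, 0.536)
print(tuple(round(v, 3) for v in ci_loglog))   # approximately (0.413, 0.534)
```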
TABLE 7.7
Kaplan-Meier Survival Function Estimates and Standard Errors.

T               d1j   n1j   Ŝ1(t)   s.e.1†   d2j   n2j   Ŝ2(t)   s.e.2†
0.1 ≤ t < 0.2    15   257   0.942   0.258     7   257   0.973   0.378
0.2 ≤ t < 0.3    14   242   0.887   0.186     9   250   0.938   0.250
0.3 ≤ t < 0.4    16   228   0.825   0.149    12   241   0.891   0.190
0.4 ≤ t < 0.5    13   212   0.774   0.132    11   229   0.848   0.160
0.5 ≤ t < 0.6    14   199   0.720   0.118    11   218   0.805   0.142
0.6 ≤ t < 0.7    10   185   0.681   0.111     7   207   0.778   0.133
0.7 ≤ t < 0.8    10   175   0.642   0.105     5   200   0.759   0.127
0.8 ≤ t < 0.9     7   165   0.615   0.101     6   195   0.735   0.122
0.9 ≤ t < 1.0    12   158   0.568   0.096    10   189   0.696   0.114
1.0 ≤ t < 1.1     7   146   0.541   0.094     9   179   0.661   0.108
1.1 ≤ t < 1.2     5   139   0.521   0.092    10   170   0.623   0.102
1.2 ≤ t < 1.3     6   134   0.498   0.090     7   160   0.595   0.099
1.3 ≤ t < 1.4     7   128   0.471   0.088     5   153   0.576   0.097
1.4 ≤ t < 1.5     4   121   0.455   0.087     4   148   0.560   0.095
1.5 ≤ t < 1.6     8   117   0.424   0.085     2   144   0.553   0.095
1.6 ≤ t < 1.7     9   109   0.389   0.083     3   142   0.541   0.094
1.7 ≤ t < 1.8     6   100   0.366   0.082     5   139   0.521   0.092
1.8 ≤ t < 1.9     4    94   0.350   0.081     6   134   0.498   0.090
1.9 ≤ t < 2.0     4    90   0.335   0.080     5   128   0.479   0.088
2.0 ≤ t < 2.1     7    86   0.307   0.079     1   123   0.475   0.088
2.1 ≤ t < 2.2     5    79   0.288   0.079     3   122   0.463   0.087
2.2 ≤ t < 2.3     2    71   0.280   0.079     3   109   0.450   0.087
2.3 ≤ t < 2.4     2    65   0.271   0.079     3   100   0.437   0.086
2.4 ≤ t < 2.5     4    59   0.253   0.079     2    88   0.427   0.086
2.5 ≤ t < 2.6     4    52   0.233   0.080     4    80   0.406   0.086
2.6 ≤ t < 2.7     4    46   0.213   0.081     0    75   0.406   0.086
2.7 ≤ t < 2.8     1    39   0.208   0.081     3    72   0.389   0.086
2.8 ≤ t < 2.9     0    34   0.208   0.081     1    64   0.383   0.086
2.9 ≤ t < 3.0     2    31   0.194   0.083     4    53   0.354   0.088
3.0 ≤ t < 3.1     0    22   0.194   0.083     4    44   0.322   0.091
3.1 ≤ t < 3.2     1    21   0.185   0.085     0    35   0.322   0.091
3.2 ≤ t < 3.3     0    18   0.185   0.085     0    30   0.322   0.091
3.3 ≤ t < 3.4     0    15   0.185   0.085     0    25   0.322   0.091
3.4 ≤ t < 3.5     0     9   0.185   0.085     0    18   0.322   0.091
3.5 ≤ t < 3.6     1     9   0.164   0.103     2    15   0.279   0.113
3.6 ≤ t < 3.7     0     4   0.164   0.103     0     7   0.279   0.113

†s.e.i = s.e.{log[−log Ŝi(t)]}, the standard error used in (Eqn. 7-29).
FIGURE 7.2 Kaplan-Meier estimated survival functions for the midostaurin and placebo arms using simulated data (placebo: N = 257, events = 204; midostaurin: N = 257, events = 164; x-axis: years from study entry; y-axis: overall survival proportion).

FIGURE 7.3 Kaplan-Meier estimated survival function plus confidence intervals for the midostaurin arm using simulated data (N = 257, events = 164; x-axis: years from study entry; y-axis: overall survival proportion).
References

1. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials. 1989;10:1–10.
2. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some clinical event. Journal of Chronic Diseases. 1974;27:15–24.
3. Rubenstein LV, Gail MH, Santner TJ. Planning the duration of a comparative clinical trial with loss to follow-up and a period of continued observation. Journal of Chronic Diseases. 1981;34:469–479.
4. Berger JO. Statistical Decision Theory. 2nd ed. New York: Springer-Verlag; 1985.
5. Moser BK. Linear Models: A Mean Model Approach. San Diego: Academic Press; 1996.
6. Collett D. Modelling Survival Data in Medical Research. Boca Raton: Chapman and Hall; 1994.
7. Edelman MJ, Watson D, Wang X, et al. Eicosanoid modulation in advanced lung cancer: COX-2 expression is a positive predictive factor for celecoxib + chemotherapy. Journal of Clinical Oncology. 2008; in press.
8. Stone RM, Moser BK, Schulman P, et al. High-dose cytarabine plus gemtuzumab ozogamicin for patients with refractory and relapsed acute myeloid leukemia: CALGB Study 19902, submitted; 2008.
9. Kindler H, Niedzwiecki D, Hollis D, Oraefo E. Summary report: a randomized phase III trial of gemcitabine plus bevacizumab versus gemcitabine plus placebo in patients with advanced pancreatic cancer. CALGB July 2007 Summary Report.
8
Design of Phase I Trials

Anastasia Ivanova
Leslie A. Bunce
This chapter gives examples illustrating various research problems in dose-finding trials in oncology, and designs that address these problems. Special consideration is given to the design and implementation of phase I clinical trials of an investigational agent being clinically developed to treat cancer. Unlike most other investigational products, for which first-in-man and other early phase I clinical investigation occurs in a population of consenting, healthy human volunteers, clinical trials in oncology, including the earliest phase I trials, are conducted in consenting cancer patients who may have run out of commercially available therapeutic options and to whom the administration of a potentially beneficial, but as yet unproven, cytotoxic agent is more acceptable than it would be to healthy volunteers. Although recent years have seen the discovery and clinical development of noncytotoxic, targeted agents and biologics to treat certain types of cancer, the focus of the remainder of this chapter will be on traditional cytotoxic agents. The purpose of phase I cancer trials is multifold, relating to the dosing, tolerability, and safety profile of the cytotoxic oncology drug, as well as to the study of the effect of the body on the drug (pharmacokinetics) and the effect of the drug on the body (pharmacodynamics). Phase I cancer trials are conducted, in part, to establish the maximum dose level tolerated by the study
patients without experiencing predefined dose-limiting toxicities (of specified type and degree of severity) deemed unacceptable. This important threshold is called the maximum tolerated dose (MTD). As the MTD is defined in the context of dose-limiting toxicities, let us characterize these in more detail. The National Cancer Institute's (NCI) Common Terminology Criteria for Adverse Events (CTCAE), version 3.0 (available online from the Cancer Therapy Evaluation Program Web site at http://ctep.cancer.gov/protocolDevelopment/), designates laboratory ranges, symptoms, and signs as toxicities or adverse events (AEs) and assigns each AE a severity grade from 1 to 5: grade 1 for a mild AE, grade 2 for a moderate AE, grade 3 for a severe AE, grade 4 for a life-threatening AE, and grade 5 for a fatal AE. Not every AE, however, is scored over the full range of grades 1 to 5. In cancer clinical trials, a dose-limiting toxicity (DLT) is typically defined as a nonhematologic side effect related to the investigational agent scored with a severity grade of 3 or higher, or a hematologic side effect related to the investigational agent scored with a severity grade of 4 or higher. Typically, the occurrence of a DLT in a cancer trial subject results in permanent withdrawal of study drug from the subject, although the subject is still usually followed for a period of time as part of the study for ongoing safety
assessments. In contrast, other toxicities may occur that do not meet the criteria set forth in the study protocol as a DLT, but which lead the study investigators to reduce the study drug dose of that particular subject. Toxicities that will require dose reductions, as well as the algorithms for how to reduce the dose, are described in the clinical trial protocol. When assessing toxicity outcome, to determine feasibility and safety of dose escalation for the next dosing cohort of subjects, the outcome assessment is traditionally binary: DLT or no DLT. The likelihood of DLT is assumed to be a nondecreasing function of the dose. A typical study design for a phase I cancer trial is administration of the investigational agent in cycles, often with the subject receiving several administrations of the study agent per cycle, sequentially followed by more than one cycle at the same dose, if tolerated, and especially if there is observed benefit. Typically, written into a phase I trial protocol are the mandated review and safety assessments at the end of a certain number of cycles of study drug administered to the subjects in the same dosing cohort to determine if the subjects are safely tolerating the study drug at the tested dose. If no safety concerns are found after this evaluation, then the study investigators are allowed to proceed to enroll the next cohort of subjects to receive the next higher dose of study drug. Determination of the MTD is a major goal of a phase I clinical trial in oncology. As Rosenberger and Haines discuss, there is frequently no statistical definition of the MTD (1); instead the MTD is defined as the dose level immediately below the one at which a DLT occurs in more than one subject out of a dosing cohort of six subjects. The problem with this type of definition for MTD is the ambiguity of how to handle the situation of DLT occurring among subjects in a dosing cohort of less than or greater than six subjects. 
From a statistical viewpoint, the MTD is described as the dose at which the probability of DLT is equal to the maximum tolerable level, or target level, Γ. Target level Γ is usually set to 0.25, though values from 0.1 to 0.35 are sometimes used. Once the estimated MTD is determined, additional subjects are often assigned to receive the estimated MTD to accrue more clinical experience with that dose. Here the statistical definition is very useful, since the DLT rate computed from all study subjects, including the additional subjects, might be much higher or lower than the target level Γ, indicating that the initial estimate of the MTD was not precise. Preclinical findings for a study drug significantly help to guide the starting dose used in first-in-man studies, which is typically one-tenth or two-tenths of the human equivalent of the murine dose of the study drug that results in 10% mortality in preclinical animal studies
(termed LD10 and expressed in milligrams per meter squared). Once clinical development has begun with testing in humans, further dosing is refined based upon the growing clinical trial experience with the study drug. The planned starting dose to be administered in a particular cancer trial and the subsequent dose escalations during that trial are all preplanned, predetermined, and clearly written out in the clinical trial protocol, to be adhered to strictly by the participating investigators, unless a protocol amendment affecting the dosing occurs. One method of selecting the starting dose and subsequent dose escalations is the modified Fibonacci sequence (2), in which successive dose-level escalations use relatively decreasing increments (e.g., increases of 100%, 67%, 50%, 40%, and 30% thereafter over the prior dose level). In phase I cancer studies, sequentially increasing the dose of a cytotoxic investigational drug is generally thought to lead to improved antitumor effect and efficacy; however, continued dose escalation with the goal of improved therapeutic effect is usually restricted at some point by the eventual development of unacceptable toxicity (dose-limiting toxicity, or DLT). Administration of study drug, beginning with the lowest dose level, to subjects assigned to small cohorts, especially early on in a phase I trial, allows one to proceed cautiously to higher doses of study drug administered to later cohorts of newly enrolling subjects, while limiting the number of subjects assigned to low doses that are predicted to have little or no efficacy. A key objective is to administer to the greatest number of subjects, as feasible, a dose of the study drug that is close to or equal to the maximally tolerated dose, in order to accrue as much clinical data (including effect, tolerability, and safety assessments) as possible.
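As an illustration of the modified Fibonacci scheme just described, the dose levels can be generated programmatically; the starting dose below is an arbitrary assumption, and the rounding convention is ours:

```python
def modified_fibonacci_doses(start_dose, n_levels,
                             increments=(1.00, 0.67, 0.50, 0.40)):
    """Dose levels under modified-Fibonacci escalation: each step adds a
    decreasing percentage of the prior dose (30% for every later step)."""
    doses = [float(start_dose)]
    for i in range(n_levels - 1):
        pct = increments[i] if i < len(increments) else 0.30
        doses.append(round(doses[-1] * (1 + pct), 2))
    return doses

print(modified_fibonacci_doses(10, 6))  # [10.0, 20.0, 33.4, 50.1, 70.14, 91.18]
```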
TRADITIONAL, STANDARD, OR 3 + 3 DESIGN

The most frequently used design in phase I trials in oncology is the traditional or standard or 3 + 3 design (2, 3). This design is often associated with the set of dose levels obtained using the modified Fibonacci sequence, described previously in this chapter (2). Therefore, this design is sometimes also referred to as the Fibonacci escalation scheme. However, this is actually a misnomer, as the escalation design establishes the rule for when and in what direction to adjust the dose level for the next cohort of subjects (e.g., to decrease, keep the same, or increase the dose level), and not the degree of the dose escalation as a changing percentage of the prior dose, which the Fibonacci sequence does. The 3 + 3 design can be used with any set of dose levels, regardless of the incremental change of the dose levels per each dose escalation (or dose decrease, in the setting of DLT).

The 3 + 3 design is defined as follows. Subjects are assigned in cohorts of three, starting with the lowest dose of study drug, with the following provisions:

· If 0/3 DLTs occur, increase the dose to the next protocol-designated level.
· If 1/3 DLTs occur (meaning that among the three subjects, one had a DLT), repeat the same dose in an expanded cohort of six, adding three more subjects to receive the study drug at the same original dose.
· If 1/6 DLTs occur, increase the dose for the next cohort.
· If ≥2/6 DLTs occur, STOP.
· If ≥2/3 DLTs occur (meaning that among the three subjects, two or more had a DLT), STOP, and assign three more subjects to receive the lower dose if there are only three subjects thus far in that dose-level cohort.

The estimated MTD is the dose at which 0/3 or 1/6 subjects experienced a DLT; from another perspective, the estimated MTD is the highest dose level with an observed DLT rate of less than 0.33. The 3 + 3 design does not require sample size specification; the escalation is continued until a dose with an unacceptable number of DLTs is observed.

TABLE 8.1
Hypothetical Trial with the 3 + 3 Design.

ACTION                           OUTCOME                  DECISION
Assign 3 subjects to 1 mg        0 DLTs out of 3          Increase dose for the next cohort of 3
Assign 3 subjects to 2 mg        0 DLTs out of 3          Increase dose for the next cohort of 3
Assign 3 subjects to 3.3 mg      0 DLTs out of 3          Increase dose for the next cohort of 3
Assign 3 subjects to 5 mg        1 DLT out of 3           Repeat dose for the expanded cohort of 6, adding 3 new subjects to the same dose-level cohort
Assign 3 more subjects to 5 mg   0 DLTs out of new 3;     Increase dose for the next cohort of 3
                                 1 DLT out of 6 total
Assign 3 subjects to 7 mg        2 DLTs out of 3          STOP

EXAMPLE 8-1
3 + 3 design.
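Before walking through the numbers, the escalation rule can be sketched as a simulation (our code, not from the text; it uses the simplified convention that a trial escalating past the top dose declares that dose the MTD, and it omits the expand-the-lower-dose provision):

```python
import random

def one_3plus3_trial(dlt_rates, rng):
    """Run one 3 + 3 trial; return the index of the estimated MTD
    (-1 if even the lowest dose proves too toxic)."""
    level = 0
    while level < len(dlt_rates):
        dlts = sum(rng.random() < dlt_rates[level] for _ in range(3))
        if dlts == 0:
            level += 1                                   # 0/3: escalate
        elif dlts == 1:                                  # 1/3: expand cohort to 6
            dlts += sum(rng.random() < dlt_rates[level] for _ in range(3))
            if dlts == 1:
                level += 1                               # 1/6: escalate
            else:
                return level - 1                         # >=2/6: stop
        else:
            return level - 1                             # >=2/3: stop
    return len(dlt_rates) - 1                            # all doses tolerated

rng = random.Random(2010)
doses = [1, 2, 3.3, 5, 7, 9]                 # mg
rates = [0.01, 0.05, 0.10, 0.20, 0.35, 0.50]
results = [one_3plus3_trial(rates, rng) for _ in range(10_000)]
for i, dose in enumerate(doses):
    pct = 100 * results.count(i) / len(results)
    print(f"{dose} mg declared MTD in {pct:.0f}% of trials")
```

Under this DLT-rate scenario, the simulation yields percentages close to the 3%, 10%, 25%, 38%, 20%, and 4% figures quoted in this example.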
Consider a hypothetical trial where doses of 1, 2, 3.3, 5, 7, and 9 mg are selected for the trial. Let the DLT rates at these six doses be 0.01, 0.05, 0.10, 0.20, 0.35, and 0.50. Obviously, the DLT rates are not known before the trial. Table 8.1 is an example of such a trial. The estimated MTD is the 5 mg dose. A dose-finding trial can result in other scenarios as well. For example, one can observe no DLTs at the lowest dose level and two DLTs at the 2 mg dose (using the hypothetical clinical trial dosing scheme described above), resulting in the estimated MTD being the dose of 1 mg. If one runs, say, 10,000 hypothetical trials based on a dosing scheme such as this, then:

· 3% of the trials will result in declaring 1 mg as the MTD;
· 10% of the trials will result in declaring 2 mg as the MTD;
· 25% of the trials will result in declaring 3.3 mg as the MTD;
· 38% of the trials will result in declaring 5 mg as the MTD;
· 20% of the trials will result in declaring 7 mg as the MTD;
· 4% of the trials will result in declaring 9 mg as the MTD.

In general, the likelihood of declaring a certain dose the MTD in the 3 + 3 design depends on the DLT rate at this dose, at the dose one level higher, and at all lower dose levels. The operating characteristics of the 3 + 3 design have been extensively studied for a number of dose-toxicity scenarios by exact computations and simulations (4, 5, 6, 7). On average, the dose chosen by the 3 + 3 design as the MTD has a DLT rate between 0.2 and 0.25. Since three to six subjects are assigned to each dose in the 3 + 3 design, the estimate of the MTD is not precise. This is illustrated by the example above, where
only 38% of all trials resulted in selecting the correct dose as the estimated MTD. One can improve the precision of the estimation by considering designs similar to the 3 + 3 but with a larger cohort size. Consider the 3 + 6 design, described as follows.

EXAMPLE 8-2  3 + 6 design.
· If 0/3 DLTs occur, increase the dose to the next protocol-designated dose level.
· If 1/3 DLT occurs, repeat the same dose in an expanded cohort of nine, adding six more subjects to receive the study drug at the same dose.
· If ≤2/9 DLTs occur, increase the dose for the next cohort.
· If ≥3/9 DLTs occur, STOP.
· If ≥2/3 DLTs occur, STOP, and assign six more subjects to receive the next lower dose, if there are only three subjects at that dose level thus far.
The estimated MTD is the highest dose level with an observed DLT rate less than 0.33. In other words, for the 3 + 6 design, the MTD is the dose where 0/3 subjects or 2/9 subjects (or fewer) experienced a DLT. If one runs, say, 10,000 hypothetical trials based on the example dosing scheme:
· 1% of the trials will result in declaring 1 mg as the MTD;
· 5% of the trials will result in declaring 2 mg as the MTD;
· 22% of the trials will result in declaring 3.3 mg as the MTD;
· 41% of the trials will result in declaring 5 mg as the MTD;
· 25% of the trials will result in declaring 7 mg as the MTD;
· 5% of the trials will result in declaring 9 mg as the MTD.
This compares to the 3%, 10%, 25%, 38%, 20%, and 4% results for the 3 + 3 design described previously. The average number of trial subjects in the 3 + 6 design is 21.9, compared to 17.4 in the 3 + 3 design. The 3 + 6 design provides slightly better precision than the 3 + 3, and the average sample size is not that much greater. Designs like these are called A + B designs (6, 8) and can be constructed for different Γ and cohort sizes (8). The advantage of the 3 + 3 design, or any A + B design, is that it is easy to use and it does not require
sample size specification. One of the disadvantages is that it is not flexible. The inflexibility of this type of clinical trial design is illustrated in this slightly modified actual example of a phase I trial evaluating the administration of a cytotoxic investigational drug for the management of a localized, solid tumor. In the first cohort of three subjects assigned to the lowest dose of the study drug, one subject experiences a DLT. Therefore, this dosing cohort expands to include three more subjects assigned to the same lowest dose. Within this expanded cohort of now six subjects, a subject develops elevated liver function tests, and according to protocol his study drug is held until liver function tests decrease to an acceptable level. This grade 3 toxicity of liver function test abnormalities is considered by the investigators to be likely due to a concomitant medication (other than the study drug) that the subject is taking; yet there remains a slight likelihood that the DLT is study drug-related. Counting this toxicity as a DLT would result in two DLTs occurring out of six subjects, with the resultant conclusion that the MTD had been exceeded at this lowest dose. The investigators in this example decide to count this event as half of a DLT. Since the likelihood that the toxicity is study drug-related is sufficiently low, counting the event as half of a DLT is a reasonable compromise. The next three subjects are assigned to the same lowest dose as the first six subjects in this expanded cohort. Thus, a total of nine subjects are assigned to the lowest dose, resulting in 1.5 DLTs. The estimated DLT rate at the lowest dose is 1.5/9 = 1/6 ≈ 0.17, allowing the investigators to assign the next cohort of subjects to the next higher dose level.
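The operating characteristics quoted earlier (the 3%, 10%, 25%, 38%, 20%, and 4% split across the six hypothetical doses) can be approximated by Monte Carlo simulation. The sketch below is a simplified Python implementation of the 3 + 3 rules; it is illustrative rather than a validated trial-design tool, and the exact frequencies depend on how de-escalation and expansion ties are handled.

```python
import random

def simulate_3plus3(dlt_rates, rng):
    """Simulate one 3 + 3 trial; return the index of the estimated MTD (-1 if none)."""
    L = len(dlt_rates)
    n = [0] * L  # subjects treated at each dose level
    x = [0] * L  # DLTs observed at each dose level

    def treat(j, k):
        n[j] += k
        x[j] += sum(rng.random() < dlt_rates[j] for _ in range(k))

    j, escalating = 0, True
    while 0 <= j < L:
        if n[j] == 0:
            treat(j, 3)                        # first cohort of three
        if n[j] == 3 and (x[j] == 1 or not escalating):
            treat(j, 3)                        # expand the cohort to six
        if x[j] >= 2:
            escalating = False                 # too toxic: step down a level
            j -= 1
        elif escalating and j + 1 < L:
            j += 1                             # 0/3 or 1/6 DLTs: escalate
        else:
            break                              # current dose is the estimated MTD
    return j if j >= 0 else -1

rng = random.Random(1)
rates = [0.01, 0.05, 0.10, 0.20, 0.35, 0.50]   # hypothetical true DLT rates
results = [simulate_3plus3(rates, rng) for _ in range(10_000)]
freq = [results.count(d) / len(results) for d in range(len(rates))]
```

With these hypothetical DLT rates, the 5 mg dose (index 3) is typically selected most often, in rough agreement with the exact computations cited in the text.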
When the 3 + 3 design is used, the occurrence of two DLTs at a dose precludes further investigation at this and higher dose levels, even when there is strong evidence that the occurrence of a DLT at lower dose levels is very unlikely. When the true DLT rate is 0.25, the probability of seeing two or more DLTs out of three subjects exposed is 0.16. The probability of observing one DLT out of three subjects exposed to study drug and then one or more out of the next three subjects is equal to 0.42 × 0.58 = 0.24. That is, with probability 0.16 + 0.24 = 0.40 the true MTD will be discarded from MTD consideration. This is a rather high probability with which to discard this dose and all higher doses.
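These probabilities follow directly from the binomial distribution; a quick standard-library check (values match the text to rounding):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.25  # true DLT rate at the dose in question

# P(>= 2 DLTs out of 3 subjects)
p_2plus_of_3 = sum(binom_pmf(k, 3, p) for k in range(2, 4))     # ~0.16

# P(exactly 1 of 3) times P(>= 1 of the next 3)
p_1_of_3 = binom_pmf(1, 3, p)                                   # ~0.42
p_1plus_of_3 = 1 - binom_pmf(0, 3, p)                           # ~0.58
p_escalation_blocked = p_2plus_of_3 + p_1_of_3 * p_1plus_of_3   # ~0.40
```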
DESIGNS THAT TARGET TOXICITY RATE OTHER THAN 0.2 OR 0.25

In certain cases only particular DLT types are of interest, and therefore the maximum tolerable DLT rate is lower than the usual 0.2 or 0.25. In this somewhat
modified example of an actual phase I study of a new cytotoxic drug being evaluated in hematology patients, the DLT is defined as grade 4 thrombocytopenia (significantly reduced platelet count). The maximum tolerable DLT rate in the study is set equal to Γ = 0.1. It is not appropriate to use the 3 + 3 design in this case, since it might result in assigning subjects to doses with rates of grade 4 thrombocytopenia that are much higher than 0.1. The 5 + 5 design (8) can be used instead:
· If 0/5 DLTs occur, increase the dose.
· If 1/5 DLT occurs, repeat the same dose in an expanded cohort of ten, adding five additional subjects to receive the same original dose for that cohort.
· If 1/10 DLT occurs, increase the dose for the next cohort of subjects.
· If ≥2/10 DLTs occur, STOP.
· If ≥2/5 DLTs occur, STOP.
The estimated MTD is the highest dose level with an observed DLT rate less than 0.2. The investigators eventually decided to use the standard definition of the MTD and the 3 + 3 design. How to construct A + B designs for various quantiles Γ is described by Ivanova (8).
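These rules can be encoded directly; a sketch of the 5 + 5 decision logic (an illustrative helper, not from any software package):

```python
def five_plus_five_decision(n_treated, n_dlt):
    """Decision at the current dose under the 5 + 5 rules described above."""
    if n_treated == 5:
        if n_dlt == 0:
            return "escalate"      # 0/5 DLTs: next dose level
        if n_dlt == 1:
            return "expand"        # 1/5 DLT: add five more at the same dose
        return "stop"              # >= 2/5 DLTs: MTD exceeded
    if n_treated == 10:
        return "escalate" if n_dlt <= 1 else "stop"   # 1/10 vs. >= 2/10
    raise ValueError("cohort size must be 5 or 10")
```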
TRIALS WITH ORDINAL TOXICITY OUTCOME

Often subjects in a cancer clinical trial experience toxicities that are not dose limiting, but that require dose reductions for that particular patient. Consider a hypothetical example, where the drug is given daily, the
cycle length is 2 weeks, and DLT is determined on the basis of two cycles. Dose levels chosen for the trial are 1, 2, and 3.3 mg. The first cohort of three subjects is assigned to the lowest dose of 1 mg (Dose 1), and the exposure to study drug at this Dose 1 does not result in any DLTs or other toxicities. According to the 3 + 3 design, the next cohort of three subjects is then assigned to the next dose of 2 mg (Dose 2). After the first cycle, a subject experiences toxicity (but not a protocol-defined DLT) that requires dose reduction to 1 mg, and for all subsequent dose administrations this subject receives a dose of 1 mg, although formally this subject remains in the dose cohort of 2 mg. In the middle of the second cycle, one more subject in Cohort 2 experiences a non-DLT AE that requires dose reduction and receives 1 mg from that point on. By the end of the second cycle, two out of three subjects in this Cohort 2 are receiving a dose of 1 mg, even though formally they remain in a dose cohort of 2 mg. The third subject in this dose cohort experiences a true DLT and is withdrawn from the study. Based upon the occurrence of a DLT (Subject #3 in Cohort 2) and using the 3 + 3 design for cohort size modification, Cohort 2 is expanded to include three more subjects receiving a starting dose of 2 mg of study drug. For illustrative purposes of this fictional example, let's say that these additional three subjects (Subjects #4, #5, and #6 in the expanded Cohort 2) also experience toxicities while on the 2 mg dose: toxicities that do not meet the criteria of being a DLT, but of sufficient concern that for safety reasons the investigator chooses to continue these three subjects in the trial at a reduced dose of 1 mg. Table 8.2 is a summary of this hypothetical trial, where subsequent assignments are also described.
TABLE 8.2
Hypothetical Trial with the 3 + 3 Allocation that Takes Dose Reductions into Account.

COHORT (DOSE)                        DLTS       DOSE REDUCTIONS
Cohort 1, dose 1 (1 mg)              0/3 DLTs   No reductions in any of these 3 subjects
Cohort 2, dose 2 (2 mg)              1/3 DLT    Reductions in 2/3 subjects in this Cohort 2; study drug is permanently withdrawn from Subject #3 with the DLT
Cohort 2 (expanded), dose 2 (2 mg)   0/3 DLTs   Dose reductions in the additional 3 subjects in expanded Cohort 2 due to non-DLT adverse events
Cohort 3, dose 3 (3.3 mg)            2/3 DLTs   No reductions in these 3 subjects; the two subjects experiencing DLTs are withdrawn from the study, but still followed with ongoing safety evaluations
According to the 3 + 3 algorithm and based on the number of DLTs, in the above example the estimated MTD is Dose 2 (2 mg). The question is then: does it make sense to recommend Dose 2 for future trials if five out of six subjects (in Cohort 2) had to be dose-reduced at this dose? If a situation similar to this is likely, one possible solution is to define the MTD not only on the basis of DLTs, but also on other toxicities that can result in dose reductions. This concern is illustrated in the following example, somewhat modified from an actual trial, in which the MTD is defined by

Pr[DLT] + 0.5 × Pr[dose reduction] = 0.25,

where Pr[DLT] is the DLT rate at the protocol-intended dose, and Pr[dose reduction] is the rate of dose reductions at that dose. For example, one can have a DLT rate of 0.15 and a dose reduction rate of 0.2 at the MTD. We assume that each subject can have either a DLT, or a dose reduction (due to a non-DLT AE), or no toxicity. The 3 + 3 design can easily be extended to use this definition of the MTD. Call any DLT an event, and call dose reductions observed in two subjects in a cohort an event. Now apply the 3 + 3 design to the number of events rather than the number of DLTs. For example, if one or more subjects in a cohort of three develops a DLT, or if two or more subjects in a cohort of three require dose reduction in the absence of a DLT, then three additional subjects are assigned to the expanded cohort to receive the same dose level of study drug (meaning the protocol-designated dose level for that cohort, not the dose-reduced level). In general, designs for an MTD defined like this can be described using the DLT/dose reduction score (8). After each cohort is assigned and the toxicity outcome is observed, the DLT/dose reduction score is computed:

SCORE = Pr[DLT] + 0.5 × Pr[dose reduction].

· If SCORE = 0 or 0.17 (SCORE ≤ 0.2), increase the dose and proceed to enrolling subjects in the next higher dose cohort.
· If SCORE = 0.33 (0.2 < SCORE < 0.35), add three more subjects to an expanded cohort at the same dose level.
· Then, if SCORE ≤ 0.25, increase the dose and proceed to enrolling subjects in the next higher dose cohort.
· If SCORE > 0.25, STOP.
· If SCORE ≥ 0.35, STOP.
The estimated MTD is the highest dose level with observed SCORE less than or equal to 0.25. A more
complex trial with many toxicity types and grades is described in the work of Bekele and Thall (9).
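The score-based cohort decision described above can be encoded directly; a sketch with illustrative function names:

```python
def score_decision(n_dlt, n_reduced, n):
    """Cohort decision based on the DLT/dose-reduction score.

    n_dlt:     subjects with a DLT in the cohort
    n_reduced: subjects requiring dose reduction (non-DLT AE)
    n:         cohort size so far at this dose (3 or 6)
    """
    s = n_dlt / n + 0.5 * n_reduced / n   # observed SCORE
    if n == 3:
        if s <= 0.2:
            return "escalate"             # SCORE of 0 or 0.17
        if s < 0.35:
            return "expand"               # SCORE of 0.33: add 3 more subjects
        return "stop"
    if n == 6:
        return "escalate" if s <= 0.25 else "stop"
    raise ValueError("cohort size must be 3 or 6")
```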
TRIALS WITH LONG FOLLOW-UP

As we have seen, many dose-finding problems in oncology can be addressed by designs similar to the 3 + 3 design. However, there are challenges that require different approaches. One such challenge is the dose-finding trial with long follow-up. Since many oncology trials comprise multiple cycles of study drug administered over an extended duration, there is often a need to follow subjects for DLTs for a long period of time. For example, radiation therapy trials often require long follow-up since long-term toxicities are likely to occur. Consider this slightly modified example of an actual trial of a new cytotoxic drug used in conjunction with other drugs for prophylaxis of graft-versus-host disease in patients undergoing stem cell transplantation. The goal is to identify the MTD, defined as a dose with a DLT rate of 0.25. The DLT is defined as any irreversible grade 3 toxicity or any grade 4 nonhematologic toxicity related to study drug observed during the first 42 days following stem cell infusion. Late treatment-related toxicities, such as veno-occlusive disease of the liver, pulmonary fibrosis, or neurotoxicity, can also count as DLTs if they occur within 60 months after study drug initiation. A protocol based upon the 3 + 3 design, with each cohort followed for 60 months, would result in a trial of very long duration. One possible approach is to use the 3 + 3 design and to base decisions regarding dose escalation on the 42-day follow-up period. One then asks how to take into account late toxicities that occur beyond 42 days but within 60 months of the start of study drug administration. One possible design solution is described below. The idea is to blend two designs: first, a design similar to the 3 + 3 scheme used at the beginning of the trial, which we will refer to as a start-up rule; and second, an up-and-down type design that is invoked after the first DLT is observed.
In the start-up rule, clinical trial subjects are assigned in cohorts of three. During this start-up phase subjects are monitored for the potential emergence of AEs, including any AE that might develop into a DLT. If no DLTs occur in a cohort during the first 42 days after study drug initiation, then the next cohort of three subjects is assigned to the next higher dose level of study drug. If the study investigator chooses to enroll any additional subjects prior to the completion of the most recent cohort's first 42 days
of study drug exposure at the current highest dose, then any additional subject could be given the lower, tried-and-tested dose level for which there has already been the protocol-required 42-day minimum dose-exposure experience of three patients without occurrence of a DLT. All study subjects continue to be monitored for AEs during the entire span of the 60-month study period. The start-up continues until the first DLT is seen in any of the subjects, at which point the dosing assignments follow the up-and-down design, starting from the dose associated with the newly documented DLT. In the up-and-down design, the goal is to assign each new subject to the dose that is believed to be the MTD based on the data available so far. Let q be the estimated DLT rate at the time when a patient is ready for assignment. We will illustrate later how to compute this estimate. Assume that the most recent subject is assigned to dose dj. Then the dose for the next subject is determined based on the following algorithm:
· if q ≤ Γ − Δ, the next patient is assigned to dose dj+1;
· if q ≥ Γ + Δ, the next patient is assigned to dose dj−1;
· if Γ − Δ < q < Γ + Δ, the next patient is assigned to dose dj.
Special provisions are in place in case the lowest dose appears to have a high rate of DLT. To make the decision rules in the up-and-down design similar to the 3 + 3 design, Γ = 0.26 and the design parameter Δ = 0.09 are used. For example, if a DLT is observed at a dose, the dose can be escalated according to the up-and-down design if six patients are assigned to this dose with only one DLT observed. More details about the design and recommendations regarding the choice of the design parameter can be found in the review by Ivanova, Flournoy, and Chung (10). For example, for Γ between 0.1 and 0.25, Δ = 0.09 is recommended; for Γ between 0.30 and 0.35, Δ = 0.10; for Γ = 0.40, Δ = 0.12; and for Γ between 0.45 and 0.5, Δ = 0.13.
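A compact sketch of the up-and-down assignment rule, paired with the weighted DLT-rate estimate that the remainder of this section derives (function and variable names are illustrative; Γ = 0.26 and Δ = 0.09 as in the text):

```python
def next_dose(j, q_hat, n_doses, gamma=0.26, delta=0.09):
    """Up-and-down rule: index of the dose for the next subject."""
    if q_hat <= gamma - delta and j + 1 < n_doses:
        return j + 1          # estimated DLT rate low enough: escalate
    if q_hat >= gamma + delta and j > 0:
        return j - 1          # estimated DLT rate too high: de-escalate
    return j                  # within (gamma - delta, gamma + delta): stay

def estimate_dlt_rate(y, w, tol=1e-8, max_iter=200):
    """Weighted DLT-rate estimate at a dose (iterative scheme from this section).

    y[i] = 1 if subject i had a DLT, else 0;
    w[i] = min(follow-up time / T, 1), set to 1 whenever y[i] = 1.
    """
    theta = sum(y) / sum(w)                                   # Step 0
    for _ in range(max_iter):
        a = [1.0 / (1.0 - wi * theta) for wi in w]            # Step 1
        theta_new = (sum(ai * yi for ai, yi in zip(a, y))
                     / sum(ai * wi for ai, wi in zip(a, w)))  # Step 2
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta_new

# Worked example from later in this section: one DLT, one complete
# follow-up, and one subject halfway through follow-up
q = estimate_dlt_rate([1, 0, 0], [1.0, 1.0, 0.5])   # converges to ~0.42
nxt = next_dose(2, q, n_doses=6)                    # 0.42 >= 0.35, so step down
```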
The DLT rate at each dose is estimated using all DLT information available at that dose so far. Let T be the follow-up time, T = 60 months. Assume that the DLT time is uniformly distributed on (0, T); that is, a DLT is equally likely to occur at any time during follow-up. At the time of analysis, for subject i define wi = min{(follow-up time)/T, 1}, the proportion of time the subject has been followed so far, and the DLT indicator yi = 0 if no DLT so far, or yi = 1 if a DLT has occurred. The weight wi is set to 1 if a subject had a DLT. The estimate q of the DLT rate at a dose can be computed using the following iterative algorithm:
Step 0. θ = (∑ yi) / (∑ wi), summing over i = 1, . . . , n.
Step 1. ai = 1 / (1 − wiθ), i = 1, . . . , n.
Step 2. θ_new = (∑ ai yi) / (∑ ai wi).
Then iterate between Steps 1 and 2. Mathematically speaking, θ maximizes the likelihood

L(θ) = ∏ (wiθ)^yi (1 − wiθ)^(1 − yi), with the product over i = 1, . . . , n,

and the algorithm described above finds the θ that maximizes L(θ). For example, suppose the data at a dose are:
Subject 1: DLT (y1 = 1, w1 = 1);
Subject 2: no DLT, follow-up time = T (y2 = 0, w2 = 1);
Subject 3: no DLT so far, follow-up time = T/2 (y3 = 0, w3 = 0.5).
To estimate q we use the iterative algorithm described above:
Step 0. θ = (1 + 0 + 0) / (1 + 1 + 0.5) = 0.40.
Step 1. a1 = 1/(1 − 1 × 0.4) = 1.67, a2 = 1/(1 − 1 × 0.4) = 1.67, a3 = 1/(1 − 0.5 × 0.4) = 1.25.
Step 2. θ_new = (1.67 × 1 + 1.67 × 0 + 1.25 × 0) / (1.67 × 1 + 1.67 × 1 + 1.25 × 0.5) = 0.42.
Step 1. a1 = 1/(1 − 1 × 0.42) = 1.72, a2 = 1/(1 − 1 × 0.42) = 1.72, a3 = 1/(1 − 0.5 × 0.42) = 1.27.
Step 2. θ_new = (1.72 × 1 + 1.72 × 0 + 1.27 × 0) / (1.72 × 1 + 1.72 × 1 + 1.27 × 0.5) = 0.42.
The value of θ is the same in the last two iterations, indicating that the algorithm has converged, and the estimated DLT rate is q = 0.42. At the end of the trial, the dose with the estimated DLT rate closest to 0.25 is declared the MTD. The design described in this section allows the trial to proceed relatively quickly, provides flexibility in the assignment of subjects, and will likely result in assigning many subjects to the MTD. The total sample size for the trial has to be specified in advance, and is usually 20 to 30 patients. It is recommended to run simulations using plausible dose-toxicity models to see if the
proposed total sample size yields a high likelihood of selecting the right dose as the MTD.

DISCUSSION

Numerous designs have been developed for phase I trials in oncology. We have reviewed only designs that are easy to implement without a computer. Among others, the design that should be mentioned is the Continual Reassessment Method (CRM). The CRM (11) was developed with the goal of bringing experimentation close to the MTD as soon as possible and assigning as many subjects as possible to the MTD. The CRM uses a working model as a tool that allows all information available in the trial to be used to derive, for the next subject, the dose assignment believed closest to the MTD. We focused our attention on dose-finding trials with a single anticancer agent. Almost every oncology trial nowadays involves a combination of anticancer therapies. Methods have been developed that allow one to fully evaluate the space created by multiple doses, instead of changing one dose at a time; these include methods described by Thall et al. and by Ivanova and Wang (12, 13). Similar design problems arise in trials in which subjects can be stratified before the trial into two (or more) subpopulations according to their susceptibility to toxicity. For example, patients can be divided into two subpopulations using genetic information available before the trial. One then has to run two different dose-finding trials, one for each subpopulation; however, the study can be made more efficient if the two trials exchange information (14, 15).
References
1. Rosenberger WF, Haines LM. Competing designs for phase I clinical trials: a review. Statistics in Medicine. 2002;21:2757–2770.
2. Storer BE. Design and analysis of phase I clinical trials. Biometrics. 1989;45:925–937.
3. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon RM. A comparison of two phase I trial designs. Statistics in Medicine. 1994;13:1799–1806.
4. Kang SH, Ahn C. The expected toxicity rate at the maximum tolerated dose in the standard phase I cancer clinical trial design. Drug Information Journal. 2001;35(4):1189–1200.
5. Kang SH, Ahn C. An investigation of the traditional algorithm-based designs for phase I cancer clinical trials. Drug Information Journal. 2002;36:865–873.
6. Lin Y, Shih WJ. Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001;2:203–215.
7. Reiner E, Paoletti X, O'Quigley J. Operating characteristics of the standard phase I clinical trial design. Computational Statistics and Data Analysis. 1999;30:303–315.
8. Ivanova A. Escalation, up-and-down and A+B designs for dose-finding trials. Statistics in Medicine. 2006;25:3668–3678.
9. Bekele BN, Thall PF. Dose-finding based on multiple toxicities in a soft tissue sarcoma trial. Journal of the American Statistical Association. 2004;99:26–35.
10. Ivanova A, Flournoy N, Chung Y. Cumulative cohort design for dose-finding. Journal of Statistical Planning and Inference. 2007;137:2316–2317.
11. O'Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990;46:33–48.
12. Thall P, Millikan R, Mueller P, Lee S-J. Dose finding with two agents in phase I oncology trials. Biometrics. 2003;59:487–496.
13. Ivanova A, Wang K. A nonparametric approach to the design and analysis of two-dimensional dose-finding trials. Statistics in Medicine. 2004;23:1861–1870.
14. O'Quigley J, Paoletti X. Continual reassessment method for ordered groups. Biometrics. 2003;59:430–440.
15. Ivanova A, Wang K. Bivariate isotonic design for dose-finding with ordered groups. Statistics in Medicine. 2006;25:2018–2026.
9. Design of Phase II Trials

Hongkun Wang, Mark R. Conaway, Gina R. Petroni
The most common primary goal of a phase II trial is to assess the therapeutic efficacy of a new agent or treatment regimen, and to decide whether the activity of the new agent or regimen shows enough promise to warrant further investigation. Most phase II trials use one-sample designs, in which all the patients accrued are treated with the new agent or treatment regimen. Methods of single-stage, multistage, and sequential designs and analyses have been proposed and are used in practice. In this chapter, we will describe some of the most common designs and provide examples of the use of each approach in practice. We will discuss designs with multiple arms and multiple endpoints and give some concluding remarks toward the end of the chapter. In order to illustrate the different design choices, we will apply all the methods we discuss to a common example based upon a phase II study by Rietschel et al. (1) of extended-dose temozolomide (TMZ) in patients with melanoma. Patients with stage IV or unresectable stage III melanoma were enrolled into the study, and two cohorts of patients, with or without M1c disease, were studied. Detailed information regarding patient entry criteria, dose and administration of TMZ, study design, and study results can be found in the original paper. The same study design and target rates were used within each cohort, and will serve as the basis to illustrate many of the designs discussed. In some
examples, cohort-specific data will be identified to display sequential design strategies.
SINGLE-STAGE DESIGNS

Hypothesis Testing Framework

In a single-stage phase II trial, n patients are accrued and treated. Based on the anticipated number of responses, a statistical test is formulated to decide whether the new therapy should be tested further for efficacy. It is common to define the response variable as a dichotomous outcome, where patients are classified as having responded or not responded depending upon prespecified criteria. The population proportion of patients who respond to the new therapy is denoted by p. The population proportion of patients who respond to the standard therapy is denoted by p0, which is assumed known. With this notation, the hypothesis test can be written as H0: p ≤ p0 versus Ha: p > p0. The null hypothesis, H0, will be rejected if the number of observed responses is greater than a specified threshold; otherwise we fail to reject the null hypothesis. If X is the random variable that counts the number of responses in a sample of n patients, then X follows a binomial(n, p) distribution with probability density function P(X = k) = C(n, k) p^k (1 − p)^(n − k), where C(n, k) = n!/(k!(n − k)!). If we let r
denote the number of observed responses, and assume pa is the response rate under the alternative hypothesis that is "important not to miss," then the sample size n and critical value c should satisfy the following equations:
α = P(reject H0 | p = p0) = P(X ≥ c | p0) = ∑ from k = c to n of C(n, k) p0^k (1 − p0)^(n − k)   (Eqn. 9-1),

and

1 − β = P(reject H0 | p = pa) = P(X ≥ c | pa) = ∑ from k = c to n of C(n, k) pa^k (1 − pa)^(n − k)   (Eqn. 9-2),

where α and β are the maximum tolerable levels for the probability of type I (false positive) and type II (false negative) errors. The type I error α represents the probability that the study will incorrectly identify a new therapy as "sufficiently promising" when the new therapy is no more effective than the standard. The type II error β represents the probability that the study will incorrectly identify a truly promising therapy as "not sufficiently promising." In the TMZ example, it was assumed that a null response rate of 10% would be considered "not sufficiently promising," while an alternative response rate similar to that of extended-dose TMZ with antiangiogenic agents, 30%, would be considered "sufficiently promising." Type I and type II error rates were set at 0.10. Using the above formulae (9-1) and (9-2) with p0 = 0.1 and pa = 0.3, a sample of size 25 yields α = 0.098 and β = 0.090 (power = 0.910) with a critical value of c = 5 responses. Thus, at the end of the study we would reject the null hypothesis in favor of the alternative if 5 or more responses are observed among the 25 patients. For large n and moderate p, the test can be based on p̂ = r/n, which is asymptotically normally distributed. A test statistic for testing H0 is given by

z = (p̂ − p0) / √(p0(1 − p0)/n) = (r − np0) / √(np0(1 − p0))   (Eqn. 9-3),

which asymptotically follows a normal distribution. H0 will be rejected if z > z(1−α), where z(1−α) is the (1 − α) percentile of the standard normal distribution. The p-value is given by P(Z ≥ z | p0) = 1 − Φ(z), where Φ is the standard normal cumulative distribution function. The sample size n required to have significance level α and power 1 − β can be calculated approximately from

n = { z(1−α) [p0(1 − p0)]^(1/2) + z(1−β) [pa(1 − pa)]^(1/2) }^2 / (p0 − pa)^2,

with z(1−α) and z(1−β) the (1 − α) and (1 − β) percentiles of the standard normal distribution. The sample size from the large-sample approximation can be quite different from the sample size calculated from the binomial formulae (9-1 and 9-2). Using the TMZ design parameters with a type I error rate of 0.1 and power of 0.9, the required sample size would be 30 using equation (9-3).
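The exact binomial calculations behind the TMZ single-stage design (n = 25, critical value c = 5, p0 = 0.1, pa = 0.3) can be verified with a few lines of standard-library Python:

```python
from math import comb

def binom_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

# TMZ design: reject H0 if 5 or more responses among 25 patients
alpha = binom_tail(5, 25, 0.1)   # type I error, ~0.098
power = binom_tail(5, 25, 0.3)   # 1 - beta, ~0.910
```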
Confidence Interval or Precision Framework

In a single-stage phase II trial, the sample size can be chosen to satisfy a specified level of precision, where a useful measure of precision is the confidence interval around the estimated parameter. Here the goal is estimation, not hypothesis testing. Consider the 95% confidence interval for p, the proportion of interest. At the end of the trial p is estimated by p̂ ± 1.96 √(p̂(1 − p̂)/n), where p̂ = (# responses)/n. For sample size estimation one must decide how narrow or precise a 95% confidence interval is desired at the final analysis (for example, ±10%). Narrowness is defined by the half-width of the 95% confidence interval, which is equal to 1.96 √(p(1 − p)/n). With an initial guess for p, the equation can be solved for n, or one can set p = 0.5, the value that maximizes p(1 − p). This method uses the normal approximation to the binomial distribution. For extreme values of p (say, p < 0.2 or p > 0.8) estimation should be based upon exact binomial confidence limits. In the TMZ example, we could assume that the response rate will be similar to that of extended-dose TMZ with antiangiogenic agents, 30%. Setting p = 0.30, we would need n = 36 for a half-width of 15% or n = 81 for a half-width of 10%.

MULTISTAGE DESIGNS

In a multistage setting, patients are accrued into the study in several stages. Testing is performed at each stage after a predefined target accrual has been completed. At each stage a decision of whether to terminate the trial early or to continue to the next stage is made. In the most commonly used multistage designs, accrual stops early only when the preliminary data support inactivity of the new agent. A variety of early stopping rules have been proposed.

Gehan's Two-Stage Designs

The two-stage design for the phase II setting was first proposed by Gehan (2) in 1961. The goal was to identify the minimum number of patients to observe with a
sequence of no responses before concluding that the new drug was not worthy of further study. It is important to note that Gehan proposed this design when the response rate for the new agent under investigation was expected to be low, for instance, no more than 20%. In the first stage, n1 patients are enrolled and X1 responses are observed. If X1 = 0, the trial is closed to accrual. If one or more patients respond (X1 > 0), then accrual to the second stage begins. In the second stage, n2 additional patients are accrued and X2 responses are observed. The number of additional patients is chosen to estimate the response rate within a specified level of precision. With this design the true response rate, p, can be estimated by (X1 + X2)/(n1 + n2), the total number of responders in both stages divided by the total number of patients accrued in both stages. In Gehan's design, the first stage sample size n1 is chosen to give a small probability of early stopping (say, 5%) at a fixed target response rate pa, considered to be the minimum response rate of interest, in a sequence of no responses. The first stage sample size n1 is found by solving 0.05 = P(Stop early | pa) = P(X1 = 0) = (1 − pa)^n1. The second stage sample size n2 is chosen to give sufficient precision for estimating the response rate p after all patients are observed. The precision is based on choosing a desired value for the standard error, SE = √(p(1 − p)/(n1 + n2)). For example, if one wants the precision of effectiveness to be no greater than s, then n2 can be chosen to solve

s = √(p̂1(1 − p̂1)/(n1 + n2)), where p̂1 = X1/n1   (Eqn. 9-4).

At the end of the trial, p is estimated by p̂ = (X1 + X2)/(n1 + n2), with an approximate standard error given by SE(p̂) = √(p̂(1 − p̂)/(n1 + n2)). To be conservative, Gehan suggested using the upper 75% confidence limit for the true percentage of treatment success in the first sample in equation (9-4). In Gehan's design, the number of patients accrued in the second stage depends on the number of responses in the first stage and the desired standard error. In practice, Gehan's design is often used with 14 patients in the first stage and 11 patients in the second stage. This provides for estimation with approximately a 10% standard error. Higher precision provides better estimates but requires much larger sample sizes. Gehan's design is sufficient for the very specific context in which it was derived, for instance, a low expected
response rate, but it has practical and statistical disadvantages in other situations, and is not often used today. In the TMZ design, the response rate of interest is defined as at least 0.3 (pa ≥ 0.3). Following Gehan's approach, requiring P(Stop early | pa) ≤ 0.1 in the first stage gives n1 = 7; the probability of observing 7 consecutive failures is (1 − 0.3)^7 = 0.082, so the chance of at least one success is 1 − 0.082 = 0.918, or 91.8%. Thus if 0/7 responses are observed, we would reject the new treatment and would be approximately 92% confident that the response rate with the new treatment is less than 30%. If at least one response is observed, then the trial would continue to accrue to the second stage. Assuming a precision of 10% or 5% with one observed success in the first stage, the sample size for the second stage is approximately 16 or 83 patients, respectively (equation [9-4]).

Fleming's K-Stage Designs

The design proposed by Gehan (2) allowed for early termination only if preliminary data supported that the new therapy was likely to be ineffective. Schultz et al. (3) defined a general multiple testing procedure as an alternative to the single-stage design. Fleming (4) continued the work of Schultz et al. and proposed the K-stage design to determine appropriate acceptance and rejection regions at each stage. In addition to allowing for early stopping if a treatment appears ineffective, Fleming's design allows for early stopping if the treatment appears overwhelmingly effective, while preserving (approximately) the size and power characteristics of a single-stage design. The study is done in K stages, with the kth stage sample size equal to nk, k = 1, . . . , K. The total sample size over all stages, n = n1 + n2 + . . . + nK, is guided by the fixed sample design using the exact binomial calculations (9-1 and 9-2) and the large-sample approximation.
Fleming derived upper and lower boundaries that would allow investigators to make decisions based on the accumulated number of responses observed by the end of each stage. If the number of responses at the end of the kth stage exceeds the upper boundary for the kth stage, the study can be terminated with the conclusion that the treatment shows promise. If the accumulated number of responses is less than the lower boundary, the study can be terminated with the conclusion that the treatment is not effective. If neither boundary is crossed, the study continues by enrolling the next set of patients. The stage boundaries are chosen to preserve the type I and type II error rates of the fixed sample size design. Even though Fleming derived the boundaries based on normal approximations, he
ONCOLOGY CLINICAL TRIALS
evaluated the properties via simulations for small sample cases and concluded that the approximations give close answers. In Fleming's design, decisions in favor of or against the new agent occur only when interim results are extreme. This permits the final analysis to be unaffected by interim monitoring if early termination does not occur. In practice, investigators rarely choose to terminate a phase II trial early when the data support that the treatment is effective. Instead, they want to continue to gather supportive data and to plan the possible follow-up phase III trial, which takes time to develop. Referring to the TMZ design parameters, a Fleming two-stage design with 25 total patients and 15 in the first stage would yield an exact α value of 0.093 and a β value of 0.102. In the first stage the trial would stop for futility if one or fewer responses were observed, or stop in favor of the new agent if at least five responses were observed; otherwise the trial would continue to a second stage. At the final analysis (i.e., the end of the second stage), if five or fewer responses are observed among all the patients, we would fail to reject the null hypothesis and conclude that the new agent does not warrant further study. If six or more responses are observed, we would reject the null hypothesis and conclude that the data support that the new agent is worthy of further study. Note in this example that the final critical value to reject the null hypothesis is increased by one compared to the single-stage decision rule.

Simon's Optimal Two-Stage Designs

Simon (5) proposed an optimal two-stage design, where optimality is defined in terms of minimizing the expected sample size when the true response probability is p0. The trial is terminated early only for ineffective therapies.
Simon argued that when the new agent has substantial activity, it is important for planning the larger comparative trial to estimate the proportion, extent, and durability of response for as many patients as possible in the phase II trial. The hypothesis to be tested is the same as in other phase II trials, H0: p ≤ p0 versus Ha: p > p0. The Simon design consists of choosing stage sample sizes n1 and n2 along with decision rules c1 and c2, where c1 and c2 are critical values to guide decisions at each stage. At the first stage, n1 patients are accrued and X1 responses are observed. If too few responses are observed in the first stage (X1 ≤ c1), the trial is stopped and the treatment is declared "not sufficiently promising" to warrant further study in a comparative trial (fail to reject H0). If the number of patients who respond exceeds the prespecified boundary (X1 > c1), an additional n2 patients will be accrued into the second stage of the study. Among the n2 patients accrued in the
second stage, X2 responses are observed. If the total number of responders in both stages exceeds the prespecified boundary (X1 + X2 > c2), the null hypothesis is rejected and the treatment is deemed sufficiently promising. If there are too few observed responders (X1 + X2 ≤ c2), then the null hypothesis is not rejected. The boundaries c1 and c2 are chosen to meet the following type I and type II error specifications: (size) P[X1 > c1, X1 + X2 > c2 | p0] ≤ α, and (power) P[X1 > c1, X1 + X2 > c2 | pa] ≥ 1 − β. There are many choices of (c1, c2) that meet the type I and type II error specifications. Among all sets of (c1, c2) that meet the requirements, Simon proposed to choose the boundaries that minimize the expected sample size under H0. Intuitively, this is a sensible criterion in that, among all designs that meet the type I and type II error specifications, the optimal design is the one that on average treats the fewest patients with a therapy that is no better than the current standard therapy. Simon tabulated the optimal designs for various choices of the design parameters N = n1 + n2, p0, pa, α, and β. In addition, Simon tabulated designs for an alternate optimality criterion, which he called the "minimax design." This design has the smallest maximum sample size among all designs that meet the type I and type II error requirements. The stage sample sizes and the boundaries for the two optimality criteria can be very different. Simon pointed out that in cases when the difference in expected sample sizes is small and the patient accrual rate is low, the minimax design may be more attractive than the design with the minimum expected sample size under H0. The TMZ example was designed using the Simon minimax design. The study called for accrual of 16 patients in the first stage and 9 additional patients in the second stage, for a maximum accrual of 25 patients.
At the first stage, the trial would stop for futility if ≤1 response was observed; otherwise the trial would go on to the second stage. At the final analysis, if ≥5 responses were observed among all 25 patients, then the null hypothesis would be rejected in favor of the alternative. In the TMZ study, Simon's minimax design resulted in the same total sample size and final decision rule as the one determined from the single-stage design; that will not always be the case. If Simon's optimal design had been chosen, then accrual to the first stage would have been set at 12 patients, and 23 additional patients would have been accrued to the second stage if the first-stage stopping criterion had not been met. With the optimal design the
9 DESIGN OF PHASE II TRIALS
trial would stop at the first stage if ≤1 response was observed; otherwise it would continue to the second stage. At the final analysis, if ≥6 responses were observed among all 35 patients, then the null hypothesis would be rejected in favor of the alternative. In this setting the stage and total sample sizes and critical values differed considerably between the two designs.
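The operating characteristics of the two TMZ designs can be checked with a short exact calculation. The sketch below is illustrative: it assumes p0 = 0.10 and pa = 0.30 (the null rate is not stated explicitly in this excerpt) and evaluates both the minimax boundaries (stop if ≤1/16, reject if ≥5/25) and the optimal boundaries (stop if ≤1/12, reject if ≥6/35).

```python
from math import comb

def b(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_ocs(p, n1, c1, n, c2):
    """Return (P(reject H0), P(early stop), E[N]) for a Simon two-stage
    design: stop and accept H0 after stage 1 if X1 <= c1; otherwise
    accrue n - n1 more patients and reject H0 if X1 + X2 > c2."""
    n2 = n - n1
    pet = sum(b(x1, n1, p) for x1 in range(c1 + 1))   # early-termination prob.
    reject = 0.0
    for x1 in range(c1 + 1, n1 + 1):
        need = c2 + 1 - x1                            # responses still needed
        tail = 1.0 if need <= 0 else sum(b(x2, n2, p) for x2 in range(need, n2 + 1))
        reject += b(x1, n1, p) * tail
    return reject, pet, n1 + (1 - pet) * n2

# Minimax (n1=16, c1=1; n=25, c2=4) and optimal (n1=12, c1=1; n=35, c2=5)
# designs from the text, under the assumed p0 = 0.10 and pa = 0.30.
alpha_mm, pet_mm, en_mm = simon_ocs(0.10, 16, 1, 25, 4)
power_mm, _, _ = simon_ocs(0.30, 16, 1, 25, 4)
alpha_opt, pet_opt, en_opt = simon_ocs(0.10, 12, 1, 35, 5)
power_opt, _, _ = simon_ocs(0.30, 12, 1, 35, 5)
```

Under these assumptions both designs hold the type I error near 0.10 with power near 0.90, and the optimal design has the smaller expected sample size under H0, exactly the trade-off the text describes.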
FULLY SEQUENTIAL DESIGNS

The multistage designs can be considered group-sequential designs in which patients are entered in cohorts. A decision of whether to terminate the trial early or to continue to the next stage is made after each cohort is accrued and the test statistic is calculated. In a fully sequential design, an analysis is performed after the outcome of each new patient is observed, using a test statistic based on the data accumulated to that point. The test statistic is then compared with an upper and a lower boundary. If the test statistic falls in the region between the boundaries, an additional patient is sampled and his or her response to the new treatment is observed. If, however, the test statistic falls above the upper (below the lower) boundary, then accrual is stopped and the null (alternative) hypothesis is rejected. A fully sequential design requires continuous monitoring of the study results, patient by patient, and thus is often difficult to implement. Herson (6) proposed a fully sequential design for phase II trials. In this setting the null and alternative hypotheses were defined as H0: p ≥ pa versus Ha: p < pa, where pa is a "minimum acceptable response rate" chosen by the investigators. The technical details of Herson's method, which uses Bayesian predictive distributions, are beyond the scope of this chapter, but the idea is intuitive. A fixed sample size trial would reject Herson's H0 if at the end of the trial there were too few responders. At any point in the trial, given the number of patients treated so far and the number of observed responders, it is possible to predict the probability that the null hypothesis will be rejected after all n patients have been observed. If this predicted probability is too low or too high, it is reasonable to stop the trial. Thall et al. (7) also proposed a fully sequential design using Bayesian methods.
The design requires specification of the response rate for the standard therapy p0, the response rate for the experimental therapy p, the prior for p0, the prior for p, a targeted improvement for the new therapy d0, and bounds Nmin and Nmax on the allowable sample size. The priors for p0 and p are chosen to be independent beta distributions. Thall et al. give recommendations for choosing the parameters of the prior distributions.
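The monitoring quantity at the heart of this design is the posterior probability that the experimental response rate exceeds the standard rate by at least the targeted improvement d0. A minimal Monte Carlo sketch, assuming beta priors as just described; the function name and every numeric value below are illustrative, not from the chapter.

```python
import random

def prob_exceeds_target(x_e, n_e, prior_e, prior_s, d0,
                        n_draws=100_000, seed=7):
    """Monte Carlo estimate of P(p_E > p_S + d0 | data).

    The standard-therapy rate p_S is drawn from its (unchanged) beta
    prior, since no standard-arm data accrue in a single-arm trial;
    the experimental rate p_E is drawn from its beta posterior after
    x_e responses in n_e patients."""
    rng = random.Random(seed)
    a_e, b_e = prior_e[0] + x_e, prior_e[1] + n_e - x_e   # conjugate update
    a_s, b_s = prior_s
    hits = sum(
        rng.betavariate(a_e, b_e) > rng.betavariate(a_s, b_s) + d0
        for _ in range(n_draws)
    )
    return hits / n_draws

# Hypothetical numbers: 15/20 responses vs. a Beta(2, 8) standard prior
# (mean 0.20) with a targeted improvement of 0.20.
high = prob_exceeds_target(15, 20, (1, 1), (2, 8), 0.20)
low = prob_exceeds_target(1, 20, (1, 1), (2, 8), 0.20)
```

A trial would be stopped early for futility when this probability falls too low, which is the behavior the TMZ example below illustrates.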
After the outcome from each patient is observed, Thall et al. calculate the posterior probability that the new therapy will be shown to be effective. The trial continues until the maximum sample size is reached or the experimental therapy is shown with high probability to be effective. Thall et al. also calculate the posterior probability that the new therapy will meet the targeted improvement in response rate, and they terminate the trial early if this probability is too low. One of the advantages of the Thall et al. approach is that it allows the uncertainty in the response rate with the standard therapy to be incorporated into the design. A disadvantage is that it requires continuous monitoring of the data and numerous analyses, and it can often arrive at decisions that are "not convincing either way." For the TMZ example, responses were observed in the 12th, 15th, and 23rd patients in the first cohort and in the 2nd, 15th, and 21st patients in the second cohort. Using a targeted improvement of 0.20 and the priors recommended in Thall and Simon, the trial would be stopped in the first cohort at patient 4 or at the preselected minimum sample size Nmin, whichever is larger. In the second cohort, the trial would be stopped at patient 9 or at the preselected minimum sample size, whichever is larger. In each case, the trial would be stopped because of a low posterior probability that the therapy would meet the targeted improvement in response rate. Lee et al. (8) proposed a predictive probability design for phase II cancer clinical trials based on a Bayesian predictive probability framework. The predictive probability is defined as the probability of observing a positive result by the end of the trial, based on the information accumulated by the current stage. A higher (lower) predictive probability means that the new treatment is (is not) likely to be declared efficacious by the end of the study, given the current data.
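The predictive probability itself can be computed exactly with beta-binomial machinery. The sketch below is a simplified illustration, assuming a Beta(1, 1) prior and a success rule of the form "posterior P(p > p0) > θT at Nmax"; it is not the Lee and Liu search procedure, and all numbers are hypothetical.

```python
from math import comb, lgamma, exp

def beta_sf(a, b, p0):
    """P(Beta(a, b) > p0) for integer a and b, via the identity
    P(Beta(a, b) > p0) = P(Binomial(a + b - 1, p0) <= a - 1)."""
    n = a + b - 1
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(a))

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_probability(x, n, n_max, p0, theta_t, a0=1, b0=1):
    """Probability that the trial ends in success -- i.e., posterior
    P(p > p0) > theta_t after n_max patients -- given x responses in
    the first n patients and an integer Beta(a0, b0) prior on p."""
    m = n_max - n                        # patients still to be accrued
    a, b = a0 + x, b0 + n - x            # current posterior parameters
    pp = 0.0
    for y in range(m + 1):               # future responses: beta-binomial
        w = comb(m, y) * exp(log_beta_fn(a + y, b + m - y) - log_beta_fn(a, b))
        if beta_sf(a + y, b + m - y, p0) > theta_t:
            pp += w
    return pp

pp_sure = predictive_probability(10, 10, 25, 0.1, 0.9)   # all responders so far
pp_none = predictive_probability(0, 10, 25, 0.3, 0.9)    # no responders so far
```

With 10/10 responses the predictive probability is essentially 1, and with 0/10 against a 0.3 target it is nearly 0; intermediate data produce the intermediate values that drive the interim stop/continue decisions.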
Given p0, p, the prior distribution of the response rate, and the cohort size for interim monitoring, they search for the maximum sample size Nmax and threshold values θL, θT, and θU (θU is usually chosen to be 1.0) for the predictive probability, to yield a design satisfying the type I and type II error rate constraints simultaneously. The smallest Nmax that controls both the type I and type II error rates at the nominal levels is the one to choose. Similar to the Thall et al. approach, the predictive probability design is computationally intensive.

OTHER PHASE II DESIGNS

Bivariate Designs

In some circumstances it is necessary to consider the use of multiple endpoints for determining sample size and stopping guidelines instead of a single endpoint.
For example, in studies of high-dose chemotherapy it may be hypothesized that a more intensive therapy will induce more responders; however, more intensive therapy may also cause more unacceptable adverse events. An increase in the adverse event rate may be acceptable, provided the toxicities are not too severe or are reversible, as long as the higher dose results in an increased response rate. Similar in structure to Simon's two-stage designs, Bryant et al. (9) proposed methods that integrate toxicity monitoring into phase II designs. The trial is terminated at the initial stage if either the number of observed responses is inadequate or the number of observed toxicities is excessive. If there are both a sufficient number of responses and an acceptable toxicity rate after the second stage, then the new agent is considered worthy of further study. The design parameters are determined by minimizing the expected accrual for treatments with unacceptable rates of response or toxicity. Conaway et al. (10) proposed similar designs. A new therapy may be acceptable if it achieves a substantially greater response rate with acceptable toxicity, or a slightly lower response rate with substantially less toxicity. Conaway et al. (11) proposed two-stage designs that allow for early termination of the study if the new therapy is not sufficiently promising, and allow for trade-offs between improvements in activity and increases in toxicity. Thall et al. (12, 13) took a Bayesian approach that allows for monitoring each endpoint on a patient-by-patient basis in the trial. They define for each endpoint a monitoring boundary based on prespecified targets for an improvement in efficacy and an unacceptable increase in the rate of adverse events. Thall et al. (14, 15) proposed another approach for multi-endpoint designs by quantifying a two-dimensional treatment effect parameter for efficacy and safety.
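As a rough illustration of how a Bryant-and-Day-style first stage behaves, the sketch below computes the probability of continuing past stage 1 when the trial stops early for too few responses or too many toxicities. Treating response and toxicity as independent is a deliberate simplification (the published designs model their association), and every number here is hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def p_continue_stage1(n1, min_resp, max_tox, p_resp, p_tox):
    """P(the trial continues past stage 1) for a bivariate stopping rule:
    stop early if responses < min_resp OR toxicities > max_tox among the
    first n1 patients. Response and toxicity are treated as independent
    binomials for simplicity."""
    enough_resp = 1 - binom_cdf(min_resp - 1, n1, p_resp)  # P(X >= min_resp)
    tox_ok = binom_cdf(max_tox, n1, p_tox)                 # P(T <= max_tox)
    return enough_resp * tox_ok

# Hypothetical stage-1 rule: 18 patients, need >= 3 responses, <= 5 toxicities.
good_drug = p_continue_stage1(18, 3, 5, 0.35, 0.15)  # active, tolerable
bad_drug = p_continue_stage1(18, 3, 5, 0.10, 0.40)   # inactive, toxic
```

The comparison shows the intended behavior: an active, tolerable agent almost always continues, while an inactive or overly toxic agent is usually stopped after the first stage.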
Most bivariate designs require both endpoints to be binary, which may not always be the case (i.e., when it is important to distinguish between grades of toxicity). Also, the specified trade-off between response and toxicity may be viewed as too subjective.

Phase II Design Using Time-to-Event Endpoints

All the designs discussed so far focused on a binary endpoint, tumor response, as a measure of efficacy. However, in situations when tumor response is difficult or impossible to evaluate, or when the agents studied are not expected to shrink tumors (i.e., cytostatic agents), response rate may not be an appropriate endpoint to evaluate the efficacy of the new
agent. Endpoints that incorporate information on a time-to-event outcome such as disease-free survival (DFS), progression-free survival (PFS), or overall survival (OS) may be reasonable choices in this situation. The use of PFS estimates at a fixed time point, instead of the response rate, as the primary endpoint was proposed by Van Glabbeke et al. (16). Mick et al. (17) proposed a methodology for evaluating time-to-progression as the primary endpoint in a single-stage design, with each patient's previous response time serving as his or her own control. If it is reasonable to assume that the time-to-event outcome, say survival, follows a known distribution (such as the exponential), then under specified assumptions for accrual and minimum follow-up time, the sample size can be estimated from a one-sample test for median survival (Lawless [18]). Owzar et al. (19) recommended dichotomizing the time-to-event outcome at a clinically relevant landmark over the parametric and nonparametric methods for the design of phase II cancer studies.

Randomized Designs

The effects estimated from single-arm designs are greatly influenced by entry criteria, the definition of response, patient selection bias, and so on. Randomized designs are considered when the aim of a phase II trial is to evaluate two or more regimens concurrently, when adequate historical control data are not available, or to select which of several new agents should be studied further. The standard single-arm paradigm may be inefficient in these settings. The randomized selection design (Simon et al. [20, 21]; Liu et al. [22]; Scher et al. [23]) allows multiple single-arm trials to be conducted at the same time with the same entry criteria. Typically patients are randomized to two or more experimental arms without a control arm. A test for activity using standard criteria for single-arm studies is conducted for each arm, and a selection rule is used to select the best arm(s) for further investigation.
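For a two-arm selection design, the probability of picking the truly better arm can be computed directly from two binomial distributions. The sketch below uses the classic "more responses wins, ties broken at random" selection rule; the per-arm sample size and response rates are hypothetical.

```python
from math import comb

def binom_pmf(n, p):
    """Full binomial probability mass function as a list over k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def p_correct_selection(n, p_best, p_other):
    """Probability that the truly better of two arms ends up with strictly
    more responses after n patients per arm, counting ties as a fair
    coin flip."""
    fb, fo = binom_pmf(n, p_best), binom_pmf(n, p_other)
    win = sum(fb[i] * sum(fo[:i]) for i in range(n + 1))  # better arm ahead
    tie = sum(fb[i] * fo[i] for i in range(n + 1))        # arms tied
    return win + 0.5 * tie

# With 20 patients per arm, a 15-point true difference is much easier to
# detect than a 5-point difference.
big_gap = p_correct_selection(20, 0.35, 0.20)
small_gap = p_correct_selection(20, 0.25, 0.20)
```

This quantifies the weakness noted below: as the true difference between arms shrinks (or more arms are added), the probability of selecting the better arm falls toward chance.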
The advantages of this type of design include decreased patient selection bias and the ability to ensure uniform evaluation criteria in each arm. The weakness is that the probability of selecting the better arm decreases as the difference between the arms decreases and as the number of arms increases (Gray et al. [24]). In a randomized controlled design, patients are randomly assigned to an experimental or control arm and the results obtained from the two arms are compared. Comparison to a control arm is useful when there is little prior information about the expected response rate in a population, or when endpoints such as time to progression (TTP) and PFS, which are influenced by
patient selection, are used. This design format is typical in the phase III setting, but it has been proposed in the phase II setting as well (Herson et al. [25]; Thall et al. [26]; Korn et al. [27]). Compared with a standard single-arm phase II study, a potential weakness of this type of design is the need for a second, larger study before moving on to a phase III study. The randomized discontinuation design (Stadler et al. [28]; Rosner et al. [29]; Ratain et al. [30]) was proposed to select a more homogeneous group of patients and thereby reduce bias. All patients are initially treated with the experimental drug. Patients free of progression at some defined time point are randomized between continuing the experimental drug and receiving a placebo. The effectiveness of the design depends on the accuracy of identifying true treatment responders (those clearly benefiting on the basis of disease stabilization) versus rapid progressors. It may overestimate the treatment benefit and may require a larger sample size compared with other phase II designs. As pointed out by Freidlin et al. (31), with careful planning it can be useful in some settings in the early development of targeted agents where a reliable assay to select patients expressing the target is not available.
DISCUSSION

There are many design methods available for phase II clinical trials. It is important to note that the study objectives should define the choice of design, and not the reverse. If the main objective is to assess clinical response rates, then one can choose from the classic design methods. For a more complicated study, however, a novel design method is recommended. This requires more interaction between the investigators and statisticians, and results in higher quality research. The methods discussed here are only some of those available in the literature, but we hope they will serve as a good starting point when considering the design of phase II trials. Appropriate trial design remains an expanding field of research.
References

1. Rietschel P, Wolchok JD, Krown S, et al. Phase II study of extended-dose temozolomide in patients with melanoma. J Clin Oncol. 2008;26:2299–2304.
2. Gehan EA. The determination of the number of patients required in a preliminary and a follow-up trial of a new chemotherapeutic agent. J Chronic Dis. 1961;13:346–353.
3. Schultz JR, Nichol FR, Elfring GL, Weed SD. Multiple stage procedures for drug screening. Biometrics. 1973;29:293–300.
4. Fleming TR. One-sample multiple testing procedure for phase II clinical trials. Biometrics. 1982;38:143–151.
5. Simon R. Optimal two-stage designs for phase II clinical trials. Control Clin Trials. 1989;10:1–10.
6. Herson J. Predictive probability early termination plans for phase II clinical trials. Biometrics. 1979;35:775–783.
7. Thall PF, Simon R. Practical Bayesian guidelines for phase IIB clinical trials. Biometrics. 1994;50:337–349.
8. Lee JJ, Liu DD. A predictive probability design for phase II cancer clinical trials. Clin Trials. 2008;5:93–106.
9. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics. 1995;51:1372–1383.
10. Conaway MR, Petroni GR. Bivariate sequential designs for phase II trials. Biometrics. 1995;51:656–664.
11. Conaway MR, Petroni GR. Designs for phase II trials allowing for a trade-off between response and toxicity. Biometrics. 1996;52:1375–1386.
12. Thall PF, Simon RM, Estey EH. Bayesian sequential monitoring designs for single-arm clinical trials with multiple outcomes. Stat Med. 1995;14:357–379.
13. Thall PF, Simon RM, Estey EH. New statistical strategy for monitoring safety and efficacy in single-arm clinical trials. J Clin Oncol. 1996;14:296–303.
14. Thall PF, Cheng SC. Treatment comparisons based on two-dimensional safety and efficacy alternatives in oncology trials. Biometrics. 1999;55:746–753.
15. Thall PF, Cheng SC. Optimal two-stage designs for clinical trials based on safety and efficacy. Stat Med. 2001;20:1023–1032.
16. Van Glabbeke M, Verweij J, Judson I, et al. Progression-free rate as the principal endpoint for phase II trials in soft-tissue sarcomas. Eur J Cancer. 2002;38:543–549.
17. Mick R, Crowley JJ, Carroll RJ. Phase II clinical trial design for noncytotoxic anticancer agents for which time to disease progression is the primary endpoint. Control Clin Trials. 2000;21:343–359.
18. Lawless J. Statistical Models and Methods for Lifetime Data. Chap. 3. New York: Wiley & Sons; 1982.
19. Owzar K, Jung S. Designing phase II studies in cancer with time-to-event endpoints. Clin Trials. 2008;5:209–221.
20. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treat Rep. 1985;69:1375–1381.
21. Simon RM, Steinberg SM, Hamilton M, et al. Clinical trial designs for the early clinical development of therapeutic cancer vaccines. J Clin Oncol. 2001;19:1848–1854.
22. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival endpoints. Biometrics. 1993;49:391–398.
23. Scher HI, Heller G. Picking the winners in a sea of plenty. Clin Cancer Res. 2002;8:400–404.
24. Gray R, Manola J, Saxman S, et al. Phase II clinical trials: methods in translational research from the genitourinary committee at the Eastern Cooperative Oncology Group. Clin Cancer Res. 2006;12:1966–1969.
25. Herson J, Carter SK. Calibrated phase II clinical trials in oncology. Stat Med. 1986;5:441–447.
26. Thall PF, Simon R. Incorporating historical control data in planning phase II clinical trials. Stat Med. 1990;9:215–228.
27. Korn EL, Arbuck SG, Pluda JM, et al. Clinical trial designs for cytostatic agents: are new approaches needed? J Clin Oncol. 2001;19:265–272.
28. Stadler WM, Ratain MJ. Development of target-based antineoplastic agents. Invest New Drugs. 2000;18:7–16.
29. Rosner GL, Stadler W, Ratain MJ. Randomized discontinuation design: application to cytostatic antineoplastic agents. J Clin Oncol. 2002;20:4478–4484.
30. Ratain MJ, Eisen T, Stadler WM, et al. Phase II placebo-controlled randomized discontinuation trial of sorafenib in patients with metastatic renal cell carcinoma. J Clin Oncol. 2006;24:2505–2511.
31. Freidlin B, Simon R. Evaluation of randomized discontinuation design. J Clin Oncol. 2005;23:5094–5098.
10 Randomization

Susan Groshen
In the perfect and ideal clinical trial that is designed to compare the effects of two or more interventions, all study participants would be alike and would be the same as all other patients in the target population to which the conclusions will be applied. In this perfect and ideal trial, each intervention or treatment would be delivered to the assigned patients in exactly the same way, the effect would be the same on each patient, and this effect would be measured correctly and exactly. Thus, the only difference from patient to patient would be the real effect of the intervention. In contrast, in a typical oncology trial, patients are heterogeneous, treatment cannot be delivered in a completely reproducible fashion, and the effect of the treatment is not the same from patient to patient. Furthermore, not all factors affecting the outcome can be adequately controlled, and the measurement of effect is often difficult. To deal with the realities of clinical research, as in all scientific research, the design of an experiment involves control, randomization, and replication (1). The objective is to have the patients in the treatment groups be similar, on average, in all important aspects except for the treatment assigned; in this way, observed differences can be attributed to treatment. The first step in designing a clinical trial is to identify sources of patient heterogeneity, treatment variability, and outcome assessment measurement error. To the extent possible, all known sources of heterogeneity and variability should be controlled or limited. For
those sources of heterogeneity that are either not known or that cannot be directly controlled, randomization is used to reduce and limit the impact of nonrandom differences between patients in the treatment arms. Replication is the mechanism by which the impact of random error or noise is controlled; sample sizes are calculated to ensure that there is a high probability of detecting a specified signal-to-noise ratio. Issues of control and replication are discussed in other chapters of this book; the purpose of this chapter is to discuss the application of randomization and to describe some common and effective randomization procedures. In this chapter, the terms treatment and intervention will be used interchangeably, as will patient and study participant. In addition, we will use the phrases "assign each patient to a treatment arm" and "assign a treatment to each study participant" to mean the same thing.

Bias

Bias is the term used to describe systematic error, in contrast to random error. In the context of clinical trials, serious bias arises when systematic (nonrandom) differences in patient characteristics are confused with treatment effect. An illustration of bias (i.e., systematic error) can be found in a registry of patients with muscle-invasive transitional cell bladder cancer who
have undergone radical cystectomy (2). In this series, many patients with involved lymph nodes were treated off protocol with adjuvant chemotherapy. When the survival of patients who received chemotherapy was compared to the survival of patients who did not receive chemotherapy (all with involved lymph nodes), the patients who received adjuvant chemotherapy had inferior survival. This was not because chemotherapy had a detrimental effect on survival, but rather because patients with other unfavorable features (not all easily specified) were more likely to receive adjuvant chemotherapy. Because patients were not randomly assigned to receive chemotherapy, the assignment to chemotherapy depended on factors that are not all readily apparent. The resulting comparison of survival between the chemotherapy and nonchemotherapy groups was biased because the differences observed were not due to the chemotherapy, but rather to patient or disease characteristics. There are other potential sources of bias, such as those that occur during the assessment of response to treatment, that cannot be reduced or eliminated by randomization. Many trials are single-blind (when the patient does not know which treatment he/she is receiving) or double-blind (when the patient and the individual evaluating the response, often the physician, do not know the treatment assignment) in order to reduce potential bias in outcome measurement. Thus even with randomization, there is still a need to exercise as much control as possible in the design of the trial in order to reduce bias. This systematic error, this bias, cannot be overcome or corrected by increasing the sample size. If the cause of the bias can be identified, then possibly the error can be corrected with careful analysis. But this is often not possible, and often the study investigators may not be aware of the bias. And this, finally, is the problem with bias—it is usually undetected. 
One can never say that there is no bias in a trial; one can only say that every precaution was taken to reduce the possibility of bias.

Goals of Randomization

For all the considerations discussed above, trials using randomization to assign treatment to study participants are regarded as the most credible type of investigation for generating experimental data to compare the benefit and safety of therapies for the treatment of cancer. Randomization, properly implemented, achieves two goals. The first goal is to assign a treatment to each study participant. The second, equally important goal of randomization is to reduce the chance of bias (as summarized in Table 10.1). This is done by ensuring that each patient is as likely to be assigned a particular treatment as any other patient. That is, a patient with feature X is just as likely as a patient with feature Y to be assigned to a specific treatment. With randomization correctly executed, there is no systematic preference for assigning one treatment over the other, based either on patient characteristics or on physician preference or (often subconscious) opinion. Randomization does not guarantee that at every stage the treatment arms will be exactly equally balanced in terms of known and unknown patient characteristics that could impact the patient response to treatment; the randomization process decreases the chance that a deleterious imbalance occurs with small numbers of patients, and it has a high chance of achieving balance with large numbers of patients. Thus randomization has the feature that in the long run, the patients in one treatment group will be similar (on average) to patients in another treatment group. It is this long-run balance that permits the conclusion that observed differences between treatment groups are due to treatment. Finally, randomization can confer robustness to the statistical analysis. That is, the set of all possible randomization outcomes provides a context by which to evaluate the observed treatment differences; often this will suggest that standard (parametric) tests can be used to calculate p-values to summarize the strength of the evidence refuting the null hypothesis.

TABLE 10.1
Goals of Randomization
1. Assign treatment to study participants.
2. Ensure that each study participant is as likely to receive each of the trial treatments as any other participant.
3. Ensure that in the long run, on average, the treatment groups are similar in terms of the participants' characteristics.
4. Justify statistical analysis of results.

From a practical perspective, the randomization scheme should be easy to use, it should be unpredictable, and it should be truly random; wherever possible it should assign individual patients and not groups of patients. For example, a scheme that assigns patients registered on Mondays to one treatment and those registered on Wednesdays to another treatment will invite bias. Patients making appointments for Monday may differ from those making appointments for Wednesday (e.g., traveling from out of town and/or attempting to minimize work days missed); this is sometimes called experimental bias. In addition, this form of assignment allows physician preference to intervene; this is sometimes called selection bias. Other forms of systematic assignment of patients to one treatment or another are also vulnerable to (often subconscious) selection bias. Hence, a treatment assignment scheme that assigns patients with a medical record number ending in an even digit to one treatment and all other patients to a second treatment invites patient selection bias if the scheme becomes known. In 1930, the Department of Health for Scotland undertook a nutritional experiment in the schools of Lanarkshire (3). Ten thousand children received free milk for 4 months and 10,000 children were observed as controls. Assignment was done by ballot or alphabetically, although adjustment by the head teacher at each school was allowed if it appeared that "an undue proportion of well fed or ill nourished children" was assigned to one group. At the end of the experiment, it was observed that the control group weighed more than the milk group. Analysis of baseline heights and weights demonstrated that children assigned to the milk group had (on average) weighed less both at the start of the experiment and at the end, although they had gained more weight during the 4 months of the study. This example demonstrates the bias introduced by personal selection of subjects for treatment or control; randomization is needed to ensure that assignment is free of bias and to maximize the likelihood that the comparison groups are similar on average.
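A randomization scheme with the practical properties described earlier (easy to use, truly random, and balanced over time) is often implemented with randomly permuted blocks. The following is a minimal sketch with hypothetical arm labels and block size, not a production randomization system.

```python
import random

def permuted_block_sequence(n_patients, arms=("A", "B"), block_size=4, seed=2024):
    """Generate treatment assignments in randomly permuted blocks, so the
    arms are exactly balanced at the end of every completed block."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)      # copies of each arm per block
    assignments = []
    while len(assignments) < n_patients:
        block = list(arms) * per_arm       # e.g., ["A", "B", "A", "B"]
        rng.shuffle(block)                 # random order within the block
        assignments.extend(block)
    return assignments[:n_patients]

seq = permuted_block_sequence(20)
```

Blocking guarantees near-balance at any interim look; one caveat is that in an unblinded trial a fixed, known block size makes the last assignment in each block predictable, which is why variable block sizes are often used in practice.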
Deciding Whether to Randomize Two issues must be considered in deciding whether a planned trial should include randomization. One consideration is scientific: can the potential problem of bias be managed without randomization? The other consideration is ethical: will it be ethical to randomly assign one of several treatments to a study participant, rather than use medical opinion? There are disadvantages and logistic difficulties to using randomization to assign treatment. These are not usually insurmountable, but they can complicate the conduct of the trial. Explaining the choice of treatments and the randomization process to patients is difficult to do well; many patients, as well as their physicians, would rather that the treating physician select the treatment option thought or believed to be best for the patient. As a result, some patients will not enroll in a trial that involves randomization, resulting in slower accrual and longer-lasting studies. To mitigate the reluctance of patients to enroll in a trial with a not-
yet-determined treatment, Zelen introduced the randomized consent design (4). In this design, the treatment assignment is obtained (by randomization) prior to the consent process, and the patient is presented with the study knowing the treatment assignment. The goal of this design was to increase participation; however, if a sizable portion of patients decline participation after randomization, then interpretation of the results will be more complicated and possibly compromised (see the discussion on intention-to-treat later in the chapter). As a result, this approach has not been used often. Acceptance of randomization remains an obstacle for many patients and physicians. Randomized trials are recognized as the most scientifically sound mechanism for establishing which of two or more treatments is superior. This is especially true in oncology clinical research, where phase III trials aim to establish standard of care options, and therefore require a control group or comparator in order to compare the new therapeutic regimen to a standard of care option (i.e., the control). In this setting, historical controls or concurrent nonrandomized controls are almost always unacceptable due to vulnerability to substantial bias. Because for most cancers, therapies now exist that offer palliation if not cure, differences needed to establish superiority or equivalence will be incremental and not large; treatment effect differences may not be substantially larger than differences resulting from changes in referral patterns or changes in ancillary care procedures. In addition, supportive care and diagnostic procedures have evolved substantially over the last decades and continue to do so. Hence, the use of historical series is (at best) problematic and often invalid; concurrent, nonrandomized controls are usually subject to even more bias than historical controls. 
The onus will be on the study investigators to convincingly demonstrate to the skeptical reviewer that differences observed between the new treatment and the historical control group (or nonrandomized concurrent control group) are not due to bias. This will be (at best) difficult. Thus, definitive trials that aim to change standard of care, or convincingly compare two or more standard treatment options, should incorporate concurrent controls and randomization. While the scientific justification for randomly assigning treatment to patients is rarely disputed, the ethical justification is not straightforward. The term clinical equipoise provides the ethical justification for clinical trials with randomization. Clinical equipoise is satisfied when there is “genuine uncertainty in the expert medical community . . . about the preferred treatment” (5). This occurs when there are no clear and definitive data supporting or refuting the hypothesis of benefit or superiority and individual clinicians and
ONCOLOGY CLINICAL TRIALS
clinical investigators are either unsure or have conflicting opinions regarding the new treatment. It is not the purpose of this chapter to review all the ethical arguments for and against using randomization. However, these issues must be carefully discussed before initiating a randomized trial, by evaluating the potential harm to study participants and weighing the risks and benefits of undertaking the study to study participants and to society. Much has been written on this subject (6–8).
PRACTICAL CONSIDERATIONS

There are many practical and logistic issues that must be resolved in order to correctly and effectively incorporate randomization into a clinical trial. To begin with, randomization should not be done by involved investigators; a separate office (possibly with telephone randomization) should be used. A document should be created and saved for each randomization, indicating the time and date of the randomization, the individual initiating the randomization, the pertinent patient information, and the randomization outcome.

Timing of Randomization

Randomization should not take place until after the patient is confirmed to be eligible and has signed the informed consent. In addition, randomization should be delayed until the latest practical moment before the intervention is to begin (see Durrleman and Simon for a discussion of this issue (9)). In general, it is reasonable for the intervention to begin within 1 or 2 weeks of randomization. This narrow window reduces the likelihood that intervening events will cause the patient to discontinue trial participation or fail to adhere to the treatment schedule as required. Noncompliance and dropouts can introduce bias and may complicate the final interpretation of the results. In 1991, the Children’s Cancer Group began a trial for the treatment of children with high-risk neuroblastoma (10). All patients were treated with the same initial regimen of chemotherapy. Those without disease progression at the completion of the initial therapy were then randomly assigned to receive either three cycles of intensive chemotherapy, or to receive myeloablative therapy with autologous bone marrow rescue. Patients who completed the second phase of cytotoxic therapy without disease progression were then randomized to no further treatment or 6 months of 13-cis-retinoic acid. In this study, both randomizations were delayed until patients were ready to begin the next phase of treatment.
Had randomization taken place prior to the start of the initial chemotherapy with
all 539 eligible patients randomized, then only 379 of the 539 patients would have begun the three cycles of intensive chemotherapy or myeloablative therapy, with 160 patients not receiving the randomly assigned treatment. Only 258 of the initial 539, who were still free of progression after all cytotoxic therapy, would have been able to comply with the second randomization. By delaying the randomization until patients were ready to begin the treatment, the investigators clearly defined the appropriate comparison groups and eliminated (upfront) those patients who would not contribute meaningful data to the planned comparisons.

Intention-to-Treat Analysis

In a randomized trial, an intention-to-treat (ITT) analysis is one in which all eligible patients who are randomized are included in the analysis of the results and are classified according to the treatment assigned, regardless of whether this was the treatment actually received. The ITT analysis should be the primary analysis in a randomized trial designed to show the superiority of one regimen. The Children’s Cancer Group, by delaying the randomization, was able to focus the ITT analysis on those patients who were still able to benefit from the second phase or third phase of therapy. When some of the randomized patients fail to receive the assigned treatment, there is no correct way to include or exclude these patients in the analysis. Nor is there a way to correctly adjust the statistical analysis. This is because the outcome of these patients, had they received the assigned treatment, is unknown; furthermore, it is often the case that these patients are different from those who received the assigned treatment, but it is not usually known how they differ. In the presence of noncompliance, any analysis will lead to a biased estimate of the true treatment difference.
The ITT analysis is advocated in the case of trials designed to establish a difference (that one treatment arm is superior) because the bias, to the degree that it exists, will lead to an underestimate of the magnitude of the treatment difference. With other approaches, such as the as-treated analysis or inclusion of only the compliant patients, the impact of the bias is not known: the resulting observed treatment difference could be either too large or too small. The ITT analysis is known to be a conservative strategy, and if the ITT results indicate a clear difference (i.e., statistically significant at the planned level), then the results can be considered conclusive. If the ITT analysis does not result in a clear (statistically significant) difference, but the as-treated or another analysis does, then interpretation of the results will be inconclusive.
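The grouping difference between an ITT and an as-treated analysis can be sketched in Python; the records, treatment labels, and binary outcomes below are hypothetical and for illustration only:

```python
def itt_groups(records):
    """Group outcomes by the treatment ASSIGNED at randomization,
    regardless of the treatment actually received (intention-to-treat)."""
    groups = {}
    for r in records:
        groups.setdefault(r["assigned"], []).append(r["outcome"])
    return groups

def as_treated_groups(records):
    """Group outcomes by the treatment actually RECEIVED (as-treated)."""
    groups = {}
    for r in records:
        groups.setdefault(r["received"], []).append(r["outcome"])
    return groups

# Hypothetical data: the third patient was assigned B but received A.
records = [
    {"assigned": "A", "received": "A", "outcome": 1},
    {"assigned": "B", "received": "B", "outcome": 0},
    {"assigned": "B", "received": "A", "outcome": 1},
]
itt = itt_groups(records)        # crossover counted in arm B (as assigned)
at = as_treated_groups(records)  # crossover counted in arm A (as received)
```

The single crossover patient changes the composition of both arms depending on which grouping rule is used, which is exactly why the two analyses can disagree.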
Balanced versus Unbalanced Randomization

Generally, it is most efficient to assign the same number of patients to each of the treatment groups under comparison in the clinical trial. That is, for the same total number of patients, the power will be greatest when the treatment arms have equal numbers. On some occasions, however, it may be appropriate to assign more patients to one treatment arm. In some trials, more patients will be required for secondary objectives; additional studies may be performed on patients assigned to one of the treatments. In randomized pilot studies, there may be a need to obtain more data on the new or experimental arm, while sufficient experience exists for the arm that represents standard of care. On occasion, one treatment may be substantially more expensive than the other. Sometimes, there might be more variability associated with one arm compared to the other, and in this case it would be more efficient to randomize more patients to the treatment with greater variability. In a related situation, Sposto and Krailo discuss unequal allocation when the primary outcome is time to an event and fewer events are expected in one arm (11). It is not appropriate, however, to design a study with fixed unequal allocation that will assign fewer patients to an arm that is thought to be inferior; this would suggest that clinical equipoise is not satisfied.

METHODS OF RANDOMIZATION

There are many ways to randomly assign treatments to patients (12). Kalish and Begg provide a comprehensive review of different schemes that are available for randomization in a clinical trial (13). In this section, several methods will be described; simple randomization and the permuted block design are by far the most commonly used methods, but the others can be very useful in selected situations and are presented briefly to indicate the range of randomization options. For illustration purposes, it will be assumed that there will be two treatments (A and B), and that a balanced allocation will be used (i.e., 50% of patients will be assigned treatment A and 50% will be assigned treatment B). However, each of the methods presented can easily be adapted to three or more treatment arms and to unequal allocation. Computer programs to generate the random treatment assignments exist and are easy to create. For many of the randomization designs, a list can be prepared in advance, although this is not feasible for adaptive or minimization randomization schemes.

Simple Randomization (Completely Randomized Design)

Simple randomization is equivalent to flipping a coin or rolling a die every time a patient is randomized. If the coin is fair (i.e., the probability of getting a head is equal to getting a tail), Pr{head} = Pr{tail} = 1/2, then the randomization is balanced. With this form of randomization, the outcome (heads or tails) does not depend on the number of previously randomized patients or their characteristics. That is, if the last five flips all resulted in heads, the probability that the next flip is a head is still 1/2. Although computers can quickly create a list of treatment assignments using simple randomization, the table of random digits in Table 10.2 is used here to illustrate the process.
TABLE 10.2
Three Hundred Random Digits.
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A     5  6  1  0  6  2  1  8  7  9  4  2  4  5  2  8  2  1  0  4
B     0  8  7  4  5  6  2  5  3  9  2  4  6  5  8  7  5  7  3  4
C     1  3  3  8  1  9  2  5  8  4  9  5  1  3  6  7  1  7  4  1
D     6  1  2  8  7  8  1  6  8  4  5  8  0  6  9  3  5  4  1  6
E     7  9  5  0  1  8  2  4  7  5  4  7  7  2  9  8  2  1  9  0
F     4  8  8  3  5  0  2  2  3  4  9  8  8  8  0  4  1  7  9  9
G     5  6  0  7  4  4  5  2  3  1  8  3  5  0  9  5  2  2  9  4
H     9  1  3  0  2  7  1  7  3  9  4  9  3  8  4  7  7  0  3  7
I     0  9  8  9  0  9  2  5  0  0  5  7  7  6  1  7  6  9  4  1
J     3  9  5  8  4  2  3  2  2  1  7  4  4  9  9  0  4  3  4  8
K     6  8  3  5  7  6  7  6  2  1  5  9  9  4  5  0  7  7  9  0
L     2  8  6  8  6  6  1  9  6  9  2  6  5  7  5  9  8  1  9  7
M     6  6  5  8  8  5  1  6  0  2  6  1  0  4  8  1  0  1  6  3
N     7  6  3  5  2  0  7  5  7  8  3  4  1  7  5  1  6  0  3  9
O     7  8  7  0  1  2  3  9  4  0  7  4  8  6  9  4  1  3  7  3
Table 10.2 contains 300 random digits; that is, each of the 10 digits (0 to 9) appears in about 10% of the entries in the table, in no order. If even digits (0, 2, 4, 6, 8) are used to correspond to treatment A and odd digits (1, 3, 5, 7, 9) correspond to treatment B, then Table 10.2 can be used to assign treatment to 15 patients. First, it is necessary to identify a starting point in Table 10.2, and then read a sequence of 15 digits originating at the starting point. Selection of the starting point should also be done randomly. If the starting point is column 8, row A, and 15 digits down column 8 are read, then the assignment of treatment to 15 patients is given in Figure 10.1. In this case, treatment A was assigned to 8 patients and treatment B was assigned to 7 patients. The features of simple randomization can be summarized as:

· completely unpredictable (eliminates selection bias)
· eliminates experimental bias
· in the short run, can lead to serious imbalance
· rarely exactly balanced, but balance is good in the long run
· procedure is flexible and easy
· resulting statistical analysis is easy

Biased Coin Randomization (14)

This randomization scheme begins with simple randomization: for each new patient, the probability of being assigned treatment A is 1/2. If the number of patients assigned to A minus the number of patients assigned to B reaches some prespecified number D (or −D), then the probability of assignment to A is changed. If D more patients have been assigned to A, then the probability of assigning A to the next patient is changed to p, where p < 1/2. If D more patients have been assigned to treatment B, then the probability of assigning A to the next patient is changed to 1 − p, where 1 − p > 1/2. In general, p = 1/3 is reasonable; smaller values will result in assignments that are too predictable. The choice of D should depend on the total number of patients planned and the timing of interim analyses. The features of the biased coin randomization can be summarized as:

· largely unpredictable (eliminates selection bias)
· eliminates experimental bias
· better balance in the short run
· rarely exactly balanced, but close to exactly balanced; balance is good in the long run
· flexible but a little more complicated to set up
· resulting statistical analysis can be somewhat more complicated
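The biased coin scheme can be sketched in Python; D = 3 and p = 1/3 below are illustrative choices (the text notes p = 1/3 is generally reasonable, and D depends on the planned study size):

```python
import random

def biased_coin_sequence(n, D=3, p=1/3, rng=None):
    """Biased coin randomization: assign A with probability 1/2 while the
    arms are within D of each other; once one arm is ahead by D or more,
    assign that arm with the reduced probability p (< 1/2)."""
    rng = rng or random.Random()
    assignments = []
    n_a = n_b = 0
    for _ in range(n):
        diff = n_a - n_b
        if diff >= D:
            prob_a = p        # A is ahead: make A less likely
        elif diff <= -D:
            prob_a = 1 - p    # B is ahead: make A more likely
        else:
            prob_a = 0.5
        arm = "A" if rng.random() < prob_a else "B"
        assignments.append(arm)
        if arm == "A":
            n_a += 1
        else:
            n_b += 1
    return assignments

seq = biased_coin_sequence(100, rng=random.Random(0))
```

Unlike simple randomization, long runs of one treatment are actively discouraged once the imbalance reaches D, which is the source of the better short-run balance noted above.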
Replacement Randomization It is possible that the randomization scheme described above will produce substantial imbalance in the early part of the trial. A simple fix to avoid this problem is to specify a requirement in advance. For example, a possible requirement might be: in the first 40 patients, the difference in the number of patients assigned to receive A versus assigned to receive B should not exceed 8. The randomization list is prepared in advance; if the list does not fulfill the requirement, the entire list is rejected and another randomization list is prepared. As long as the entire randomization list is replaced and as long as this is done before the first patient is enrolled, this is one reasonable way to overcome the potential short-run imbalance of the simple randomization scheme.
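The replacement scheme can be sketched in Python; the example requirement from the text (imbalance of at most 8 within the first 40 patients) is checked here at every point along the list, which is one possible reading of the requirement:

```python
import random

def replacement_randomization(n, first=40, max_diff=8, rng=None):
    """Replacement randomization: prepare a simple-randomization list in
    advance; if the A-versus-B imbalance ever exceeds max_diff within the
    first `first` assignments, discard the entire list and generate a new
    one. The thresholds mirror the example requirement in the text."""
    rng = rng or random.Random()
    while True:
        candidate = [rng.choice("AB") for _ in range(n)]
        diff, acceptable = 0, True
        for arm in candidate[:first]:
            diff += 1 if arm == "A" else -1
            if abs(diff) > max_diff:
                acceptable = False
                break
        if acceptable:
            return candidate

rand_list = replacement_randomization(100, rng=random.Random(1))
```

Because whole lists, not individual assignments, are rejected, each accepted list is still a valid simple-randomization sequence subject only to the prespecified constraint, and the replacement must happen before the first patient is enrolled.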
Permuted Block Design

The permuted block design is one of the most commonly used methods for randomly assigning treatment to patients. The advantage of this method is that it forces exact balance at very regular intervals during the accrual phase of the trial, and not just at the end. To construct a randomization list using the permuted block design, the first step is to select the block size, which will be a multiple of K (2K, 3K, 4K, etc.), where K is the number of treatments in the trial. To illustrate this scheme for K = 2, block sizes of 6 = 3 × 2 will be used. In a block of 6 units (or patients), there
Pt. Seq.:      1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Random Digit:  8  5  5  6  4  2  2  7  5   2   6   9   6   5   9
Treatment:     A  B  B  A  A  A  A  B  B   A   A   B   A   B   B

FIGURE 10.1 Assignment of treatment to 15 patients using simple randomization.
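The digit-to-treatment mapping in Figure 10.1 can be reproduced in a line of Python (the digits are read down column 8, rows A through O, of Table 10.2):

```python
# Even digits -> treatment A, odd digits -> treatment B.
digits = [8, 5, 5, 6, 4, 2, 2, 7, 5, 2, 6, 9, 6, 5, 9]
treatments = ["A" if d % 2 == 0 else "B" for d in digits]
# treatments -> ['A','B','B','A','A','A','A','B','B','A','A','B','A','B','B']
# i.e., 8 patients assigned A and 7 assigned B, as in the figure.
```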
are 20 different orders in which the 2 treatments can be assigned with each treatment assigned to 3 patients (e.g., AAABBB or ABBAAB or ABABBA, etc.). There are two ways to construct the randomization list. The first way is to randomly select blocks from a list of all 20 possible blocks; this is probably the fastest method. The second way is to randomly assign treatment to patients in groups of 6. This second method is demonstrated below in Figure 10.2 for 30 patients, which will require 5 blocks with 6 patients per block. Taking random digits from Table 10.2, using row F for Block #1, row G for Block #2, through row J for Block #5, six digits will be read across each row, beginning in column 11. As before, even digits will correspond to treatment A and odd digits to treatment B. At the end of each block, both treatment arms have exactly the same number of patients assigned. This short-run balance, and ease of implementation, are very favorable characteristics of this method of randomization. However, this randomization scheme is not as unpredictable as simple randomization, replacement randomization, or the biased coin method since, if the block size is known, one could anticipate the assignment of the last patient in the block. For this reason, a variation on this randomization scheme is frequently adopted in which the block sizes are also varied. For example, with K = 2 treatments, a block size of 4, 6, or 8 could be randomly chosen. The features of the permuted block design randomization can be summarized as:
· largely unpredictable, except at end of each block (can vary block size to decrease predictability)
· eliminates experimental bias
· very good balance in the short run
· closely balanced at end (exactly balanced if total number of patients is a multiple of the block size)
· flexible but a little more complicated to set up
· resulting statistical analysis is easy

Block 1   Pt. Seq.:    1   2   3   4   5   6
          Random #:    9   8   8   8  (0) (4)
          Treatment:   B   A   A   A   B   B

Block 2   Pt. Seq.:    7   8   9  10  11  12
          Random #:    8   3   5   0   9  (5)
          Treatment:   A   B   B   A   B   A

Block 3   Pt. Seq.:   13  14  15  16  17  18
          Random #:    4   9   3   8   4  (7)
          Treatment:   A   B   B   A   A   B

Block 4   Pt. Seq.:   19  20  21  22  23  24
          Random #:    5   7   7  (6) (1) (7)
          Treatment:   B   B   B   A   A   A

Block 5   Pt. Seq.:   25  26  27  28  29  30
          Random #:    7   4   4   9   9  (0)
          Treatment:   B   A   A   B   B   A

FIGURE 10.2 Assignment of treatment to 30 patients using the permuted block design.

Note: The random digits shown in parentheses were not used to determine the assignment; those treatment assignments were required to ensure that 3 patients were assigned to A and 3 to B within each block.

Adaptive Randomization for Treatment Assignment

In this class of treatment assignment schemes, the probability that the next patient is assigned to treatment A or B will depend on the outcomes of the previously randomized and treated patients. It is for this reason that the methods are called adaptive, since the probability of assigning treatment A will be adapted to the data observed to date. Zelen first described the adaptive “play the winner” strategy in 1969 (15) for clinical trials; the statistical properties of these designs are reviewed by Simon (16). These randomization schemes are somewhat controversial, with proponents citing that in the long run, patients are more likely to be assigned to the regimen with the better outcome. Concerns regarding these designs involve practical and ethical issues. From a practical perspective, these designs can lead to early imbalance in treatment assignment, which may compromise the ability to perform an adequate comparison; that is, they are not efficient in terms of estimating the difference between treatments. In addition, for many of these designs, the treatment assignments can quickly become nearly completely predictable, exposing the trial to selection bias. Those who cite ethical concerns state that if clinical equipoise is not maintained, then it is not appropriate to assign even a few patients to the inferior treatment; that is, one should be prepared to use balanced randomization or not randomize at all. Trials of neonatal extracorporeal membrane oxygenation (ECMO) used adaptive randomization and provide a fascinating discussion of the issues involved with randomization (in general) and adaptive randomization (in particular) (6). On a practical note, adaptive designs may not be feasible in many oncology trials, in which time-to-event outcomes such as survival or time to recurrence may not be observed for several years. In general, these designs are not used in oncology clinical trials (16).
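Returning to the permuted block design described earlier: the block-by-block construction can be sketched in Python. Shuffling three A's and three B's within each block is equivalent to randomly selecting one of the 20 possible orderings:

```python
import random

def permuted_block_list(n_blocks, block_size=6, arms="AB", rng=None):
    """Permuted block randomization for two arms: each block holds
    block_size/2 copies of each arm in a random order, so the arms are
    exactly balanced at the end of every block."""
    rng = rng or random.Random()
    per_arm = block_size // len(arms)
    assignments = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm
        rng.shuffle(block)
        assignments.extend(block)
    return assignments

# 30 patients in 5 blocks of 6, as in Figure 10.2 (a different random order).
seq = permuted_block_list(5, 6, rng=random.Random(2))
```

The variable-block-size variant mentioned in the text would simply draw the block size at random from {4, 6, 8} inside the loop.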
STRATIFICATION AND RANDOMIZATION

In most clinical trials, there are patient and tumor characteristics which are known to influence the response to treatment. Randomization can be expected (in the long run) to yield treatment arms that are balanced in terms of these characteristics. Furthermore, statistical methods exist to adjust for these factors at the time of the final analysis. However, this may not be sufficient: interim analyses are planned, studies are terminated early, and sometimes “in the long run” implies a very large trial. Hence, it is often desirable to control the randomization process to increase balance across the treatment arms in terms of these patient characteristics. This is done by stratification. While often successfully used, stratification can lead to loss of power and efficiency if not done properly. The patient and tumor characteristics used for stratification should be available at the time of randomization and based on objective data, free of interpretation. Care should be taken not to overstratify and create too many subsets of patients with very small numbers. The statistical analyses should incorporate the stratification that was used at the time of randomization. Byron Brown (17) summarized the rationale for stratification with “. . . there is much to be gained in persuasiveness or credibility by presentation of data that show the numbers of patients assigned to the several treatments to be closely balanced with regard to the variables commonly felt to be related to the course of the disease and the response to treatment. No amount of poststratification and covariance analysis, particularly if the techniques used are complex in nature, will be as convincing as the demonstration that the groups were balanced in the beginning. . . .”

Simple Stratification

Simple stratification works well when there are one or two patient or disease characteristics and these can each be grouped into two to four categories. The
stratification variables will be cross-classified to create strata. For example, if there are two stratification variables, one with two classes (e.g., good performance status vs. poor performance status) and one with three classes (e.g., no prior chemotherapy, prior chemotherapy but no taxanes, prior treatment with a taxane), then there will be a total of 6 = 2 × 3 strata. Within each stratum, a separate randomization list (on paper or in the computer) will be created to assign treatment to the patients in that stratum. Operationally, each patient is first classified into one stratum and then randomized within that stratum. In the neuroblastoma example, at the first randomization, patients were stratified according to whether they had metastatic disease or not; at the second randomization, patients were stratified according to the treatment they had been assigned at the first randomization (10). A permuted block design was used for random treatment assignment within each stratum. The replacement, biased coin, and permuted block randomization schemes can all be easily adapted to this type of stratification. Balance (or near balance) should be achieved within each stratum. While there are no set rules for the maximum number of strata permitted, practical considerations would suggest that there should be on average at least 10 patients per stratum, but at least 20 is better.

Adaptive Randomization to Minimize Imbalance: Minimization Randomization

Stratified randomization (as described above) is attractive for its simplicity and general effectiveness within each stratum. However, this approach can lead to troublesome imbalance across strata during the early stages of a trial. Consider the hypothetical example in Figure 10.3, where there are 3 strata, 2 treatments, and, within each stratum, a permuted block design with block size 4 is used.
Stratum 1:  B  A  B  A  B  B  A* A
Stratum 2:  A  B  B  A* A  B  A  B
Stratum 3:  B  B* A  A  B  A  B  A

*Treatment assignments for future patients. Only 6, 3, and 1 patients randomized in strata 1, 2, 3, respectively.

FIGURE 10.3 Example of cumulative imbalance across strata with simple stratification.

In Figure 10.3, after the first 10 patients, with 6, 3, and 1 patients in stratum 1, 2, and 3, respectively, 3 patients have been assigned A and 7 have been assigned B. Within each stratum, there is only slight imbalance, but the cumulative effect is more pronounced. This example highlights the potential problem with simple stratification when there are many strata, especially early in the trial. In 1974, Taves published a method to minimize imbalance over the entire study, and within each of the subgroups, dictated by the individual patient characteristics (i.e., each stratification variable) separately, but not within the strata formed by cross-classifying all the stratification variables (18). His method was deterministic and did not incorporate randomization. Pocock and Simon independently and subsequently published an extension of the Taves method which included randomization (19). Their method is illustrated with the following simplified example. Suppose that 50 patients have already been enrolled and randomized in a trial with 2 treatments and 3 stratification variables giving rise to 12 strata, as displayed in Figure 10.4(A). The minimization method considers the stratification variables separately, as displayed in Figure 10.4(B). Figure 10.4(C) shows the impact on the treatment assignment balance when the next patient, a male with low performance status and no prior therapy, is assigned to either A or B. In Figure 10.4(C), the imbalance for each level of each stratification variable is calculated (bottom line). A summary of the imbalance is then calculated either by squaring the differences and adding them or by taking the absolute values and adding them. Squaring the differences will put greater weight on large differences; taking the absolute value tends to weight all the differences more equally. Using the squared differences, if the next patient is assigned to A, the summary of imbalance becomes (−3)² + (−1)² + (0)² = 10; if this next patient is instead assigned to B, the summary of imbalance becomes (−5)² + (−3)² + (−2)² = 38. Assigning this patient to receive treatment A will minimize the imbalance at this point. The Taves
procedure would assign the treatment A; the Pocock-Simon approach would randomly assign treatment with probability p* > 1/2 for treatment A. Variations exist: stratification variables can be weighted and a term can be added that accounts for the overall balance across all strata. With available computer programs, both the simple stratification and the minimization randomization are easy to implement. Simple stratification achieves balance within each of the strata, while minimization is more likely to achieve balance across the stratification variables and overall.
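The imbalance calculation in this worked example can be sketched in Python. This is only the scoring step; the full Pocock-Simon procedure would then assign the arm with the smaller score with probability p* > 1/2. The variable and level names are illustrative:

```python
def imbalance_if_assigned(counts, patient, arm):
    """Sum of squared (A - B) differences over the new patient's level of
    each stratification variable after a hypothetical assignment.
    counts[variable][level] holds the current (n_A, n_B) pair."""
    total = 0
    for variable, level in patient.items():
        n_a, n_b = counts[variable][level]
        if arm == "A":
            n_a += 1
        else:
            n_b += 1
        total += (n_a - n_b) ** 2
    return total

# Marginal counts for the relevant levels after 50 patients (Figure 10.4(B)).
counts = {
    "gender": {"male": (10, 14)},
    "prior_therapy": {"no": (19, 21)},
    "performance": {"low": (11, 12)},
}
new_patient = {"gender": "male", "prior_therapy": "no", "performance": "low"}

imbalance_if_assigned(counts, new_patient, "A")  # 10, as in the text
imbalance_if_assigned(counts, new_patient, "B")  # 38
```

Replacing the squared term with `abs(n_a - n_b)` gives the absolute-value summary mentioned in the text, which weights all differences more equally.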
Gender: Males
                 Prior Therapy: No     Prior Therapy: Yes
                 H    M    L           H    M    L
Assigned to A:   2    2    4           1    1    0
Assigned to B:   2    4    5           1    2    0

Gender: Females
                 Prior Therapy: No     Prior Therapy: Yes
                 H    M    L           H    M    L
Assigned to A:   2    4    5           0    2    2
Assigned to B:   1    2    7           0    1    0

*H = High (KPS = 100), M = Middle (KPS = 90–80), L = Low (KPS = 70–60)

FIGURE 10.4(A) Status of trial with 3 stratification variables after 50 patients—number of patients assigned to treatments A and B within each of the 12 strata.

                                A     B
Gender:              Male      10    14
                     Female    15    11
Prior Therapy:       No        19    21
                     Yes        6     4
Performance Status:  High       5     4
                     Mid.       9     9
                     Low       11    12

FIGURE 10.4(B) Balance of treatment assignments within each of the stratification variables—number of patients assigned to A or B.

                            Gender:      Prior          Performance
                            Male         Therapy: No    Status: Low
New Patient Assigned to:    A     B      A     B        A     B
New # in A:                11    10     20    19       12    11
New # in B:                14    15     21    22       12    13
Difference: A − B          −3    −5     −1    −3        0    −2

FIGURE 10.4(C) Impact of assigning the next patient (male, no prior therapy, and low performance status) to A or B.

SUMMARY

Randomized clinical trials remain the most robust and credible method for generating data to formally compare treatments for patients with cancer. In planning the technical aspects of the randomization during the design of the clinical trial, the stratification variables and the randomization scheme must be selected. Choice of the number of stratification variables represents a trade-off between controlling as much as possible and striving for simplicity to the extent that it is possible. Criteria for using a stratification variable in the design should include the strength of the association between the variable and the outcome measures, its known ability to affect the response to treatment, and the reliability with which it can be measured prior to start of treatment. If there are only 1 or 2 stratification variables and the final study size will be
large, then simple stratification will be sufficient. In this case, randomization using the permuted block design is both easy and effective. If there are many stratification variables, or if there will be early interim analyses, or if the study is relatively small, then minimization randomization will be more effective for achieving short-term balance across the stratification variables. All three elements of trial design are essential: control, randomization, and replication. Properly designing and executing the randomization is critical for the success of the trial.
References

1. Fisher RA. The Design of Experiments. Edinburgh: Oliver and Boyd; 1935.
2. Stein JP, Lieskovsky G, et al. Radical cystectomy in the treatment of invasive bladder cancer: long-term results in 1,054 patients. J Clin Oncol. 2001;19:666–675.
3. Student. The Lanarkshire milk experiment. Biometrika. 1931;23:398–406.
4. Zelen M. Randomized consent designs for clinical trials: an update. Stat Med. 1990;9:645–656.
5. Freedman B. Equipoise and the ethics of clinical research. N Engl J Med. 1987;317:141–145.
6. Royall RM. Ethics and statistics in randomized clinical trials. Statist Sci. 1991;6:52–88.
7. Temple R, Ellenberg SS. Placebo-controlled trials and active-controlled trials in the evaluation of new treatments. Part 1: ethical and scientific issues. Ann Intern Med. 2000;133:456–464.
8. Ellenberg SS, Temple R. Placebo-controlled trials and active-controlled trials in the evaluation of new treatments. Part 2: practical issues and specific cases. Ann Intern Med. 2000;133:464–470.
9. Durrleman S, Simon R. When to randomize? J Clin Oncol. 1991;9:116–122.
10. Matthay KK, Villablanca JG, Seeger RC, et al. Treatment of high-risk neuroblastoma with intensive chemotherapy, radiotherapy, autologous bone marrow transplantation, and 13-cis-retinoic acid. N Engl J Med. 1999;341:1165–1173.
11. Sposto R, Krailo MD. Use of unequal allocation in survival trials. Stat Med. 1987;6:119–125.
12. Pocock SJ. Allocation of patients to treatment in clinical trials. Biometrics. 1979;35:183–197.
13. Kalish LA, Begg CB. Treatment allocation methods in clinical trials: a review. Stat Med. 1985;4:129–144.
14. Efron B. Forcing a sequential experiment to be balanced. Biometrika. 1971;58:403–417.
15. Zelen M. Play the winner rule and the controlled clinical trial. J Amer Stat Assoc. 1969;64:131–146.
16. Simon R. Adaptive treatment assignment methods and clinical trials. Biometrics. 1977;33:743–749.
17. Brown BW Jr. Statistical controversies in the design of clinical trials—some personal views. Control Clin Trials. 1980;1:13–27.
18. Taves DR. Minimization: a new method of assigning patients to treatment and control groups. Clin Pharmacol Ther. 1974;15:443–453.
19. Pocock SJ, Simon R. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics. 1975;31:103–115.
11
Design of Phase III Clinical Trials
Stephen L. George
A phase III clinical trial is a randomized prospective controlled study designed to compare the efficacy of two or more regimens for the treatment of a specified disease or medical condition. These trials employ accepted scientific principles of good experimental design including, among other things, specification of eligibility criteria (types of patients appropriate for study), explicit statements of primary and secondary objectives, details of the treatment regimens to be compared, and statistical considerations (hypotheses tested, sample size and expected duration of the trial, statistical procedures, interim analysis plans, and related topics). A properly designed and executed phase III clinical trial provides the best available scientific evidence on the relative efficacy of the treatments being studied and the most reliable information for evidence-based medicine. The adoption and widespread use of phase III clinical trials in the latter half of the twentieth century and the early twenty-first century represent one of the more important contributions to the practice of scientific medicine during the last 60 years. The statistical aspects of the design, conduct, and analysis of clinical trials have been extensively studied during this time and there are now many textbooks (including this one) covering this material at various levels of statistical sophistication. Other chapters in the current text cover important topics in the design and analysis of phase III clinical trials, including selecting endpoints, randomization and stratification, interim analysis, adaptive
design, and Bayesian designs. The focus in the present chapter will be on determining the required sample size (number of patients or number of events) and duration of a phase III clinical trial in many commonly encountered practical situations (1, 2). In an attempt to provide maximum clarity for the underlying concepts, free of unnecessary complexity, the situations considered are elementary ones. References to papers covering more complex scenarios are provided where appropriate.
CANONICAL SAMPLE SIZE FORMULAE

Testing Hypotheses

The sample size considerations in this chapter are derived from a statistical hypothesis-testing perspective, usually involving a test of a null hypothesis, H0, against an alternative hypothesis, H1. In the simplest case, suppose the outcome variable (endpoint) of a clinical trial comparing two treatments is some continuous random variable, X, and we wish to compare the mean value of X for the two treatments. The usual null and alternative hypotheses in this case may be expressed as

H0: μ1 = μ2 vs. H1: μ1 ≠ μ2    (Eqn. 11-1),

where μi is the mean for treatment i (i = 1, 2).
The statistical inference in this case is a decision rule, based on the observed values of X in the two treatment groups, for deciding between the two competing hypotheses. In the standard statistical approach, the trial is designed to control the rates for the two possible types of error:

· Type I: rejecting H0 (in favor of H1) when H0 is true
· Type II: not rejecting H0 when H1 is true

The error rates for these two types of errors are conventionally denoted by α and β, respectively, with the power of the test defined as 1 − β, the complement of the type II error rate. That is, the power of the test is the probability of correctly rejecting H0 when H1 is true. The usual approach to determining the required sample size is to set the type I error rate α at some acceptable level, often 0.05 or 0.01, and then to find the minimum required sample size to achieve a power of at least some specified value 1 − β, often 0.80 or 0.90, at some specified alternative value (i.e., some particular value in the alternative hypothesis space when H1 is a composite space). Suppose (for simplicity) that the endpoint Xi in the ith treatment group (i = 1, 2) has a normal statistical distribution with mean μi and known variance σ², denoted Xi ~ N(μi, σ²), and we plan to enter an equal number of patients, n, on each treatment. In this case, the usual test statistic, Z, used in testing H0 versus H1 may be written as:

Z = √(n/2) (X̄1 − X̄2)/σ    (Eqn. 11-2),
where X̄i is the sample mean of X in the ith treatment group. The hypothesis H0: μ1 = μ2 is rejected in favor of H1: μ1 ≠ μ2 if |Z| ≥ zα/2, where zx is the 100(1 − x)th percentile of the standard normal distribution. To determine the required common sample size, we need to solve the following equation for n:

P(|Z| ≥ zα/2 | δ) ≥ 1 − β    (Eqn. 11-3),

where P(X | Y) denotes the probability of X given Y and δ = μ1 − μ2 ≠ 0 is some prespecified value in the alternative hypothesis space. That is, we want the power to be at least 1 − β when the true difference between the means is δ. Some straightforward algebraic manipulation of (11-3) yields the following sample size formula for the approximate number of patients, n, required on each treatment group:

n = [2σ²(zα/2 + zβ)²/δ²]    (Eqn. 11-4),

where [x] denotes the smallest integer ≥ x. Equation (11-4) is the canonical sample size formula for the hypothesis-testing scenario considered here. If the variances in the two groups are not equal, the formula becomes

n = [(σ1² + σ2²)(zα/2 + zβ)²/δ²]    (Eqn. 11-5),
These formulae can also be used to determine the power for a given sample size by solving for zβ when n is given. Although the above formulae are strictly applicable only for the assumptions underlying their derivation, they are approximately correct in many other settings. Often the test statistic or some simple transformation of the test statistic is approximately normally distributed for reasonably large sample sizes. Also, the formulae nicely illustrate several general points about the required sample size in phase III clinical trials:

· The required sample size n increases as the variance σ² increases. The size of σ² is a feature of the population under study.
· The required sample size n increases as the error rates decrease. For example, an increase in power (decrease in β) requires an increase in sample size.
· The required sample size n increases as the detectable effect size δ decreases.

The required sample size calculated from (11-4) for some common values of α and β, as a function of the standardized effect size δ/σ, is given in Table 11.1. If the variances are not equal, the standardized effect size may be defined as δ/σ̄, where σ̄² = (σ1² + σ2²)/2.
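Formula (11-4) is simple enough to evaluate directly. As an illustrative sketch (the function name and default arguments are mine, not the chapter's), using only the Python standard library:

```python
import math
from statistics import NormalDist

def n_per_arm(effect, alpha=0.05, power=0.90):
    """Patients per arm from Eqn. 11-4, expressed in terms of the
    standardized effect size: n = [2(z_{a/2} + z_b)^2 / (d/s)^2]."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / effect ** 2)

print(n_per_arm(0.50))               # 85 per arm (2n = 170)
print(n_per_arm(0.25, power=0.80))   # 252, matching Table 11.1
```

The first call reproduces the worked example below (α = 0.05, 1 − β = 0.90, δ/σ = 0.50); the second reproduces the corresponding entry of Table 11.1.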
For example, if α = 0.05, 1 − β = 0.90, and δ/σ = 0.50, the number of patients required on each treatment is n = 85 and the total required sample size is 2n = 170.

Unknown Variances

If we relax the assumption that σ is known, the appropriate test statistic is not (11-2) but the t-statistic

T = √(n/2) (X̄1 − X̄2)/s    (Eqn. 11-6),
where s is the pooled estimate of the common, but unknown, standard deviation σ. In this case, a good approximation (3) to the required sample size n* is:

n* = n + [z²α/2 / 4]    (Eqn. 11-7),
TABLE 11.1
Required Sample Size on Each Treatment Arm to Test H0: μ1 = μ2 vs. H1: μ1 ≠ μ2.

                              δ/σ
α       1−β      0.10    0.25    0.50    0.75    1.00
0.01    0.80     2336     374      94      42      24
        0.90     2976     477     120      53      30
0.05    0.80     1570     252      63      28      16
        0.90     2102     337      85      38      22
where, as before, [x] is the smallest integer ≥ x and n is defined in (11-4). Although n* is always greater than n, the difference is not large. For example, n* − n = 1 when α = 0.05 and n* − n = 2 when α = 0.01. Thus, in most practical situations, the required number of patients when the variance is unknown is only one or two more patients per treatment group than the number required when the variance is known.

Unequal Sample Sizes
To allow for different sample sizes n1 and n2 on the two arms, (11-4) may be written as:

(1/n1 + 1/n2)⁻¹ = [σ²(zα/2 + zβ)²/δ²]    (Eqn. 11-8),

Any pair of values (n1, n2) that satisfies (11-8) will work. However, the required total sample size, n1 + n2, is minimized when n1 = n2. If we randomize patients to the two treatments in the ratio r:1, for some r > 1, rather than in the usual balanced 1:1 ratio (i.e., n1 = rn2), the required sample sizes are

n1 = [(r + 1)n/2]
n2 = [(r + 1)n/(2r)]    (Eqn. 11-9),

and the required total sample size is approximately

n1 + n2 = 2n(r + 1)²/(4r)    (Eqn. 11-10),

where n is determined by (11-4). The inflation factor of (r + 1)²/4r in the required total number of patients is the primary reason that balanced randomization is generally preferred to unbalanced randomization. For example, if r = 2 (i.e., twice as many patients are entered on treatment 1 as on treatment 2), we would require (2n)(9/8) patients, a 12.5% increase over the 2n required in the balanced case. However, there may be other reasons for an unbalanced randomization (such as wanting more patients on one of the treatments to increase the precision of the estimated outcomes for that treatment). If so, an unbalanced allocation might be preferable to a balanced one even with the resultant sample size inflation factor.

More Than Two Treatment Groups
In many phase III clinical trials there are k > 2 treatments (4). Unfortunately, in order to control the error rates in this setting, the above sample size formulae cannot be extended simply by entering n patients on each of the k treatments (a total of kn patients). More than kn patients are required. Three important types of phase III clinical trials with more than two arms are considered below.

Testing Equality Among k Treatment Arms. In the simplest type of k-arm clinical trial, there is a randomization to one of the k arms and the primary objective is to test a global null hypothesis. With an obvious extension of the notation in (11-1), the hypotheses being tested are

H0: μ1 = μ2 = . . . = μk vs. H1: μi ≠ μj for some i ≠ j    (Eqn. 11-11),

If σ² is known, then the test statistic, analogous to (11-2) in the two-sample case, is

X² = n ∑i (x̄i − x̄)²/σ²,  summing over i = 1, . . ., k    (Eqn. 11-12),

where x̄i is the sample mean in the ith treatment arm and x̄ is the overall sample mean. The hypothesis
H0 is rejected in favor of H1 if X² > χ²α,k−1, the upper 100(1 − α) percentile of a chi-square distribution with k − 1 degrees of freedom. To determine the required sample size we need to solve an equation similar in form to (11-3):

P(X² ≥ χ²α,k−1 | Δ) ≥ 1 − β    (Eqn. 11-13),

where Δ = ∑i (μi − μ̄)²/σ² and μ̄ = (1/k) ∑i μi, with sums over i = 1, . . ., k. When Δ ≠ 0,
X² has a noncentral chi-square distribution with noncentrality parameter nΔ and no closed-form solution for n exists analogous to (11-4). However, solutions are easily available either from computer programs or from tables of the noncentral chi-square distribution. As noted previously, the required sample size per treatment arm for k > 2 will be larger than that required when k = 2, increasing as a function of k. Table 11.2 gives the multiplication factor required for k = 3, 4, 5, and 6 when all means other than the largest and smallest are midway between the largest and smallest. For example, the number of patients required for k = 3 is 1.23n per arm (i.e., 23% more patients on each arm) when α = 0.05, 1 − β = 0.80. For k = 4, 5, and 6, the requirements are 1.39n, 1.52n, and 1.63n, respectively. Although Table 11.2 represents the worst-case scenario, the one with the least favorable configuration of mean values between the two extreme means, there is generally a high price to be paid for increasing the number of treatment arms on a clinical trial.

TABLE 11.2
Multiplication Factors for the Number of Patients Required for k > 2 Treatment Arms.

                     k = number of arms
α       1−β       3       4       5       6
0.01    0.80     1.19    1.32    1.43    1.53
        0.90     1.17    1.29    1.39    1.48
0.05    0.80     1.23    1.39    1.52    1.63
        0.90     1.20    1.35    1.47    1.57

Two or More Experimental Arms and a Control Arm. Another common k-arm design results when we wish to compare k − 1 new or experimental arms with a standard or control, with randomization of each patient to one of the k arms. In this case, there are k − 1 comparisons of interest. If we let arm 1 be the control arm and define δ* = min{μi − μ1; i = 2, . . ., k}, then a simple, albeit conservative, approach in this setting is to apply equation (11-4), substituting α/[2(k − 1)] for α/2 and δ* for δ. This approach ensures that a sufficient number of patients are entered to achieve the requisite power for all comparisons, allowing for the multiple comparisons. A better approach, requiring fewer patients, is to adjust for the inherent multiplicity using a less conservative multiple comparison procedure (5–7). Jung et al. (8) use a Dunnett-type procedure for this purpose in the setting of survival distributions (see the "Comparing Survival Distributions" section later in the chapter for more details on survival endpoints).

Factorial Designs. In a factorial design there are several factors (or treatment types in a clinical trial) to be tested in combination (9). In the simplest type of factorial design, referred to as a 2 × 2 design, there are two treatments, each given at one of two levels. For example, the treatments might refer to particular therapeutic agents (say, A and B) and the two levels might refer to the presence or absence of a specified regimen for the agent. The four possible combined treatment regimens are: A and B absent; A absent and B present; A present and B absent; A and B both present. In general, with two factors we could define an R × C factorial design, with R levels of one factor and C levels of the other factor, although in clinical applications it would be rare for R or C to exceed three. A 2 × 2 design is by far the most common factorial design. In a factorial design it is possible to compare the effects of each of the treatments separately as well as to estimate the interaction effects among treatments. An interaction implies that the effect of a treatment depends on the presence or absence of another treatment. To make this point clear, consider a 2 × 2 design with two treatments either present or absent. The mean values in each of the four possible treatment combinations are given in Table 11.3.
The treatment effects in Table 11.3 are the differences in mean values when the treatment is given compared to when it is not given. The quantity ε measures the interaction between treatments. If ε = 0, the effect of treatment A is δ0 regardless of whether treatment B is given or not, and the effect of treatment B is δ1 regardless of whether treatment A is given or not. If ε > 0 (a positive interaction), there is synergy between the treatments; the effect of each treatment is increased in the presence of the other. If ε < 0 (a negative interaction), there is antagonism between the treatments; the effect of each treatment is decreased by the presence of the other.
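The decomposition in Table 11.3 can be made concrete with a few lines of code; the cell-mean values below are invented purely for illustration:

```python
def factorial_effects(m00, m10, m01, m11):
    """Decompose the four cell means of a 2x2 factorial:
    m00 = neither treatment, m10 = A only, m01 = B only, m11 = both."""
    delta0 = m10 - m00               # effect of A in the absence of B
    delta1 = m01 - m00               # effect of B in the absence of A
    eps = m11 - m10 - m01 + m00      # interaction: departure from additivity
    return delta0, delta1, eps

# With these hypothetical means, A adds 4, B adds 2, and giving both adds 7:
print(factorial_effects(10.0, 14.0, 12.0, 17.0))  # (4.0, 2.0, 1.0): synergy
```

A positive ε, as here, means the combination does better than the sum of the separate effects; a negative ε means antagonism.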
TABLE 11.3
Treatment Effects in a 2 × 2 Factorial Clinical Trial.

                            Treatment B
                    absent          present             Treatment B effect
Treatment A
  absent            μ               μ + δ1              δ1
  present           μ + δ0          μ + δ0 + δ1 + ε     δ1 + ε
Treatment A effect  δ0              δ0 + ε

If one can assume that ε ≅ 0, then a factorial design is highly efficient. One can design the trial to test the effect of treatment A or B without consideration of the other treatment and get "two trials for the price of one." If nA and nB are the required sample sizes for the two individual trials, then a single factorial trial of size n = max {nA, nB} will achieve at least the same power for each individual treatment comparison as two trials with total sample size of nA + nB. In fact, the power will be greater than required for the comparison with the smaller required ni. However, if ε < 0, the power of the individual comparisons may be considerably less than that when there is no interaction. To allow for this possibility, one option is to assume a slight negative interaction in the design of the trial and increase the size of the trial accordingly. Unfortunately, if one wishes to test formally for interactions, the required size of the trial will be quite large, counteracting one of the primary advantages of a factorial design (9). It would also be possible to consider the trial as if it were a trial with k = RC treatments and use the approach outlined above for k treatment arms. However, this approach also can result in a very large trial, and in any case does not take advantage of the unique structure of the factorial design. Factorial clinical trials can play an important role in evaluating therapies, especially in a setting where treatments are likely to be used in combination in practice. Indeed, such trials are essential to learn about the joint effects of treatments. However, if the treatment combinations are not likely to be used in practice, factorial designs are not appropriate because of the potential for negative interactions and the resultant loss of power.

COMPARING SUCCESS RATES

If the assumptions underlying the above derivations are approximately correct, the resultant sample size formulae can in principle be used directly. However, it is often necessary to consider modifications of the approach
when designing actual clinical trials. One such situation requiring special consideration concerns trials in which the outcome measure is a binary variable. Such trials are considered in this section. Another important special case, trials in which the outcome measure is time to some event, is considered in the subsequent section. In some phase III clinical trials the outcome on each patient is assessed as a success or a failure and the objective of the trial is to compare the success rates of the treatments under study. For example, a success might be defined as achieving a particular clinical status, perhaps achieving an objective response or remaining disease free for some specified time period. In these cases, the observed success rate on treatment i will have a binomial distribution with a mean pi, the unknown probability of success, and variance pi(1 − pi)/n. The hypotheses equivalent to those in (11-1) are

H0: p1 = p2 vs. H1: p1 ≠ p2    (Eqn. 11-14),
Even though the binomial distribution is not a normal distribution, for large samples a normal approximation is reasonable and one may use equation (11-5) directly with δ = p1 − p2 and σi² = pi(1 − pi). That is,

n = [(p1(1 − p1) + p2(1 − p2))(zα/2 + zβ)²/(p1 − p2)²]    (Eqn. 11-15),

A second approach is to apply the variance-stabilizing transformation arcsin √(x/n) to the observed proportion of successes x/n. This approach yields

n = [(zα/2 + zβ)²/(2(arcsin √p1 − arcsin √p2)²)]    (Eqn. 11-16),
TABLE 11.4
Number of Patients on Each of Two Treatments to Compare Success Rates (α = 0.05, 1 − β = 0.80).

                            δ = p2 − p1
p1      0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40
0.10     725     219     113      72      51      38      30      25
0.20    1134     313     151      91      62      45      35      28
0.30    1416     376     176     103      68      49      37      29
0.40    1573     408     186     107      70      49      36      28
0.50    1605     408     183     103      66      45      33      25
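The entries in Table 11.4 can be reproduced from the modification of Fleiss et al. given below in (11-18); as a numerical sketch (the function name is mine):

```python
import math
from statistics import NormalDist

def n_binomial(p1, p2, alpha=0.05, power=0.80):
    """Patients per arm to compare two proportions via Eqn. 11-18:
    the normal approximation (11-17) plus the continuity term 2/|p1 - p2|."""
    z = NormalDist().inv_cdf
    p_bar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    base = (z(1 - alpha / 2) * math.sqrt(2 * p_bar * (1 - p_bar))
            + z(power) * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(base / delta ** 2 + 2 / delta)

print(n_binomial(0.10, 0.30))  # 72, matching Table 11.4 (p1 = 0.10, delta = 0.20)
```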
A third approach, also based on the approximate normality of the sample proportions (10), yields

n = [(zα/2 √(2p̄(1 − p̄)) + zβ √(p1(1 − p1) + p2(1 − p2)))²/(p1 − p2)²]    (Eqn. 11-17),

where p̄ = (p1 + p2)/2. Haseman (11) showed that all of the above formulae result in values that are too small when the actual test being used is an exact test. Casagrande et al. (12) provided an improved formula in this setting and Fleiss et al. (13) showed that a better approximation results from a simple modification to (11-17):

n = [(zα/2 √(2p̄(1 − p̄)) + zβ √(p1(1 − p1) + p2(1 − p2)))²/(p1 − p2)² + 2/|p1 − p2|]
(Eqn. 11-18),

Table 11.4 gives the required sample sizes based on (11-18) for some selected cases.

COMPARING SURVIVAL DISTRIBUTIONS

When the hypotheses being tested involve time-to-event, or survival, data, several complications arise. The most important one is that the observations may be incomplete (or censored) at the time of the analysis, either because of dropouts or loss to follow-up or because the event in question (recurrence, progression, death, etc.) has not yet occurred for some patients. Censoring affects both the number of patients that need to be enrolled on trial as well as the required duration of the trial. For reasons that will be made clearer below, the number of events, rather than the number of patients on trial, is the key quantity to be determined
and the duration of the trial must be planned to achieve the desired number of events. There is a vast literature on the design of clinical trials with survival as the major endpoint (8, 14–32), mostly at a more advanced statistical level than the level of this chapter.

Required Number of Events

For a random variable T representing the time to some event, the key probability functions are the survival distribution, or probability of surviving beyond time t, S(t) = P(T > t), and the hazard function λ(t) = f(t)/S(t), where f(t) is the probability density function. The hazard function may be thought of as the instantaneous failure rate at time t for a patient who has survived up to time t. Each function may be derived from the other if the other is fully specified. The simplest type of survival distribution is the exponential distribution, for which the hazard function is constant over time. George and Desu (15) provided a framework for determining both the required number of events and the required duration of study when the survival distributions under study follow an exponential distribution. In this case, the survival function, the probability of surviving beyond time t, is Si(t) = exp(−λit) in the ith treatment group, where λi is the hazard rate in the ith treatment group. The hypotheses being tested, analogous to those in equation (11-1), are

H0: λ1 = λ2 vs. H1: λ1 ≠ λ2    (Eqn. 11-19),

Or equivalently, in terms of the hazard ratio, Δ = λ1/λ2,

H0: Δ = 1 vs. H1: Δ ≠ 1    (Eqn. 11-20),
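Under the exponential model the hazard ratio has a convenient interpretation: since the median survival time is m = ln 2/λ, the hazard ratio equals the inverse ratio of the medians. A short sketch (the median values below are hypothetical):

```python
import math

def exp_hazard(median):
    """Exponential survival S(t) = exp(-lam*t): a median of m implies lam = ln2/m."""
    return math.log(2) / median

# A control median of 12 months versus a hoped-for 16 months on treatment:
hr = exp_hazard(16) / exp_hazard(12)  # Delta = lam_treatment / lam_control
print(round(hr, 2))  # 0.75, the inverse ratio of the medians (12/16)
```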
The ratio of the estimated hazard rates has an F distribution, so the required sample size can in principle be derived by solving an equation analogous to (11-3) for the F distribution. But these equations do not yield a closed-form expression for the sample size. A much simpler and quite accurate approximation for a reasonably large number of events is based on the approximate normality of the natural logarithm of the estimated hazard rate in each treatment group:

ln λ̂i ~ N(ln λi, 1/di),

where di is the number of observed events. Thus, the distribution of the log of the estimated hazard ratio can be approximated as:

ln Δ̂ = ln(λ̂1/λ̂2) ~ N(ln Δ, 1/d1 + 1/d2).

The required number of events on the ith treatment group, di, can be obtained from the following equation, directly analogous to equation (11-8):

(1/d1 + 1/d2)⁻¹ = [(zα/2 + zβ)²/(ln Δ0)²]    (Eqn. 11-21),

where Δ0 ≠ 1 is the specified hazard ratio for which we desire the power of the test to be 1 − β. Table 11.5 gives values of d1 + d2 for some common design situations. Here we assume d1 ≅ d2, yielding the minimum required total number of events. If the di are expected to differ substantially, an inflation factor similar to equation (11-10) should be applied.

The exponential assumption is not as restrictive as it might first appear, since the calculations are approximately correct for the general proportional hazards case, in which the ratio of the hazard functions is independent of time, even though the individual hazards are not. The log-rank statistic is available as the score statistic from the maximum likelihood fitting of the proportional hazards model (33). Schoenfeld (34) showed that the method of George and Desu approximates the power of the log-rank test as long as the assumption of proportional hazards holds. Rubinstein et al. (19) show via simulations that trial lengths calculated using the statistic of George and Desu and assuming exponential failure times give valid powers for the log-rank test when the underlying survival distributions are exponential or Weibull. Under a proportional hazards model, the distribution of the log of the estimated hazard ratio, Δ̂, can be approximated by the same approximate normal distribution as in the exponential case. Thus, although the exponential distribution represents a simple special case of proportional hazards, the required number of events defined by (11-21) applies directly to the more general proportional hazards case. A more precise formulation is given in two papers by Lakatos (24, 26). If the proportional hazards assumption is not correct, the sample size formulae based on the assumption can produce erroneous results (27).

Required Duration of Study

The sample size approximation formula (11-21) and the tabulated values in Table 11.5 are for the required number of events at the time of the final analysis. In order to observe the requisite number of events, it is necessary to follow patients over time until the events are observed. At one extreme, we could enter exactly 2d patients on trial and follow all until failure; at the other extreme, we could enter patients continuously until 2d patients have failed. The former approach will require the maximum duration of study; the latter will yield the shortest duration of study but at the cost of entering the maximum number of patients. Either approach will yield the requisite power. However, some intermediate approach would be more appropriate in most cases.
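Equation (11-21) with d1 = d2 = d reduces to d/2 = (zα/2 + zβ)²/(ln Δ0)², so the total number of events 2d can be computed directly. A sketch (function name mine; the total is rounded up, which matches the tabulated values):

```python
import math
from statistics import NormalDist

def total_events(hazard_ratio, alpha=0.05, power=0.80):
    """Total events d1 + d2 from Eqn. 11-21 with d1 = d2 = d:
    2d = 4 * (z_{a/2} + z_b)^2 / (ln Delta_0)^2, rounded up."""
    z = NormalDist().inv_cdf
    return math.ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2
                     / math.log(hazard_ratio) ** 2)

print(total_events(0.50))              # 66 events, matching Table 11.5
print(total_events(0.70, power=0.90))  # 331 events
```

Note how quickly the requirement grows as the hazard ratio approaches 1: detecting Δ0 = 0.90 at the same error rates takes 2829 events.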
TABLE 11.5
Total Number of Events Required to Compare Two Survival Distributions.

                                Δ = hazard ratio
α       1−β     0.90    0.85    0.80    0.75    0.70    0.65    0.60    0.55    0.50
0.01    0.80    4209    1769     939     565     368     252     180     131      98
        0.90    5362    2254    1196     720     468     321     229     167     124
0.05    0.80    2829    1189     631     380     247     170     121      88      66
        0.90    3787    1592     845     508     331     227     162     118      88
We assume that we will enter a sufficient number of patients (at least 2d) on study during some accrual period (0, T), each randomized to one of the two treatment arms. After this accrual period, there will be an additional follow-up period (T, T + τ) for all patients who have not failed before T in order to obtain the necessary 2d events. It is shown in George and Desu (15) that the expected number of events at time t, denoted D(t), can be written as

E[D(t)] = (γt*/2)(p1(t) + p2(t))    (Eqn. 11-22),

where γ is the average accrual rate, t* = min(T, t), and pi(t) = 1 − (λit*)⁻¹ exp(−λit)[exp(λit*) − 1]. To find an appropriate accrual period T and follow-up time τ, we may require that the expected number of events at time T + τ be at least 2d:

E[D(T + τ)] ≥ 2d    (Eqn. 11-23),
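Setting τ = 0 and solving (11-23) for T gives the minimum study duration. Since a closed form is not available, a numerical search is needed; the sketch below (function names and bisection bounds are mine) takes arm 1 as the control with median M1 and arm 2 with hazard λ1/Δ, which reproduces the tabled values:

```python
import math

def expected_events(gamma, T, t, lam1, lam2):
    """E[D(t)] from Eqn. 11-22: uniform accrual at rate gamma over (0, T),
    exponential survival with hazards lam1 and lam2, equal allocation."""
    t_star = min(T, t)
    def p(lam):  # probability that an entered patient has failed by time t
        return 1 - math.exp(-lam * t) * (math.exp(lam * t_star) - 1) / (lam * t_star)
    return gamma * t_star / 2 * (p(lam1) + p(lam2))

def min_duration(two_d, gamma, hr, control_median=1.0):
    """Smallest T with tau = 0 satisfying E[D(T)] >= 2d (Eqn. 11-23), by bisection."""
    lam1 = math.log(2) / control_median  # control-arm hazard
    lam2 = lam1 / hr                     # so that hr = lam1/lam2
    lo, hi = 1e-6, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_events(gamma, mid, mid, lam1, lam2) >= two_d:
            hi = mid
        else:
            lo = mid
    return hi

print(round(min_duration(248, 200, 0.70), 1))  # 2.3 years, matching Table 11.6
```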
As noted earlier, the minimum T + τ occurs when τ = 0 (i.e., enter patients continuously until the end of the study with no follow-up period), although this also results in the maximum number of patients entered on study. Table 11.6 gives the minimum length of study for the case τ = 0 for some selected cases. In this table, the median time in the control group is assumed to be one year. For other median times, the times in Table 11.6 must be adjusted accordingly by multiplying by the correct control median: if the control median is in fact t years, the required length of study is t times the values given in Table 11.6. In addition, it should be noted that the time at which the required number of events occurs is a random variable. The above formulation in terms of the expected number of events yields a result that can provide a rough approximation to the required length of study. Even if the assumptions are exactly correct, the actual time at which the required number of events will occur on any given clinical trial might be considerably different from the expected time. If we enter only the minimum number of patients (2d), we will require the maximum length of study T + τ. Indeed, the expected time until the last patient fails (this is required if we enter only the minimal number of patients) will be approximately 2d/γ + (1.44 M1 ln(2d))/Δ, where M1 is the median in the control group. For example, if we consider α = 0.05, 1 − β = 0.80, and Δ = 0.70, then from Table 11.5, 2d = 248. If the entry rate (γ) is 200 per year, then from Table 11.6, the required minimum duration of study is 2.3 years when M1 = 1. However, if we enter only 248 patients and follow all of them to failure, the expected length of study would be approximately 12.6 years (T ≅ 1.2 years, τ ≅ 11.4 years). Obviously, some kind of compromise approach is needed between the two extremes discussed above. Although we desire a reasonably short time until the study is completed, it is also desirable to keep the excess number of patients entered over the required number of events relatively small. One practical approach is to define a maximum proportionate increase, p, in patients entered over the required number of events, set T = 2d(1 + p)/γ, and solve (Eqn. 11-23) for τ. The duration-of-study calculations in this section, in contrast to the required-number-of-events calculations in the previous section, depend heavily on the exponential assumption. For example, if the hazard rates decrease sharply over time, additional follow-up will not yield sufficient numbers of events as quickly as in the constant hazard rate model. In designing actual clinical trials, it is important to make realistic assumptions about the anticipated hazard rates. The
TABLE 11.6
Required Minimum Duration of Study (in Years) to Compare Two Survival Distributions (α = 0.05, 1 − β = 0.80).

                     γ = annual accrual rate
Δ       2d       50      100     150     200     250
0.90    2830     58      30      20      16      13
0.80     632     14      7.6     5.5     4.4     3.8
0.70     248     6.2     3.7     2.8     2.3     2.0
0.60     122     3.6     2.2     1.7     1.4     1.3
0.50      68     2.3     1.5     1.2     1.0     0.9
general approaches outlined here still apply, but the details will differ.
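The compromise just described is straightforward to automate: fix the accrual period at T = 2d(1 + p)/γ, then solve (11-23) numerically for τ. A sketch under the same exponential assumptions (function names and the illustrative p = 0.25 are mine):

```python
import math

def expected_events(gamma, T, t, lam1, lam2):
    """E[D(t)] from Eqn. 11-22: uniform accrual on (0, T), exponential survival."""
    t_star = min(T, t)
    def p(lam):
        return 1 - math.exp(-lam * t) * (math.exp(lam * t_star) - 1) / (lam * t_star)
    return gamma * t_star / 2 * (p(lam1) + p(lam2))

def followup_needed(two_d, gamma, hr, excess=0.25, control_median=1.0):
    """Fix accrual at T = 2d(1 + excess)/gamma, then bisect Eqn. 11-23 for
    the follow-up time tau. Arm 1 is the control (median = control_median)."""
    lam1 = math.log(2) / control_median
    lam2 = lam1 / hr
    T = two_d * (1 + excess) / gamma
    lo, hi = 0.0, 500.0
    for _ in range(60):
        tau = (lo + hi) / 2
        if expected_events(gamma, T, T + tau, lam1, lam2) >= two_d:
            hi = tau
        else:
            lo = tau
    return T, hi

T, tau = followup_needed(248, 200, 0.70)  # enter 310 patients over T = 1.55 years
print(round(T, 2), round(tau, 2))
```

With 2d = 248, γ = 200, and Δ = 0.70, entering 25% more patients than the required events (310 over T = 1.55 years) brings the follow-up τ down to roughly 1.3 years, far shorter than the 12.6-year extreme computed in the text.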
CONCLUDING COMMENTS

The situations considered in this chapter are simple ones, in order to emphasize the basic design principles without introducing unnecessary complexity. In practice, the complexity cannot be ignored. Some of the topics that will need consideration include interim analyses, missing data, dropouts, crossovers, time-dependent losses, nonproportional hazards, unequal and informative censoring, composite endpoints, stratification and covariates, equivalence testing, nonuniform accrual, and others. Several chapters in this book are devoted in part to one or more of these topics. Also, many of these topics are covered in the papers by Ahnn et al. (29) and Barthel et al. (32) on complex clinical trials. In addition, there have been several books devoted exclusively to sample size determination for clinical trials, including those by Shuster (35), Desu and Raghavarao (36), Machin and Campbell (37), and Chow et al. (38). These books and the other references cited in the "References" section can be consulted for more extensive coverage of the design of complex clinical trials.
References 1. Donner A. Approaches to sample size estimation in the design of clinical trials—a review. Stat Med. 1984;3(3):199–214. 2. George SL. The required size and length of a clinical trial. In: Buyse ME, Staquet MJ, Sylvester RJ, ed. Cancer Clinical Trials. Oxford: Oxford University Press; 1984:287–310. 3. Guenther WC. Sample size formulas for normal theory t tests. Am Stat. 1981;35(4):243–244. 4. George SL. Multiple treatment trials. In: Crowley J, ed. Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker; 2001:149–160. 5. Bang H, Jung SH, George SL. Sample size calculation for simulation-based multiple-testing procedures. J Biopharm Stat. 2005;15(6):957–967. 6. Cook RJ, Farewell VT. Multiplicity considerations in the design and analysis of clinical trials. J Roy Stat Soc(A). 1996;159(1): 93–110. 7. Bauer P. Multiple testing in clinical trials. Stat Med. 1991;10(6): 871–889. 8. Jung SH, Kim C, Chow SC. Sample size calculation for the logrank tests for multi-arm trials with a control. J Kor Stat Soc. 2008;37(1):11–22. 9. Peterson B, George SL. Sample size requirements and length of study for testing interaction in a 2 × k factorial design when time-to-failure is the outcome. Contr Clin Trials. 1993;14(6): 511–522. 10. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York: Wiley & Sons; 1981. 11. Haseman JK. Exact sample sizes for use with the Fisher-Irwin test for 2 × 2 tables. Biometrics. 1978;34(1):106–109. 12. Casagrande JT, Pike MC, Smith PG. An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics. 1978;34(3):483–486.
13. Fleiss JL, Tytun A, Ury HK. A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics. 1980;36:343–346.
14. Pasternack BS, Gilbert HS. Planning the duration of long-term survival time studies designed for accrual by cohorts. J Chronic Dis. 1971;24(11):681–700.
15. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chron Dis. 1974;27(1):15–24.
16. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer. 1976;34(6):585–612.
17. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. Br J Cancer. 1977;35(1):1–39.
18. Bernstein D, Lagakos SW. Sample size and power determination for stratified clinical trials. J Statist Comput Simul. 1978;8:65–73.
19. Rubinstein LV, Gail MH, Santner TJ. Planning the duration of a comparative clinical trial with loss to follow-up and a period of continued observation. J Chron Dis. 1981;34:469–479.
20. Freedman LS. Tables of the number of patients required in clinical trials using the log-rank test. Stat Med. 1982;1(2):121–129.
21. Makuch RW, Simon RM. Sample size requirements for comparing time-to-failure among k treatment groups. J Chron Dis. 1982;35(11):861–867.
22. Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983;39(2):499–503.
23. Palta M, Amini SB. Consideration of covariates and stratification in sample size determination for survival time studies. J Chron Dis. 1985;38(9):801–809.
24. Lakatos E. Sample size determination in clinical trials with time-dependent rates of losses and noncompliance. Cont Clin Trials. 1986;7(3):189–199.
25. Lachin JM, Foulkes MA. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics. 1986;42(3):507–519.
26. Lakatos E. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics. 1988;44:229–241.
27. Lakatos E, Lan KK. A comparison of sample size methods for the log-rank statistic. Stat Med. 1992;11(2):179–191.
28. Ahnn S, Anderson SJ. Sample size determination for comparing more than two survival distributions. Stat Med. 1995;14(20):2273–2282.
29. Ahnn S, Anderson SJ. Sample size determination in complex clinical trials comparing more than two groups for survival endpoints. Stat Med. 1998;17(21):2525–2534.
30. Jung SH, Hui S. Sample size calculation for rank tests comparing k survival distributions. Lifetime Data Anal. 2002;8(4):361–373.
31. Halabi S, Singh B. Sample size determination for comparing several survival curves with unequal allocations. Stat Med. 2004;23(11):1793–1815.
32. Barthel FMS, Babiker A, Royston P, Parmar MKB. Evaluation of sample size and power for multi-arm survival trials allowing for non-uniform accrual, non-proportional hazards, loss to follow-up and cross-over. Stat Med. 2006;25(15):2521–2542.
33. Cox DR. Regression models and life tables (with discussion). J Roy Stat Soc (B). 1972;34:187–220.
34. Schoenfeld D. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981;68:316–319.
35. Shuster J. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton: Chemical Rubber Company; 1993.
36. Desu MM, Raghavarao D. Sample Size Methodology. San Diego: Academic Press; 1990.
37. Machin D, Campbell MJ. Statistical Tables for the Design of Clinical Trials. 2nd ed. Oxford: Blackwell Scientific; 1996.
38. Chow SC, Shao J, Wang H. Sample Size Calculations in Clinical Research. 2nd ed. New York: Chapman and Hall; 2008.
12
Multiple Arm Trials
Susan Halabi
Many phase III trials evaluate more than one experimental regimen against a standard therapy or a control. Loosely defined, multiple arm trials are trials that have more than two treatment arms. Historically, a common approach has been to estimate the sample size for a two-arm trial and then multiply the number of patients per arm by k, where k is the number of arms. The main problem with this heuristic approach is that it does not take into account the multiple comparisons that are made. In addition, the power may be insufficient. In this chapter we review the basic principles involved in the design, conduct, and analysis of phase III trials with multiple treatment arms. We begin with a brief review of the types of multiple treatment trials, and subsequently describe the factorial design. Next, we present a general discussion of design considerations, including control of the type I error rate and sample size estimation. Finally, we provide an overview of the inherent challenges in analyzing data from multiple arm trials.
TYPES OF MULTIPLE ARM TRIALS
Perhaps the oldest multiple arm trial was conducted in 1747 on board the sailing ship Salisbury, where sailors were randomized to one of six possible treatments for scurvy: cider, vinegar, elixir vitriol, sea water, nutmeg, or oranges and lemons. Although the sample size was very small (two patients per treatment arm), patients who received oranges and lemons had visible benefit and were able to resume their duties on board. There are several different types of multiple treatment trials; we will discuss the more common types in this chapter. We focus on three arms for simplicity; however, the same principles apply when the design includes more than three arms.
Monotonic Increasing or Decreasing Relationship with a Certain Therapy
The primary hypothesis is that there is a monotonic increasing or decreasing relationship between the dose level and the endpoint. As an example, consider a trial in metastatic prostate cancer in which men are randomized to receive hydrocortisone (40 mg per day) plus one of three different doses of suramin: low (3.192 g/m2), intermediate (5.320 g/m2), or high (7.661 g/m2) (1). The primary hypothesis is that there is a monotonic relationship between the dose level and the objective response rate in men with prostate cancer. In other words, men who receive the intermediate dose will have a higher response proportion than men who receive the low dose, and men who receive the high dose will have an even higher response proportion than both other groups.
FIGURE 12.1 First type of multiple arm trials. Patients are randomized to: BCG followed by maintenance (control); BCG + α-interferon followed by maintenance (control + drug A); or BCG + gemcitabine followed by maintenance (control + drug B).

FIGURE 12.2 Second type of multiple arm trials. Patients are randomized to: Arm A, cetuximab + FOLFOX or FOLFIRI (control); Arm B, bevacizumab + FOLFOX or FOLFIRI (control + drug A); or Arm C, cetuximab + bevacizumab + FOLFOX or FOLFIRI (control + drug A + drug B).
Control Versus Several Experimental Drugs
In these types of trials, specific combinations of experimental therapies are of interest. The first type of these trials compares the control with control plus experimental regimen A, and the control with control plus experimental regimen B (Fig. 12.1). In other words, this type of design tests two hypotheses: is experimental drug A efficacious, and is experimental drug B efficacious? Note that the investigator is not interested in comparing control plus experimental drug A versus control plus experimental drug B. To illustrate this point, consider an example where an investigator is interested in testing two hypotheses in patients with bladder cancer, where the primary endpoint is the complete response rate at 6 months. The first hypothesis is that the proportion of complete response is higher in patients randomized to BCG plus interferon alpha followed by maintenance (experimental regimen) than in patients randomized to BCG followed by maintenance (control). The second hypothesis is that the proportion of complete response is higher among patients randomized to BCG plus gemcitabine followed by maintenance (experimental regimen) than among patients randomized to BCG followed by maintenance (control). The second type of trial tests the two hypotheses: is drug A efficacious, and is drug B efficacious over and above drug A? In an ongoing CALGB/SWOG trial in patients with untreated advanced or metastatic colorectal cancer with wild-type K-ras tumors, the investigators are interested in testing two primary hypotheses concerning the overall survival endpoint (Fig. 12.2). The primary objectives are to determine whether the addition of bevacizumab to chemotherapy (FOLFOX or FOLFIRI) prolongs the survival of patients treated with chemotherapy plus cetuximab, and, furthermore, whether the addition of bevacizumab to chemotherapy prolongs the survival of patients treated with bevacizumab plus chemotherapy plus cetuximab.
The third type of multiple arm study is a trial that tests hypotheses based on all pair-wise comparisons. For example, in a randomized cancer intervention trial, women were randomized to usual care (control), tailored print (TP), or TP plus telephone counseling (TP + TC). The major endpoint was the proportion of women adherent to mammography, and the investigator was interested in testing the three pair-wise comparisons (2).
Factorial Trials
The fourth type of multiple arm trial is the factorial design, in which two factors, each at two or more levels, are compared. The 2 × 2 factorial experiment is the simplest and one of the most common designs; two different treatment regimens are tested simultaneously in the same study without increasing the sample size. This type of design allows certain comparisons to be made that cannot be tested in regular k-arm trials. If there is no interaction between the two treatments, the two factors can be tested using the same number of subjects that would be needed for a single-factor trial. In addition, factorial trials are the only design in which an investigator can test for a treatment interaction between two drugs. The literature is rich in examples of factorial designs (3, 4). We will consider a 2 × 2 factorial design. Table 12.1 presents the 2 × 2 design with four possible groups: λ11, λ21, λ12, λ22. The elements are ordered such that the last subscript (associated with factor B) changes first and the first subscript (associated with factor A) changes last. λ11 is the hazard rate in patients randomized to receive neither factor A nor factor B, whereas λ22 is the hazard rate in patients randomized to receive both factors A and B. λ2. is the average hazard when treatment A is present, pooled over the treatment B categories. For instance, in testing for a drug A difference, we are interested in comparing the margins
TABLE 12.1
Example Factorial Design Cells with Two Factors A and B.

                    FACTOR B
FACTOR A      NO      YES     POOLED
No            λ11     λ12     λ1.
Yes           λ21     λ22     λ2.
Pooled        λ.1     λ.2     λ..
λ1. − λ2.. On the other hand, in testing for a drug B difference, we focus on comparing the margins λ.1 − λ.2. Factorial designs are widely used not only for estimating and testing the effect due to the difference in treatment arms, but also for the interaction effects of the treatment arms, or the interaction of gene by treatment arm. One way to test for interaction is to use the proportional hazards model:
λ(t | x1, x2) = λ0(t) exp(β1x1 + β2x2 + β3x1x2),
where λ0(t) is the baseline hazard function and xi = 0 or 1 represents the absence or presence of factor A (i = 1) or factor B (i = 2). The null hypothesis β3 = 0 indicates that there is no interaction between factors A and B. Testing for interaction alone is not sufficient, as the power to detect an interaction between two factors is generally poor. In fact, the main drawback of factorial trials is that interaction between the two treatment groups is common and investigators do not take this into account; as a result, such trials are usually underpowered. For a test of interaction between factors A and B, we need to consider the quantity (λ11 − λ12) − (λ21 − λ22). Peterson and George developed the sample size required in a 2 × 2 factorial design in the presence of interaction when the endpoint is time-to-event (5). Simon and Freedman proposed inflating the sample size by 30% to account for the possibility that an interaction may be present between the two factors (6).
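To see what the interaction term does, note that under this model the hazard ratio of each cell relative to the reference cell (neither factor) is exp(β1x1 + β2x2 + β3x1x2), so when β3 = 0 the joint effect of A and B is exactly the product of their separate effects. A minimal sketch, with arbitrary effect sizes chosen purely for illustration:

```python
import math

def cell_hazard_ratio(x1, x2, b1, b2, b3):
    """Hazard ratio relative to the (x1 = 0, x2 = 0) cell under the
    proportional hazards model lambda0(t) * exp(b1*x1 + b2*x2 + b3*x1*x2)."""
    return math.exp(b1 * x1 + b2 * x2 + b3 * x1 * x2)

# Illustrative (not trial-derived) log-hazard effects for factors A and B.
b1, b2 = math.log(0.8), math.log(0.7)

# With no interaction (b3 = 0), the joint effect is multiplicative:
print(round(cell_hazard_ratio(1, 1, b1, b2, 0.0), 2))            # 0.56 = 0.8 * 0.7

# A nonzero b3 breaks the multiplicative relationship:
print(round(cell_hazard_ratio(1, 1, b1, b2, math.log(1.5)), 2))  # 0.84
```

In the first case the benefit of receiving both treatments is fully predictable from the single-factor effects; in the second it is not, which is exactly what the test of β3 = 0 probes.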
SELECTION AND SCREENING DESIGN Selection Design Selection or screening designs, whereby patients are randomized to several treatment regimens, are commonly used in clinical trials in oncology (7–9). One of the main objectives in randomized phase II trials is to screen for multiple experimental regimens and choose
the most promising agent to take forward into a phase III setting (7). Investigators often use multiple variations of a new regimen to select the best regimen. Selection trials do not follow the traditional hypothesis testing framework, but are designed to achieve a high probability of correctly selecting the truly best treatment. Furthermore, selection trials are usually underpowered, being designed with smaller sample sizes than a phase III trial. In addition, the type I error rate is not controlled at the design stage, as these designs do not allow for formal comparisons. Refer to Chapter 9 for more details on these designs. Chen et al. developed a multiple-step procedure for identifying experimental arms that differ from the control, with sequential protection of the preferred treatment among several experimental arms (10–12). By using this multiple selection procedure, the type I error rate is protected for the preferred treatment.
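The operating characteristic that matters for a pick-the-winner selection design, the probability of correctly selecting the truly best arm, is easy to explore by simulation. The response rates and per-arm sample size below are hypothetical, not taken from any trial discussed here:

```python
import random

def prob_correct_selection(true_rates, n_per_arm, n_sims=2000, seed=1):
    """Simulate a pick-the-winner selection design: randomize n_per_arm
    patients to each arm and select the arm with the most responders.
    Returns the estimated probability of selecting the truly best arm."""
    rng = random.Random(seed)
    best_arm = max(range(len(true_rates)), key=lambda i: true_rates[i])
    correct = 0
    for _ in range(n_sims):
        responders = [sum(rng.random() < p for _ in range(n_per_arm))
                      for p in true_rates]
        # Break ties at random among the tied arms.
        top = max(responders)
        winners = [i for i, r in enumerate(responders) if r == top]
        if rng.choice(winners) == best_arm:
            correct += 1
    return correct / n_sims

# Hypothetical screening trial: two arms with a 20% response rate, one with 40%.
print(prob_correct_selection([0.20, 0.20, 0.40], n_per_arm=50))
```

With a large separation between arms, as here, the selection probability is high even at phase II sample sizes; shrinking the difference between the true rates shows how quickly that probability erodes.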
DESIGN CONSIDERATIONS
Type I Error Rate
The type I error rate, or false-positive rate, often referred to as α, is the probability of rejecting the null hypothesis when the null is true. It is well recognized that the type I error rate increases with each hypothesis that is tested. If we conduct m independent tests of significance, each with type I error rate α, the probability of finding at least one significant result by chance when in fact there are no differences is 1 − (1 − α)^m. Table 12.2 presents the true type I error rate when multiple hypotheses are tested, assuming α = 0.05 for each test and independence. For instance, if three hypotheses are tested to evaluate whether new treatments prolong survival compared with a standard therapy, the type I error rate is 14%. As a result, the probability of committing a type I error has increased from 5%, as in the original design, to 14%.
TABLE 12.2
Effect of Multiple Testing on the True Type I Error Rate Assuming α = 0.05.

NUMBER OF TESTS     TYPE I ERROR RATE
1                   0.0500
2                   0.0975
3                   0.1426
4                   0.1855
5                   0.2262
10                  0.4013
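The entries in Table 12.2 follow directly from the formula 1 − (1 − α)^m for m independent tests, and can be reproduced in a few lines:

```python
def familywise_error(alpha, m):
    """Probability of at least one false-positive finding among
    m independent tests, each conducted at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 3, 4, 5, 10):
    print(m, round(familywise_error(0.05, m), 4))
# m = 3 gives 0.1426, matching the roughly 14% quoted in the text.
```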
Clinical trials are expensive and take a long time to conduct, and the investigator needs to guard against making erroneous claims when in fact there are no real effects. Therefore, it is critical for an investigator to control the type I error rate prospectively and to specify the methods of adjusting for multiplicity at the design stage. This is particularly important in multiple arm trials, in which several hypotheses will be tested and multiplicity will occur in the data analysis phase. Many multiple arm trials start with a global test of equality of hazard rates (or survival times). If the global null hypothesis of equality of hazard rates is rejected, then one is generally interested in identifying the treatments whose hazard rates differ significantly. There are two common approaches to hypothesis testing after a global null hypothesis is rejected: either follow with pair-wise comparisons or use sequentially rejective methods. The Bonferroni procedure is one of the oldest procedures for controlling the type I error rate (13). By using this approach, an investigator is assured that the type I error rate does not exceed α. The procedure is simple and is implemented by dividing the overall type I error rate α by the number of independent tests (m) to be conducted; each hypothesis is then tested at level αj = α/m. This procedure is considered a single-step approach, as only one step is needed to find the critical value and perform the test. The main drawback of the Bonferroni procedure is that it is very conservative, especially when the tests are correlated or the number of tests (m) is large, and it therefore requires large sample sizes. Various modifications of the Bonferroni procedure have been proposed for controlling the type I error rate (14–18). These procedures are sequentially rejective (or stepwise) methods, since the result of a given test depends upon the results of the other tests.
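Both the single-step Bonferroni adjustment just described and Hochberg's step-up variant (discussed below) are straightforward to sketch; the two p-values used in the example are hypothetical:

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Single-step Bonferroni: test each hypothesis at level alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def hochberg_reject(pvals, alpha=0.05):
    """Hochberg step-up procedure: starting from the least significant
    p-value, compare the i-th largest p-value with alpha / i; once one
    passes, reject it together with every smaller p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / rank:
            for j in order[rank - 1:]:
                reject[j] = True
            break
    return reject

# Hypothetical two-hypothesis trial with p-values 0.040 and 0.010.
print(bonferroni_reject([0.040, 0.010]))  # [False, True]
print(hochberg_reject([0.040, 0.010]))    # [True, True]
```

The example shows why stepwise procedures are preferred: Bonferroni rejects only the smaller p-value, whereas Hochberg rejects both while still controlling the familywise error rate.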
These methods can be described as either step-down or
step-up procedures. Let us assume that we have m independent hypotheses to be tested. In the step-down procedure, we start with the most significant p-value and then step down to the least significant one (15). In the step-up procedure, on the other hand, we start with the least significant p-value and then step up to the most significant one (16, 17). Hochberg's approach is one of the most commonly used step-up procedures (17). For example, in an international lung trial, patients with stage IIIB to IV non-small-cell lung cancer were randomly assigned with equal probability to receive docetaxel (75 mg/m2) and cisplatin or docetaxel and carboplatin every 3 weeks, or vinorelbine and cisplatin every 4 weeks (19). The investigators used the Hochberg method to control the type I error rate for the two hypotheses being tested (treatment comparisons). The trial would be declared statistically significant if both tests achieved a p-value ≤ 0.05 or if only one of them achieved a p-value ≤ 0.028.
Sample Size Determination
There are different approaches to designing trials with several treatment groups (20–24). In this section, we discuss sample size computations based on a time-to-event endpoint, the most common type of endpoint used in phase III trials in oncology.
Comparison of Three Survival Curves
For simplicity, we assume that equal numbers of patients are randomly allocated to all treatment groups. Table 12.3 presents the accrual times (T) for k = 3 treatment groups, given the follow-up time τ, the yearly entry rate n, a common hazard rate Φ for the censored observations, the two hazard ratios Δ1 = λ1/λ2 and Δ2 = λ1/λ3, and a two-sided type I error rate of 0.05. λ1 is the hazard in the control arm and may be unknown at the design stage; however, if the median survival time (M1) in the control group is known, λ1 can be determined using the well-known relationship λ1 = (log 2)/M1 for the exponential distribution.
The hazard ratio (denoted as Δ = λ1/λ2) is the ratio of the hazard rate in subjects assigned to the control and experimental arms, respectively. Alternatively, we can define the hazard ratio as the ratio of the median survival time in the experimental arm to the control arm (Δ = M2/M1) if the survival time follows an exponential distribution. For each τ, Δ1, Δ2, there are two rows: the first row gives values of T for the power of 0.80 and the second row gives values of T for the power of 0.90. The values of T are obtained by solving nonlinear equations using the Newton-Raphson procedure.
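Two quantities used repeatedly here, the exponential hazard implied by a median survival time and the number of events needed to detect a given hazard ratio, can be sketched in a few lines. The event count uses Schoenfeld's well-known approximation for a two-arm log-rank test with 1:1 allocation; it is a rough planning check, not the Newton-Raphson calculation behind Table 12.3:

```python
import math
from statistics import NormalDist

def hazard_from_median(median):
    """Exponential model: lambda = log(2) / median survival time."""
    return math.log(2) / median

def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate total number of events needed for a two-arm log-rank
    test with 1:1 allocation (Schoenfeld's formula)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_a + z_b) ** 2 / math.log(hazard_ratio) ** 2)

# A median survival of 2.77 years corresponds to a yearly hazard of about 0.25.
print(round(hazard_from_median(2.77), 2))   # 0.25

# Detecting a hazard ratio of 1.25 with 80% power at two-sided alpha = 0.05
# requires roughly 631 events in a two-arm comparison.
print(schoenfeld_events(1.25))              # 631
```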
TABLE 12.3
Accrual Period in Years (T) Assuming Equal Allocation Across the Groups and Type I Error Rate = 0.05. (For each τ and (Δ1, Δ2) pair, the first row gives T for power = 0.80 and the second row for power = 0.90.)

                             ENTRY RATE (PER YEAR)
τ   Δ1    Δ2    Power    60     80     100    120    180    240
1   1.20  1.25  0.80     20.48  16.16  13.53  11.75  8.67   7.03
                0.90     25.85  20.22  16.82  14.53  10.62  8.57
    1.30  1.35  0.80     12.36  9.94   8.44   7.39   5.54   4.53
                0.90     15.30  12.22  10.32  9.01   6.72   5.48
    1.40  1.45  0.80     8.88   7.22   6.16   5.42   4.09   3.34
                0.90     10.87  8.79   7.48   6.57   4.94   4.04
    1.50  1.55  0.80     7.00   5.71   4.89   4.31   3.23   2.65
                0.90     8.50   6.92   5.92   5.21   3.93   3.21
    2.00  2.05  0.80     3.65   3.00   2.56   2.25   1.68   1.36
                0.90     4.40   3.62   3.10   2.73   2.05   1.67
2   1.20  1.25  0.80     19.78  15.46  12.83  11.06  7.99   6.37
                0.90     25.14  19.52  16.12  13.83  9.93   7.90
    1.30  1.35  0.80     11.66  9.26   7.76   6.73   4.91   3.92
                0.90     14.60  11.52  9.63   8.33   6.06   4.85
    1.40  1.45  0.80     8.20   6.55   5.02   4.78   3.49   2.79
                0.90     10.17  8.10   6.81   5.91   4.31   3.45
    1.50  1.55  0.80     6.33   5.07   4.27   3.70   2.69   2.13
                0.90     7.82   6.26   5.27   4.57   3.34   2.66
    2.00  2.05  0.80     3.06   2.44   2.04   1.76   1.26   0.98
                0.90     3.78   3.03   2.54   2.20   1.58   1.24
5   1.20  1.25  0.80     18.42  14.12  11.52  9.76   6.77   5.23
                0.90     23.78  18.17  14.78  12.51  8.66   6.68
    1.30  1.35  0.80     10.34  7.98   6.53   5.55   3.85   2.96
                0.90     13.23  10.20  8.34   7.08   4.92   3.80
    1.40  1.45  0.80     6.94   5.36   4.39   3.72   2.57   1.97
                0.90     8.85   6.84   5.61   4.76   3.30   2.54
    1.50  1.55  0.80     5.14   3.97   3.24   2.75   1.89   1.44
                0.90     6.55   5.07   4.15   3.52   2.43   1.86
    2.00  2.05  0.80     2.16   1.66   1.34   1.13   0.77   0.58
                0.90     2.77   2.13   1.73   1.46   1.00   0.76
Suppose we are interested in testing the null hypothesis H0: λ1 = λ2 = λ3 against the alternative hypothesis H1: λ1 = 0.25, λ2 = 0.167, λ3 = 0.161, assuming a follow-up period of 2 years, power of 80%, and a two-sided type I error rate of 0.05. In our example, Δ1 = 1.50 and Δ2 = 1.55. From previous studies, we know that the yearly entry rate is 80 patients. Given the above values (from Table 12.3), the accrual length is 5.07 years and, therefore, the total duration of the trial is 7.07 years. In the colorectal cancer example, the investigators are interested in testing two primary hypotheses, and the study is designed with a type I error rate of 0.025 for each hypothesis (global type I error rate of 0.05). The study was designed with 90%
power to detect a hazard ratio of 1.25 (median survival of 27 months in the experimental arm versus 22 months in the control arm). The accrual rate was assumed to be 75 patients/month, with an accrual period of approximately 30.5 months and a follow-up of 24 months; the total trial duration is 64.5 months. The target sample size is 2,289 patients, with 1,478 deaths expected at the end of the trial.
Factorial Design Without Interaction
An investigator was interested in testing the effect of two regimens (epothilone and docetaxel [Taxotere]) in men with castrate-resistant prostate cancer. There are two factors, each with two levels. Let Mij represent the
median survival time in months for the ith level of factor A and the jth level of factor B. Overall survival was the primary endpoint, and the median survival times were assumed to be M11 = 18, M12 = 23, M21 = 23, and M22 = 29 months. Based on historical data, an accrual rate of 240 patients per year was assumed (or nij = 60 patients per cell). We control the type I error rate because we are testing two hypotheses. The following calculations assume a total of 1,200 patients accrued over a 4-year period and followed for 4 years after study closure, with a two-sided alpha level of 0.025. The power to detect a hazard ratio of 1.25 (equivalent to an increase in median overall survival [OS] from 20 months to 25 months) is 92%.
With Interaction
In an ongoing CALGB phase III trial, patients with transitional cell carcinoma are randomized to one of two treatments: (a) gemcitabine, cisplatin, and placebo, or (b) gemcitabine, cisplatin, and bevacizumab. It is assumed that the prevalence of a positive VEGF marker is 30%. The primary endpoint is OS. The investigator is interested in testing the treatment by VEGF level interaction in predicting overall survival, using a two-sided α = 0.05 level to attain a power of 0.80. In this example, the two factors A and B correspond to treatment with two levels (gemcitabine, cisplatin, and placebo versus gemcitabine, cisplatin, and bevacizumab) and VEGF with two levels (negative and positive). The median survival times in months were assumed to be 13.80 (i.e., M11 = M12 = 13.80) for patients treated with gemcitabine, cisplatin, and placebo regardless of VEGF level, whereas M21 = 15.18 and M22 = 28.15 for patients treated with gemcitabine, cisplatin, and bevacizumab with negative and positive VEGF levels, respectively. A total of 167 patients per year will be randomized equally between the two treatments.
With a 30% positive VEGF marker prevalence, 25 patients per year were expected to have a positive VEGF marker in each treatment arm, and 58.5 patients per year in each treatment arm were expected to have a negative VEGF marker. Also, we assumed a follow-up period of 12 months to detect the median survival differences noted above. Using the methodology developed by Peterson et al. (5), the accrual period is 3.52 years with a required sample size of 588 patients and an expected number of deaths of 428.
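Assuming exponential survival, the size of the treatment-by-VEGF interaction implied by these design medians can be quantified on the log-hazard scale. This is a back-of-the-envelope check of the planning assumptions, not part of the cited sample size methodology (5):

```python
import math

def log_hazard(median):
    """Exponential model: log hazard = log(log 2 / median survival)."""
    return math.log(math.log(2) / median)

# Design medians (months) from the example above.
m11, m12 = 13.80, 13.80   # placebo arm: VEGF-negative, VEGF-positive
m21, m22 = 15.18, 28.15   # bevacizumab arm: VEGF-negative, VEGF-positive

# Interaction contrast beta3 = (log l22 - log l21) - (log l12 - log l11),
# i.e., how much the (log) treatment effect differs between VEGF strata.
beta3 = (log_hazard(m22) - log_hazard(m21)) - (log_hazard(m12) - log_hazard(m11))
print(round(beta3, 2))   # -0.62
```

The assumed interaction is substantial: bevacizumab is posited to roughly halve the hazard in VEGF-positive patients while having little effect in VEGF-negative patients, which is why a trial of under 600 patients can be powered to detect it.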
CHALLENGES IN MULTIPLE ARM TRIALS
Because of space limitations, we cannot discuss in depth all of the common challenges that investigators may face in conducting and analyzing multiple arm trials; however, we provide an overview below. In multiple arm trials, plans to stop accrual to one or more experimental arms need to be prespecified at the design stage, to prepare for the situation when the results reach statistical significance, depending on the study objective. If one experimental arm is closed, the trial may continue accrual and patients will be randomized to the other arms. In the situation described in Figure 12.1, if the control arm is shown to be inferior to either experimental arm, then it is necessary to terminate the entire study. Another important question is whether to control the type I error rate and what approach to use. Recently, Freidlin et al. argued that there is no need to control the type I error rate when several experimental arms are compared to a control or standard arm (25). Their rationale is that such trials are designed to answer the efficacy question for each experimental drug separately, and as such the results of one comparison should not influence the results of the others. We have focused on the type I error rate, but power is also jeopardized by testing several hypotheses. Although the individual power to test each hypothesis may be sufficient, the global power may be less than optimal. Trials with either an induction randomization (to obtain a response from patients) or a maintenance randomization may end up with different timing of randomization, and as such the interpretation of results may be difficult. The design of factorial or multiple arm trials becomes more complex when testing for both superiority and noninferiority. For instance, the investigators designed the TAX 326 trial to answer both superiority and noninferiority questions, and thus the sample size was much larger than in a two-arm trial (19). Finally, we have concentrated on the situation where k = 3 arms. The design of trials with more than 3 arms can become extremely complicated.
In some instances, investigators may be able to frame their alternative hypothesis based on a partial order. The statistical methodology under order restriction is mathematically complex, and the reader is referred to Singh et al. (24).
SUMMARY In summary, multiple arm trials pose many challenges in their design, conduct, and analysis. In such studies, investigators should explicitly state the hypotheses to be tested a priori, and specify the type I error rate and the method of controlling the type I error rate at the design stage. If designed and analyzed appropriately,
multiple arm trials can answer important questions concerning a relevant patient population and provide the basis to make valid inferences about the experimental therapies being tested.
References
1. Small EJ, Halabi S, Ratain M, et al. Randomized study of three different doses of suramin administered with a fixed dosing schedule in patients with advanced prostate cancer: results of Intergroup 0159, Cancer and Leukemia Group B 9480. J Clin Oncol. 2002;20:3369–3375.
2. Rimer BK, Halabi S, Skinner CS, et al. Effects of a mammography decision-making intervention at 12 and 24 months. Am J Prev Med. 2002;22:247–257.
3. Henderson IC, Berry DA, Demetri GD, et al. Improved outcomes from adding sequential paclitaxel but not from escalating doxorubicin dose in an adjuvant chemotherapy regimen for patients with node-positive primary breast cancer. J Clin Oncol. 2003;21:976–983.
4. Stampfer MJ, Buring JE, Willett W, et al. The 2 × 2 factorial design: its application to a randomized trial of aspirin and carotene in U.S. physicians. Stat Med. 1985;4:111–116.
5. Peterson B, George SL. Sample size requirements and length of study for testing interaction in a 2 × k factorial design when time-to-failure is the outcome. Cont Clin Trials. 1993;14:511–522.
6. Simon R, Freedman LS. Bayesian design and analysis of 2 × 2 factorial clinical trials. Biometrics. 1997;53:456–464.
7. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. New York: Chapman and Hall; 1997.
8. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treat Rep. 1985;69:1375–1381.
9. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statistical Methodology. New York: Wiley & Sons; 1977.
10. Chen TT, Simon R. A multiple step procedure with sequential protection of preferred treatments. Biometrics. 1993;49:753–761.
11. Chen TT, Simon R. Extension of two-sided test to multiple treatment trials. Commun Stat Theory Methods. 1996;25:947–965.
12. Chen TT, Simon RM. Extension of one-sided test to multiple treatment trials. Cont Clin Trials. 1994;15:124–134.
13. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. Br Med J. 1995;310:170.
14. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73:751–754.
15. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65–70.
16. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9:811–818.
17. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75:800–802.
18. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50:1096–1121.
19. Fossella F, Pereira JR, von Pawel J, et al. Randomized, multinational, phase III study of docetaxel plus platinum combinations versus vinorelbine plus cisplatin for advanced non-small-cell lung cancer: the TAX 326 study group. J Clin Oncol. 2003;21:3016–3024.
20. Makuch RW, Simon RM. Sample size requirements for comparing time-to-failure among k treatment groups. J Chron Dis. 1982;35:861–867.
21. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Cont Clin Trials. 1994;16:119–130.
22. Ahnn S, Anderson SJ. Sample size determination for comparing more than two survival distributions. Stat Med. 1995;14:2273–2282.
23. Halabi S, Singh B. Sample size determination for comparing several survival curves with unequal allocations. Stat Med. 2004;23:1793–1815.
24. Singh B, Halabi S, Schell M. Sample size selection in clinical trials when population means are subject to a partial order. J Appl Stat. 2008;35:583–600.
25. Freidlin B, Korn E, Gray B, et al. Multi-arm clinical trials of new agents: some design considerations. Clin Cancer Res. 2008;14:4368–4371.
13
Noninferiority Trials in Oncology
Suzanne E. Dahlberg
Robert J. Gray
The randomized phase III clinical trial is the gold standard for establishing the clinical efficacy of a new therapy. The majority of such trials are designed to demonstrate the superiority of a novel agent, or of a novel combination of therapies, over a placebo or the current standard of care. In oncology, most of the remaining trials fall into a class known as noninferiority trials, which aim to determine whether a new treatment is no worse than the standard treatment by more than a prespecified margin. These designs are especially appropriate if the new treatment is thought to be less toxic, less expensive, or less invasive than the standard of care without placing a patient at an unacceptable amount of additional risk relative to the standard therapy, or, more generally, when the new agent is thought to provide practical benefits without compromising efficacy. This chapter discusses issues pertaining to the design, monitoring, analysis, and interpretation of noninferiority trials. Many of these issues are also relevant for equivalence studies, but these are rare in oncology. Equivalence trials test whether a new treatment is neither worse nor better than a standard therapy by more than a particular margin. Because curing cancer is the ultimate goal of cancer researchers, it generally is not of interest to conduct trials to show that a new agent does not provide more benefit than currently available therapies. Although the words noninferior and equivalent may be used interchangeably in certain
settings, it is important to emphasize the different statistical interpretations of these words when used in the context of clinical research. The examples discussed throughout this chapter illustrate designs of phase III noninferiority trials in oncology. Phase II noninferiority trials are an oxymoron; phase II trials generally aim to demonstrate a level of efficacy that deems an agent worthy of further study, and this level should be large enough to justify the use of resources to conduct a phase III trial. Moreover, noninferiority studies require very large sample sizes, which are not feasible in phase II studies. Therefore, establishing noninferiority in the phase II setting is generally not of interest.
STATISTICAL DESIGN

In noninferiority trials, patients are randomized between the new treatment (or combination) and an active control arm (usually representing the current standard of care). The most important aspect of any noninferiority design is the selection of the noninferiority margin (Δ). The value of Δ is a measure of the difference in the efficacy outcome between the treatment arms, and the magnitude of Δ is chosen so that the new treatment would be preferred if it is worse (less efficacious) than the standard by an amount < Δ. No formal rules for the
selection of this margin exist. The International Conference on Harmonization (ICH) recommends choosing a margin that is both statistically and clinically justified in a study population similar to those previously studied in the disease of interest (1). The magnitude of the noninferiority margin must be smaller than the magnitude of efficacy of the control arm, since otherwise a conclusion of noninferiority might be possible for a treatment that was not superior to best supportive care (Fig. 13.1). The small increase in risk characterized by the noninferiority margin must be compared to and considered an acceptable trade-off for the possible benefits in side effects, quality of life, and cost. The most common endpoints for noninferiority trials in oncology are time-to-event endpoints such as overall survival (OS) or progression free survival (PFS). Binary endpoints are feasible, but they provide less information and precision than the time-to-event endpoints, which report both the binary event status and the time of the event; therefore, binary endpoints provide less power than time-to-event outcomes given a fixed sample size. If registration with the Food and Drug Administration (FDA) is the ultimate goal of the trial, one should proceed with caution on the use of binary endpoints, since the FDA has stated that noninferiority trials “with endpoints other than survival are problematic” (2). Therefore, the examples that follow highlight noninferiority trials that implement time-to-event endpoints.
Types of Designs

Traditional (Fixed Margin) Method

The traditional noninferiority design, called the fixed margin method, uses a fixed prespecified noninferiority margin. For time-to-event endpoints, treatment differences are often expressed in terms of the hazard rate ratio. The hazard rate at follow-up time t is the instantaneous rate at which events occur among patients who have not had an event by t, and the hazard ratio Θ is the ratio of the hazard rate in the new treatment arm over that in the control arm. In general, the hazard ratio would vary with t, but under the common proportional hazards assumption, the ratio is assumed to be constant over time. That is, the event rate is assumed to be increased or decreased by the same proportionate amount across follow-up time. When the hypotheses are expressed in terms of hazard ratios under the proportional hazards assumption, the noninferiority margin Δ is > 1, and 100(Δ − 1) is the smallest percent increase in the hazard rate at which the new treatment would be considered unacceptably worse than the standard.
FIGURE 13.1 [Schematic: the benefit Δ1 of the standard over best supportive care (BSC), and the smaller noninferiority margin Δ2 between the standard and the new drug.] The magnitude of the noninferiority margin, Δ2, must be less than the magnitude of superiority, Δ1, of the control arm over best supportive care (BSC).
The null hypothesis in the traditional noninferiority design is usually taken to be inferiority:

H0: Θ ≥ Δ (the new treatment is at least 100(Δ − 1)% worse than the standard of care)

HA: Θ < Δ (the experimental arm has at most 100(Δ − 1)% higher risk than the control)

The sample size is chosen to ensure adequate power to reject inferiority in favor of equivalence (Θ = 1). Since the amount a new treatment could be worse than the standard and still be considered acceptable is usually quite small, the difference between the null and alternative hazard ratios (Δ and 1) is usually smaller than in superiority studies, so noninferiority trials tend to require much larger sample sizes. The null hypothesis in this formulation cannot be tested using a standard log-rank test, but Cox proportional hazards models can be used instead. The hypothesis in a noninferiority trial can also be formulated with equivalence (Θ = 1) as the null and inferiority (Θ > 1) as the alternative. In this setting, the sample size is chosen to ensure adequate power to detect a ratio of Θ = Δ, so in either case the study is designed to discriminate between hazard ratios of Δ and 1. One advantage of using the null of equivalence is that the traditional log-rank test can be used. In the second version, the interpretations of the type I and type II error rates are reversed from their usual definitions in the superiority test setting. The type I error is the probability of falsely rejecting the null hypothesis. Rejecting the null hypothesis of equality results in the continued use of the standard therapy, so a type I error corresponds to rejecting a new treatment with equivalent efficacy. The type II error rate is defined as the
probability of falsely declaring equality of the two regimens, which in this setting results in adopting a new regimen that is truly inferior to the standard of care. With the null hypothesis of equivalence, controlling the type II error rate is very important, so values of 0.025 to 0.05 are common, whereas larger type I error rates (0.05 to 0.10, one-sided) may be appropriate. When the null hypothesis of inferiority is used, rejecting the null hypothesis is the same as concluding noninferiority, and the usual definitions of the type I and type II error rates apply. Therefore, controlling the one-sided type I error rate is as important for the null of inferiority as it is in a superiority trial. With either null hypothesis, it is strongly recommended that confidence intervals for the primary endpoint always be presented in addition to any p-value from the hypothesis test so that the precision of the estimates may be considered in the context of previously conducted studies of the standard agent (3–5). These confidence intervals are compared to the noninferiority margin to declare noninferiority, as discussed later in this chapter.
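As a concrete illustration, the event count needed to discriminate between hazard ratios of Δ and 1 can be approximated with Schoenfeld's formula for the log-rank test. The function below is our own sketch (the name, defaults, and 1:1 allocation are assumptions, not part of the original text); with the margin of 1.176 used in the pemetrexed trial discussed later in this chapter, it lands close to that study's full information of 1,190 deaths.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(margin, alpha=0.025, power=0.90, alloc=0.5):
    """Approximate number of events needed to reject inferiority
    (H0: hazard ratio >= margin) when the true ratio is 1, using
    Schoenfeld's formula. alloc is the fraction randomized to the
    new treatment."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # one-sided type I error
    z_beta = z.inv_cdf(power)        # power = 1 - type II error
    return ceil((z_alpha + z_beta) ** 2
                / (alloc * (1 - alloc) * log(margin) ** 2))

# Margin 1.176, one-sided alpha 0.025, 80% power:
print(required_events(1.176, power=0.80))  # ~1,195 events
```

Because the margin sits close to 1, log(margin) is small and the event requirement is far larger than in a typical superiority trial; halving the margin's distance from 1 roughly quadruples the required number of events.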
Percent Retention Method

The fixed margin method assumes that prior data or other considerations support an absolute standard for the noninferiority margin. However, there is always some uncertainty in the estimated benefit of the standard treatment, both in terms of statistical variability in the estimates of benefit from prior studies and from other differences between the current study and the prior studies that established the benefit of the standard, such as differences in patient population or supportive care. A more appropriate approach may be the percent-retention method proposed by Rothmann et al. (6), which takes into account the uncertainty in the benefit of the standard treatment to ensure that a conclusion of noninferiority establishes that the new treatment is superior to best supportive care by at least a specified percent of the benefit of the current standard over best supportive care. This method compares the two-sided 95% confidence interval for comparing the new treatment to the active control in the noninferiority study to the confidence interval for the hazard ratio from a meta-analysis of the control compared to placebo, with the confidence level adjusted to preserve the overall type I error rate of the study.

Hybrid Design

If a new therapy has a milder toxicity profile and an expected modest benefit in efficacy compared to the standard treatment, a hybrid design (7) may be appropriate. In hybrid designs, a noninferiority margin Δ > 1 is specified as in the traditional fixed margin design, and the null hypothesis of inferiority (Θ = Δ) is used, but the sample size is chosen to have adequate power to detect an alternative ratio of Θ = ΘA < 1. Since the difference between Δ and ΘA is larger than between Δ and 1, hybrid designs require smaller sample sizes than true noninferiority designs. However, they are also underpowered for concluding noninferiority when the new treatment is only equivalent to (rather than superior to) the current standard. These designs allow for a formal test of superiority in addition to the noninferiority test. First, the null hypothesis of inferiority is tested. If inferiority is rejected, then the null hypothesis of equivalence can be tested against the alternative of superiority. There is no inflation of the overall type I error rate, because the second test is only performed if the first is significant. This superiority test may be underpowered for realistic alternatives, though, unless this hypothesis has been considered in determining the sample size.

Examples of Study Designs
One study utilizing a traditional noninferiority design is the Eastern Cooperative Oncology Group-led intergroup phase III trial, TAILORx, which compares hormonal therapy alone to chemotherapy plus hormonal therapy in node-negative, hormone receptor positive patients with an Oncotype DX Recurrence Score (RS) of 11–25 (8, 9). Patients with low RS levels have been shown to have low absolute recurrence rates (10), and retrospective data suggest that the benefit of adjuvant chemotherapy is primarily in patients with high RS levels (11), so the purpose of the study is to examine the effect of chemotherapy in the population with intermediate RS values. Since the study population consists of patients meeting standard guidelines for treatment with chemotherapy, the study uses a noninferiority design to determine if hormonal therapy alone is as efficacious as hormonal therapy plus chemotherapy in this selected subset. In addition, patients with low risk disease (RS < 11) are directly assigned to hormonal therapy alone, to prospectively validate whether the recurrence rate in this group is as low as expected. This study uses the null hypothesis of equivalence (the second form discussed above) with inferiority as the alternative. Thus, under the null hypothesis, the addition of chemotherapy to hormonal therapy is assumed to have the same effect on disease free survival (DFS) as hormonal therapy alone. Since hormonal therapy is known to be beneficial in this population, the considerations discussed above for choosing the noninferiority margin small enough to ensure that it corresponds to benefit for the new therapy compared to best supportive care do not apply to this study. Rather, the noninferiority margin was chosen based on the minimum benefit that chemotherapy would have to have to be considered worth the side effects and cost. In extensive discussions with oncologists and patient advocates, it was agreed that a 3% absolute increase in the 5-year DFS event rate with hormonal therapy alone would be unacceptable. Since the 5-year DFS rate was expected to be 90% with chemotherapy, the noninferiority margin on the hazard ratio was set at a 32.2% increase in the DFS hazard rate (Δ = 1.322), since this increase in hazard corresponds to a decrease in the 5-year DFS rate from 90% to 87%, assuming proportional hazards. The design uses a one-sided type I error rate of 10% and a type II error rate of 5%, which reflects the reversal of the roles of the type I and type II errors when the null hypothesis of equivalence is used in a noninferiority design. The design calls for randomization of 4,390 patients, and full information will be reached when 534 DFS events have occurred (in this case, the large sample size is a result of the low event rate, rather than because of a small noninferiority margin). The primary comparison is also planned to use an intention-to-treat analysis, where patients are included with their assigned treatment group regardless of treatment received. If some patients assigned to chemotherapy do not receive it, and some assigned to hormonal therapy alone receive chemotherapy, there will be dilution of the treatment difference between the randomized arms, and consequently a loss of power and inflation of the type II error. The planned sample size includes an inflation of 10.8%, based on the Lachin-Foulkes adjustment (12), to compensate for an assumed treatment nonadherence rate of 2.5% (average on each arm).
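The hazard-ratio margin quoted for TAILORx can be reproduced directly from the two 5-year DFS rates. Under proportional hazards, S_new(t) = S_std(t)^Θ, so the implied hazard ratio is the ratio of the log survival probabilities (the helper name below is our own):

```python
from math import log

def implied_hazard_ratio(surv_std, surv_new):
    """Hazard ratio implied by survival probabilities at a common time
    point under proportional hazards: S_new(t) = S_std(t) ** HR."""
    return log(surv_new) / log(surv_std)

# TAILORx: a 3% absolute drop in 5-year DFS, from 90% to 87%:
print(round(implied_hazard_ratio(0.90, 0.87), 3))  # 1.322
```

This is how a clinically interpretable absolute margin (3% at 5 years) is translated into the hazard-ratio margin Δ = 1.322 used in the design.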
This dilution of the treatment effect is a critical issue in a noninferiority study, since it could result in an inferior treatment appearing noninferior, due to reduced power. Since controlling the type II error is critical in a noninferiority design using the null hypothesis of equivalence, it is important to closely monitor the treatment adherence rates and to adjust the design, if necessary, to ensure adequate power. In the case of TAILORx, preliminary data suggest that the nonadherence rate is higher than allowed in the design, and that a substantial increase in the planned sample size will be needed to ensure adequate power. A second example of a trial utilizing a traditional noninferiority design was published by Scagliotti et al., who reported results from a phase III trial comparing cisplatin/gemcitabine to cisplatin/pemetrexed as first-line therapy for patients with metastatic non-small-cell lung cancer (13). Because pemetrexed has a better safety profile and is easier to administer, this study was
designed as a noninferiority study. The noninferiority margin for the overall survival endpoint was set at 1.176 comparing cisplatin/pemetrexed to cisplatin/gemcitabine. This study used the null hypothesis of inferiority, defining inferiority to be a 17.6% increase in the mortality hazard rate with cisplatin/pemetrexed (the 17.6% increase corresponds to the cisplatin/gemcitabine hazard rate being 15% lower than the cisplatin/pemetrexed rate). Therefore, cisplatin/pemetrexed would be considered noninferior to cisplatin/gemcitabine if the upper bound of the two-sided 95% confidence interval for the overall survival hazard ratio was less than 1.176. This study design required 1,725 patients and full information of 1,190 deaths to provide adequate power to discriminate between hazard ratios of 1.176 and 1.0. The reported hazard ratio was 0.94 (95% CI: 0.84, 1.05), and the authors concluded that these two treatment regimens have similar efficacy in this patient population. Since the new treatment is being used in combination with the same drug used in the current standard of care (cisplatin), comparison to best supportive care is not relevant in this study. The current standard of care for first-line chemotherapy treatment of NSCLC consists of cisplatin or carboplatin doublets (usually in combination with a taxane, vinorelbine, or gemcitabine). These doublets have been shown to improve survival over the older control arm of cisplatin plus etoposide, and are thought to be roughly equivalent to each other. Instead of indirect comparison to best supportive care, the relevant question here is whether the noninferiority margin was sufficient to establish that adding pemetrexed to cisplatin improves survival over the older control arm of cisplatin plus etoposide, since the current standard treatments are known to do so.
A meta-analysis of survival outcomes of 4,556 patients randomized to gemcitabine plus platinum chemotherapy or other platinum therapy in 13 other trials resulted in an OS hazard ratio of 0.90 (95% CI: 0.84–0.96) in favor of gemcitabine (14); specifically, one of the studies included in this meta-analysis was a study of cisplatin/gemcitabine versus cisplatin/etoposide (15). The reported overall survival hazard ratio from this study of 135 patients with advanced NSCLC was 0.77 (95% CI: 0.55–1.10), but it should be noted that the primary endpoint of this trial was response rate. Neither of these results is consistent with a hazard ratio of 0.85 favoring gemcitabine/cisplatin, since the upper limits of both confidence intervals exceed this margin. In fact, none of the upper confidence bounds from the 13 trials in the meta-analysis is lower than 0.85, and therefore the choice of the noninferiority margin for the pemetrexed study seems suspect. Since the upper confidence limit of the overall survival hazard ratio in the Scagliotti trial (1.05) is
less than the upper bound reported in the meta-analysis (1.10), one could argue that the observed results of the pemetrexed trial may be sufficient to establish noninferiority. An example of a clinical trial utilizing a hybrid design is the phase III ECOG trial E1A06, comparing melphalan, prednisone, and thalidomide (MPT) to melphalan, prednisone, and lenalidomide (MPL) in newly diagnosed myeloma patients who are not candidates for high-dose therapy. Although MPT is considered the standard of care for this patient population because of its dramatic improvement in efficacy, thalidomide is a very toxic agent. Lenalidomide, an agent with similar activity to thalidomide, is less toxic than thalidomide, hence justifying the use of a noninferiority trial in this setting. Although it is anticipated that MPL will prove the superior regimen with improved PFS and lower toxicity, neither feature can be confidently predicted since long-term follow-up for the MPL regimen is lacking and myelosuppression may prove limiting and may compromise drug dosing and efficacy. Therefore, this trial aims to test the following hypotheses:

H0: Θ ≥ 1.220 (MPL is inferior to MPT)

HA: Θ < 1.220, with the study powered to detect ΘA = 0.935 (MPL is noninferior to MPT, but superiority is expected)

Here Θ represents the PFS hazard ratio for MPL versus MPT, and the noninferiority margin corresponds to the PFS hazard rate being 22% larger on MPL than on MPT. Instead of powering the comparison to detect a difference between ratios of 1.22 and 1.0, the hybrid design is only powered to conclude noninferiority if it turns out that the new treatment is slightly better than the standard. Additionally, the PFS hazard ratio for MPT versus MP is 0.51, with 95% CI (0.39–0.66) (16), which is consistent with the event rate on MP alone being at least 1/0.66 = 1.52 times the event rate on MPT.
Since the noninferiority margin of 22% is much less than the known magnitude of benefit of MPT over the previous standard of care, rejecting the null hypothesis of inferiority would be more than sufficient to establish that MPL was superior to melphalan and prednisone (MP) alone. The noninferiority margin is substantially smaller than the difference between MPT and MP because, in the judgment of the investigators, an increase of 22% in the PFS hazard rate over MPT would make the use of MPL inappropriate. Assuming PFS follows exponential distributions, the hazard ratios can be translated into differences in median PFS. The trade-off in toxicity associated with MPL would be considered acceptable as long as the median PFS for the MPL arm was not more than
5 months less than the median PFS of 28 months assumed for the MPT arm, even though MPL is expected to demonstrate a 2-month improvement in median PFS. With 560 patients to be randomized and full information of 366 events, this hybrid design has 80% power at a one-sided 0.05 significance level to test these hypotheses. Implementing a traditional noninferiority design for this trial with the same type I and type II errors as the hybrid design would require randomizing 1,003 patients and full information of 655 events.
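Assuming exponential PFS as stated above, the medians scale inversely with the hazard ratio, and the hybrid versus traditional event counts can be roughly checked with Schoenfeld's approximation. This is a sketch with our own function names; the approximate counts land near, though not exactly on, the protocol's 366 and 655 events, since the protocol's exact calculation presumably differs in its details.

```python
from math import ceil, log
from statistics import NormalDist

def exp_median(median_control, hazard_ratio):
    """Experimental-arm median when event times are exponential:
    medians scale inversely with the hazard ratio."""
    return median_control / hazard_ratio

def schoenfeld_events(hr_null, hr_alt, alpha=0.05, power=0.80):
    """Approximate events to discriminate two hazard ratios
    (Schoenfeld's formula, 1:1 allocation)."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha), z.inv_cdf(power)
    return ceil(4 * (za + zb) ** 2 / log(hr_null / hr_alt) ** 2)

# E1A06: 28-month median PFS assumed for MPT
print(round(exp_median(28, 1.220), 1))  # 23.0 (about 5 months less)
print(round(exp_median(28, 0.935), 1))  # 29.9 (about 2 months more)
# Hybrid design (powered at 0.935) vs. traditional design (powered at 1.0):
print(schoenfeld_events(1.220, 0.935))  # 350
print(schoenfeld_events(1.220, 1.0))    # 626
```

The comparison makes the design trade-off explicit: powering at ΘA = 0.935 rather than 1.0 roughly halves the event requirement, at the cost of inadequate power when the new regimen is merely equivalent.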
MONITORING

Randomized clinical trials require formal monitoring by a data monitoring committee (also known as a data safety monitoring committee) at planned intervals throughout their duration, but noninferiority trials may be subject to monitoring rules that differ from those established for superiority trials. Generally, noninferiority studies should include early stopping rules based on clear evidence of inferiority at interim analyses. On the other hand, there may not be a clear ethical imperative to stop early for evidence of noninferiority, and early stopping in favor of noninferiority should only be considered when there is strong evidence that the new treatment is not inferior to the standard. For these reasons, rules for early stopping in favor of noninferiority may be more conservative (if used at all) than typical early stopping rules in superiority studies. Given the importance of confidence intervals for interpreting the results of noninferiority studies, repeated confidence intervals (17) provide a natural approach for interim analyses in noninferiority studies. This approach uses a confidence interval on the treatment effect as the basis of the stopping rules, and adjusts the width of the intervals so that their intersection has the desired coverage probability. The monitoring rules for the TAILORx study illustrate these points. The rule for stopping in favor of inferiority uses a conventional O'Brien-Fleming boundary to control the type I error rate (for the null hypothesis of equivalence) at an overall one-sided 10% rate. The stopping rule in favor of noninferiority uses a two-sided 95% repeated confidence interval on the hazard ratio, and stops only when the ratio corresponding to the noninferiority margin lies above the upper confidence limit. When this occurs, the hypothesis of inferiority is rejected at the conventional one-sided 2.5% level (taking into account the multiple interim analyses).
This is stronger evidence than required at the final analysis, since the study is designed to have 95% power rather than 97.5%. Because noninferiority studies typically enroll large numbers of patients with very lengthy follow-up
periods, noninferiority studies may take a much longer time to complete. For a patient facing a treatment decision, early release of results from a noninferiority study may help guide the choice of therapy. Korn et al. (18) outlined a set of guidelines by which noninferiority studies may undergo early data release by a data monitoring committee without harm to the conduct or error structure of the study. The proposed method is not data driven and has some useful practical implications that should be considered before implementation of any noninferiority design. The guidelines require that accrual be terminated, that all patients be off treatment, and that a considerable follow-up period remain before any data can be released. It is also important that both of the therapies being studied in the trial be publicly available at the time of early data release; if an experimental agent is being compared to the standard of care, most patients outside the trial are probably receiving the standard of care anyway, and early release of data will not affect their course of treatment since the other agent is unavailable. If data are to be released early, it must seem unlikely that patients will modify treatment once the trial results are made available, and all results should be clearly communicated as preliminary, accompanied by confidence intervals and descriptive statistics about follow-up times. The main problem with early release of results from a noninferiority study is not unique to this type of trial: because data cleaning and analysis must take place each time results are released, early release can substantially add to the workload of data managers. Another argument against these guidelines is that early release departs from the trial's formal monitoring plan, and the reliability of the early results, in the absence of established stopping rules, is open to challenge.
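The repeated-confidence-interval idea can be sketched as follows. At an interim look with information fraction t (the fraction of planned events observed), an O'Brien-Fleming-shaped procedure widens the interval by replacing the usual critical value with c/√t. The constant c must be calibrated, for example by simulation or from the tables in Jennison and Turnbull (17), so that the sequence of intervals has the desired joint coverage; the value c = 2.24 below is only a placeholder, and the function name is ours.

```python
from math import exp, sqrt

def repeated_ci(log_hr, se, info_frac, c=2.24):
    """Two-sided repeated confidence interval for a hazard ratio at an
    interim analysis, using an O'Brien-Fleming-shaped critical value
    c / sqrt(information fraction). The constant c is a placeholder
    and must be calibrated for the actual design."""
    crit = c / sqrt(info_frac)
    return exp(log_hr - crit * se), exp(log_hr + crit * se)

# Early looks yield very wide intervals, so stopping early in favor of
# noninferiority (margin above the upper limit) demands strong evidence.
for t in (0.25, 0.50, 0.75, 1.00):
    lo, hi = repeated_ci(log_hr=0.05, se=0.10, info_frac=t)
    print(f"information {t:.2f}: ({lo:.2f}, {hi:.2f})")
```

The shrinking widths show why early claims of noninferiority under this scheme are conservative: the margin must clear a much wider interval at the first looks than at full information.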
ANALYSIS

Choosing the appropriate analysis population in noninferiority studies can be quite a challenge because inherent flaws in the study conduct may bias results toward the conclusion of noninferiority (19). The intent-to-treat (ITT) principle induces conservatism into the analysis of superiority trials, since the analysis includes all randomized patients. The inclusion of nonadherent patients biases results toward the null hypothesis of no treatment difference; nonadherence in a noninferiority study, therefore, biases results toward treatment similarity. The per-protocol (PP) population, which consists of all patients who complete treatment without major protocol violations, is sometimes viewed as the more appropriate analysis set for a noninferiority trial, but a review of equivalence studies by Ebbutt et al. demonstrated that the conservatism of the PP results, reflected in wider confidence intervals around point estimates, is due entirely to smaller sample sizes rather than to a lack of bias from noncompliance (20). Therefore, it is often required to demonstrate noninferiority in both the ITT and PP analysis populations to declare success in such a study. TAILORx again illustrates these points. With nonadherence rates possibly as large as 10% to 15%, there will be substantial reduction in power in the ITT analysis. However, the prognosis of patients choosing to receive chemotherapy when assigned to hormones alone may be different than those who choose not to receive chemotherapy when assigned to receive it, so performing a PP analysis, which excludes both of these groups, could also bias the comparison of the treatment arms.

INTERPRETATION

When the results of a phase III trial demonstrate strong statistical evidence for rejecting the null hypothesis, interpretation of the results is relatively easy. For example, when the null hypothesis is inferiority, such an instance allows one to conclude that the experimental agent is noninferior or even superior to the standard of care. Noninferiority can be declared if the confidence interval for the estimate of the primary endpoint excludes the noninferiority margin in favor of improved outcomes, and if this interval also excludes the null value in favor of even better outcomes, then superiority can be declared (Fig. 13.2).

FIGURE 13.2 [Schematic: 95% confidence intervals for the hazard ratio positioned along an axis from "new treatment better" to "new treatment worse," relative to 1 and the margin Δ, with regions labeled Superiority, Noninferiority, Inconclusive, Inconclusive, and Inferiority.] Conclusions to be made based on where the 95% confidence intervals for the hazard ratio fall in relation to the noninferiority margin (Δ).

When the data are not convincing enough to declare superiority or noninferiority, the most important thing to remember is that failure to reject the null
hypothesis is not the same as accepting the null hypothesis, and the clinical interpretation of a trial in these circumstances can be challenging. If the confidence interval around the estimate of the primary endpoint excludes the noninferiority margin in the direction of worse outcomes, then inferiority should be concluded. Confidence intervals containing the noninferiority margin are inconclusive and should be described as such in a conservative manner.
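One simple reading of Figure 13.2 and the rules above can be written out explicitly. The function below is our own sketch (the labels follow the figure); it classifies the outcome from the 95% confidence interval for the hazard ratio of the new treatment versus the standard and the margin Δ > 1:

```python
def ni_conclusion(ci_lower, ci_upper, margin):
    """Classify a noninferiority comparison from the 95% CI for the
    hazard ratio (new vs. standard) and a margin Delta > 1."""
    if ci_upper < 1:
        return "superiority"      # whole CI favors the new treatment
    if ci_upper < margin:
        return "noninferiority"   # margin excluded, toward better outcomes
    if ci_lower > margin:
        return "inferiority"      # margin excluded, toward worse outcomes
    return "inconclusive"         # CI contains the margin

# Scagliotti et al.: HR 0.94 (95% CI 0.84-1.05), margin 1.176
print(ni_conclusion(0.84, 1.05, 1.176))  # noninferiority
```

Note that the classification depends only on where the interval falls relative to 1 and Δ, which is why presenting the confidence interval, rather than a bare p-value, is emphasized throughout this chapter.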
Concluding Noninferiority from a Superiority Trial

Concluding noninferiority from a superiority trial that failed to meet its primary endpoint is a controversial practice that occurs too often in the reporting of clinical trials. Superiority trials are typically underpowered for a formal test of noninferiority, so failure to demonstrate that a new therapy provides a statistically significant benefit compared to a standard treatment should not be interpreted as clinical equivalence. Nevertheless, such claims are common: in a study by Greene et al., 78% of clinical trials concluding equivalence between 1992 and 1996 did not meet the authors' statistical criteria for a genuine noninferiority study (defining the noninferiority margin, implementing the appropriate sample size, and conducting a formal test for noninferiority) (21). Herein lies the debate over the need for concordance between the clinical and statistical interpretation of the data. Concluding noninferiority from a superiority trial can still be problematic even if a noninferiority boundary was prespecified in the protocol (22), but prespecification does not impact the sample size or primary hypothesis test of the study. Rather, the confidence interval approach should be used to consider the precision of the estimates with regard to the prespecified noninferiority margin, which should always be clinically relevant and justified. In the absence of a prespecified noninferiority margin, it remains a challenge for the oncology community to avoid the assumption of equivalence when a superiority study fails to meet its endpoint, but it is also unreasonable to expect that each failed superiority trial should be followed by a formal noninferiority study of the same agents. Resources and patients are too limited to do this, and some argue that even the best-conducted noninferiority trial does little to drastically improve outcomes for cancer patients.
References
1. U.S. Department of Health and Human Services, Food and Drug Administration. International Conference on Harmonization Guideline: Guidance on Choice of Control Group and Related Design and Conduct Issues in Clinical Trials. ICH E10. July 2000.
2. U.S. Department of Health and Human Services, Food and Drug Administration. Guidance for Industry: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics. May 2007.
3. Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treat Rep. 1978;62:1037–1040.
4. Piaggio G, Elbourne DR, Altman DG, et al. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA. 2006;295:1152–1160.
5. Zee BC. Planned equivalence or noninferiority trials versus unplanned noninferiority claims: are they equal? J Clin Oncol. 2006;24:1026–1028.
6. Rothmann M, Li N, Chen G, et al. Design and analysis of noninferiority mortality trials in oncology. Stat Med. 2003;22:239–264.
7. Freidlin B, Korn EL, George SL, Gray R. Randomized clinical trial design for assessing noninferiority when superiority is expected. J Clin Oncol. 2007;25:5019–5023.
8. Sparano JA. TAILORx: trial assigning individualized options for treatment (Rx). Clin Breast Cancer. 2006;7:347–350.
9. Sparano JA, Paik S. Development of the 21-gene assay and its application in clinical practice and clinical trials. J Clin Oncol. 2008;26:721–728.
10. Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351:2817–2826.
11. Paik S, Tang G, Shak S, et al. Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol. 2006;24:3726–3734.
12. Lachin JM, Foulkes MA. Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics. 1986;42:507–519.
13. Scagliotti GV, Parikh P, von Pawel J, et al. Phase III study comparing cisplatin plus gemcitabine with cisplatin plus pemetrexed in chemotherapy-naive patients with advanced-stage non-small-cell lung cancer. J Clin Oncol. 2008;26:3543–3551.
14. Le Chevalier T, Scagliotti G, Natale R, et al. Efficacy of gemcitabine plus platinum chemotherapy compared with other platinum containing regimens in advanced non-small-cell lung cancer: a meta-analysis of survival outcomes. Lung Cancer. 2005;47:69–80.
15. Cardenal F, Lopez-Cabrerizo MP, Anton A, et al. Randomized phase III study of gemcitabine-cisplatin versus etoposide-cisplatin in the treatment of locally advanced or metastatic non-small-cell lung cancer. J Clin Oncol. 1999;17:12–18.
16. Facon T, Mary JY, Hulin C, et al. Melphalan and prednisone plus thalidomide versus melphalan and prednisone alone or reduced-intensity autologous stem cell transplantation in elderly patients with multiple myeloma (IFM 99-06): a randomised trial. Lancet. 2007;370:1209–1218.
17. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman and Hall/CRC; 2000.
18. Korn EL, Hunsberger S, Freidlin B, et al. Preliminary data release for randomized clinical trials of noninferiority: a new proposal. J Clin Oncol. 2005;23:5831–5836.
19. Sanchez MM, Chen C. Choosing the analysis population in noninferiority studies: per protocol or intent-to-treat. Stat Med. 2006;25:1169–1181.
20. Ebbutt AF, Firth L. Practical issues in equivalence trials. Stat Med. 1998;17:1691–1701.
21. Greene WL, Concato J, Feinstein AR. Claims of equivalence in medical research: are they supported by the evidence? Ann Intern Med. 2000;132:715–722.
22. Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. JAMA. 2006;295:1147–1151.
This page intentionally left blank
14
Bayesian Designs in Clinical Trials
Gary L. Rosner B. Nebiyou Bekele
In this chapter, we discuss issues that arise when developing, writing up, and implementing clinical study designs that incorporate Bayesian models and calculations. We have had the opportunity to work with many such designs at The University of Texas M. D. Anderson Cancer Center. We feel that study designs that incorporate Bayesian models offer many advantages over traditional frequentist designs, and we will discuss these advantages in this chapter. At the same time, Bayesian models require a lot of thought and close work with the clinical investigators. Also, there may be some reluctance on the part of some clinical investigators to accept a study design that is built on Bayesian considerations. We will provide some arguments and real examples that may help statisticians overcome such reluctance. Although our examples tend to come from the field of oncology, the lessons and underlying ideas have broad application. (See Carlin and Louis [1] for a general introduction to Bayesian methods.)
WHY BAYESIAN DESIGNS

What Are Bayesian Designs?

Types of Bayesian Designs

First we need to define what we mean by a Bayesian design. In the first paragraph, we specifically avoided writing the term “Bayesian design,” choosing instead
the phrase “clinical study designs that incorporate Bayesian models and calculations.” The latter phrase allows us to include many designs that are not fully Bayesian, meaning that they do not choose the design to minimize some risk. Instead, many of these “calibrated Bayes” (2) designs incorporate a Bayesian model, possibly considering prior information, in the stopping rules of the study. An example of this calibration is the following. The statistician and clinical investigators decide on the general form of the criteria for decisions at interim analyses, such as basing decisions on the posterior probability that the treatment’s success probability exceeds a threshold value. Next, the statistician will typically carry out a large number of simulations under various scenarios. The statistician reviews the simulation results with the clinical investigators, allowing them to decide on the criteria that yield the best (to their minds) operating characteristics. This process may include changing the benchmark value against which one compares the posterior treatment-related success probability or the degree of certainty (e.g., 80% or 90%) that one will require before one will consider stopping the study. There also exist more formal Bayesian designs for clinical trials. Berry argues for the application of decision theory in clinical trial design (3, 4). Even if one takes a fully Bayesian view, one will still find that reviewing these a priori simulations serves to make the
transition from frequentist designs to Bayesian ones easier for clinical investigators. Simulations under various scenarios also help reveal the sensitivity of the study's decisions and inferences to prior assumptions. We discuss these ideas later in this chapter through our examples.

Advantages of Bayesian Designs

Why is one interested in Bayesian designs for clinical trials? One can view a clinical trial as an experiment that will lead to a decision (use the new treatment or do not use the new treatment) or a prediction (the new treatment regimen will provide a benefit of so much over the standard treatment). Bayesian methods are ideal for decision making (i.e., minimizing risk or maximizing utility) and for prediction. Additionally, Bayesian methods are ideal for combining disparate sources of information. Thus, one can construct a coherent probability model that combines the information from the current study with historical data and with any information available from ongoing studies. Perhaps a further impetus to the current interest in Bayesian designs is the fact that Bayesian inference obeys the likelihood principle. Many of the clinical studies we see include interim analyses, and when there is no provision for interim analyses, we often suggest them in our reviews. The likelihood principle is important as it relates to interim analyses of an ongoing study. One develops frequentist stopping rules, such as group sequential designs (5), in a way that preserves the overall type I error under the null hypothesis. Thus, a treatment effect that might have been statistically significant without any prior interim analyses may not be significant after accounting for the number of prior analyses. The likelihood principle, however, requires that data that lead to the same likelihood for the parameter of interest should lead to the same inference (6).
A consequence of the likelihood principle is that the number of interim analyses does not affect Bayesian inference, since the likelihood is the same whether the current analysis had been the first or the most recent of several earlier analyses. All that matters to the Bayesian are the data at hand and not what happened before, unless earlier analyses somehow alter the likelihood. Another reason more and more clinical trials are incorporating Bayesian ideas is the desire in many situations to include adaptive randomization. Such clinical trials change the randomization probabilities in light of the accruing data. The study may start randomizing patients to the different treatments with equal probability. Then, perhaps after enrolling some minimum number of patients, the randomization probabilities adapt to favor the better performing treatments.
Bayesian methodology may enter the study by way of using posterior probability calculations to influence the randomization probabilities (7). The ethical idea is to reduce the number of patients who receive inferior treatment while still accruing convincing evidence within the clinical trial. (There have been interesting discussions of the ethics of randomization and adaptive randomization (8–12), but we do not discuss this aspect of clinical trial design here.) We discuss an example of Bayesian adaptive randomization later in this chapter.
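To illustrate, here is a minimal sketch of posterior-probability-driven adaptive randomization for binary outcomes. The arm names, uniform prior, and counts are illustrative assumptions of ours, not any particular trial's data:

```python
import random

def prob_better(a1, b1, a0, b0, ndraw=20000, seed=1):
    """Monte Carlo estimate of P(p1 > p0) for independent
    Beta(a1, b1) and Beta(a0, b0) posteriors."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a1, b1) > rng.betavariate(a0, b0)
               for _ in range(ndraw))
    return hits / ndraw

# Uniform Beta(1, 1) priors updated with observed (successes, failures).
control = (1 + 12, 1 + 18)        # 12/30 successes on the control arm
arms = {"A": (1 + 20, 1 + 10),    # 20/30 successes
        "B": (1 + 13, 1 + 17)}    # 13/30 successes

# Posterior probability that each experimental arm beats control ...
post = {k: prob_better(*v, *control) for k, v in arms.items()}
# ... rescaled into randomization probabilities for the next patient.
total = sum(post.values())
rand_prob = {k: p / total for k, p in post.items()}
```

In a real design one would typically add a burn-in period of equal randomization and bound the randomization probabilities away from 0 and 1, in keeping with the ethical considerations discussed above.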
REQUIREMENTS FOR A SUCCESSFUL BAYESIAN DESIGN

As with all clinical studies, considerable work has to go into the preparation of the study design. The statistician and the clinical investigators need to discuss the study's aims and objectives. Care must go into selecting end points for the primary and secondary aims of the study. Many of these considerations are discussed elsewhere in this volume, so we will focus on the aspects that relate to the Bayesian part of the design. In particular, we will talk about the prior distribution and stopping rules. Additionally, if one takes a decision-theoretic approach, one will have to consider the utility function that accounts for the study's aims. If one wishes to calibrate the design, then one will have to review with the other investigators the implications of various decision-rule parameters for the operating characteristics.

Software for Real-Time Updating

Real-time updating is an important aspect of modern Bayesian trial designs. These designs incorporate early stopping rules, allowing the investigator to stop early for lack of efficacy, superiority, or excessive toxicity. For example, in a single-arm phase II study that will compare the progression-free survival (PFS) associated with a new treatment to historical information for one or several standard treatments, an investigator may desire to stop the study early if there is evidence that the new treatment results in worse outcomes than the historical standard. A Bayesian approach to this problem might assume that PFS follows an exponential distribution (with rate parameter θ) and, with a conjugate gamma prior, that θ follows, a posteriori, a gamma distribution. A common stopping rule under this setup is to stop the trial if at any point Pr(θ > θ* | Data) > C, where θ* usually represents some historical event rate and C is some prespecified threshold value. One computes this probability each time a new patient (or group
of patients) enters the study or when a patient already enrolled experiences disease progression. Typically, calculation of the above probability requires numerical integration, and one must develop statistical software to carry out the calculations necessary to monitor the accruing data and determine whether the interim stopping boundaries have been crossed. By software, we include R scripts or SAS macros written solely for use by the collaborating statistician, stand-alone desktop computer programs written for use by other statisticians, or even Web-based applications for use by nonstatistical research staff. The kind of application one develops is a function of who the end user will be and how often future studies may use the same sort of design. For example, it may be sufficient for the collaborating statistician to have a function that runs within a general-purpose statistical or mathematical package when carrying out a single-institution study that will evaluate outcomes for a rare disease with slow accrual, in which posterior updating will be necessary only every 4 to 6 weeks. In contrast, a rapidly accruing multicenter, multiarm study may require real-time updating via a Web-based application or a telephone voice-response system. Part of the work involved in implementing Bayesian methods is to determine the exact software needs of the particular study's design. Below we describe a set of commonly used trial designs that require real-time updating.
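As a minimal sketch of the kind of calculation such software performs, the exponential-gamma stopping rule described above can be coded directly; when the posterior shape parameter is an integer, the gamma tail probability reduces to a Poisson sum (the Erlang survival identity), so no numerical integration is needed. All numbers below are illustrative assumptions:

```python
import math

def gamma_tail(a, b, x):
    """Pr(theta > x) for theta ~ Gamma(shape a, rate b) with integer a:
    equal to the Poisson(b*x) CDF evaluated at a - 1."""
    lam = b * x
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(a))

def should_stop(events, total_time, a0, b0, theta_star, cutoff):
    """Apply the rule Pr(theta > theta* | data) > C under the conjugate
    Gamma(a0 + events, b0 + total_time) posterior for the exponential
    PFS hazard theta."""
    return gamma_tail(a0 + events, b0 + total_time, theta_star) > cutoff

# Hypothetical interim look: 9 progressions in 40 patient-months of
# follow-up, historical hazard theta* = 0.10 per month, a Gamma(2, 20)
# prior, and certainty threshold C = 0.90.
stop = should_stop(events=9, total_time=40.0, a0=2, b0=20.0,
                   theta_star=0.10, cutoff=0.90)
```

Wrapping such a function in a Web form, so that research staff can enter the current event count and follow-up time, is one way the real-time updating described above is delivered in practice.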
Types of Studies

Phase I Oncology Dose-Finding Study

Many drugs used in oncology are associated with severe toxicities and have a narrow therapeutic window, meaning that there is only a small range of doses that may be efficacious without being overly toxic. Therefore, the initial step in assessing these compounds in humans usually focuses on finding a dose that has an acceptable level of toxicity. Because one of the most important constraints on the conduct of these initial trials is the desire to limit the number of patients who experience severe toxicity, these studies are conducted with dose escalation proceeding in a sequential manner. That is, the study enrolls small cohorts of patients (e.g., three to six) and does not assign a higher dose until each patient in a given cohort has been through at least one cycle of treatment and their outcomes assessed. The toxicity outcomes observed from these (and earlier) patients may enter into an algorithm that the investigators use to select the dose for the next cohort. The purpose of this sequential approach is to decrease the chance that large numbers of patients receive doses that are too toxic.
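A sketch of one such dose-assignment algorithm, in the spirit of a one-parameter Bayesian dose-finding model (akin to the continual reassessment method); the skeleton, prior, target, and grid resolution are illustrative choices of ours, not any specific trial's design:

```python
import math

# Prior guesses ("skeleton") of toxicity risk at each dose, under the
# one-parameter power model p_i(a) = skeleton_i ** exp(a), a ~ N(0, 1.34^2).
SKELETON = [0.05, 0.10, 0.20, 0.35, 0.50]
TARGET = 0.25   # targeted risk of dose-limiting toxicity

def posterior_mean_risks(data, halfwidth=4.0, ngrid=401, sd=1.34):
    """Grid approximation to the posterior mean toxicity risk per dose.
    data is a list of (dose_index, had_toxicity) tuples."""
    step = 2 * halfwidth / (ngrid - 1)
    grid = [-halfwidth + i * step for i in range(ngrid)]
    weights = []
    for a in grid:
        logpost = -0.5 * (a / sd) ** 2           # normal prior kernel
        for dose, tox in data:
            p = SKELETON[dose] ** math.exp(a)
            logpost += math.log(p if tox else 1.0 - p)
        weights.append(math.exp(logpost))
    total = sum(weights)
    return [sum(w * s ** math.exp(a) for w, a in zip(weights, grid)) / total
            for s in SKELETON]

def next_dose(data):
    """Assign the dose whose posterior mean risk is closest to TARGET."""
    means = posterior_mean_risks(data)
    return min(range(len(means)), key=lambda i: abs(means[i] - TARGET))

# First cohort treated at dose index 1: one toxicity among three patients.
dose = next_dose([(1, True), (1, False), (1, False)])
```

A production design would add constraints this sketch omits, such as never skipping a dose level when escalating and a rule to terminate the study if even the lowest dose appears too toxic.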
The assumption underlying this approach in oncology, at least, is that toxicity and response are correlated through dose. That is, higher doses lead to an increase in the toxicity risk and an increase in the probability that a patient will respond to treatment. This assumption was historically reasonable in oncology, where one defined activity in terms of killing cancer cells. Thus, phase I oncology studies have traditionally attempted to determine the highest dose that has an acceptable toxicity level, since by assumption this dose will also lead to greater efficacy than lower doses. Bayesian phase I designs treat a patient's risk of toxicity at a given dose as a quantity about which the investigator has some degree of uncertainty. One quantifies this uncertainty via a probability distribution. Decisions to escalate the dose, continue with the current dose, or de-escalate from the current dose incorporate the most current data. Given what one has learned to date, one will treat the next patient with the dose with an expected risk of toxicity that is closest to a predefined target toxicity risk. In such a setting, Bayesian methods offer clear advantages. The Bayesian framework provides a means by which one can learn about toxicity risks at the different doses and naturally make decisions based on the data observed in a sequential manner. The increase in knowledge is reflected by a decrease in uncertainty as one moves from prior to posterior.

Phase II Adaptive Randomization Trials

Bayesian adaptive randomization designs successively (as patients are evaluated for outcome) modify the randomization probabilities based on either posterior or predictive probabilities favoring one treatment over another. In essence, data from patients previously enrolled and evaluated in a study are used so that patients currently enrolling onto the trial will have a higher probability of being randomized to the most efficacious treatments.
In these types of designs, subjects are initially randomized fairly (i.e., with equal probability) to the various (at least two) treatment arms. Since many adaptive randomization trials have a period in which patients are equally randomized prior to the implementation of adaptive randomization, it is important that the statistician monitor the actual randomization versus the expected randomization.

Other Trials

Other interesting and useful examples of successful Bayesian applications in the design of clinical trials include single-treatment phase II studies that consider efficacy and toxicity, with stopping rules based on both
end points (13–16). Another interesting innovation is the so-called seamless phase II/III design (17, 18). With this design, randomization begins within the context of a small phase II study that collects survival information but has an intermediate end point as the primary outcome. Based on early results with respect to the intermediate end point, however, the study may expand to a large randomized phase III study with survival as the primary outcome. Berry et al. discuss a design that simultaneously sought the best dose of a drug in an adaptive way and maintained a randomized comparison with placebo (19). Other examples exist in the literature (20). As mentioned earlier, the Bayesian inferential machinery fits well with decision theory. Once one has determined an appropriate utility function, one can set up the design to optimize the utility. Furthermore, one can carry out sequential decision making, either fully via backward induction (21) or by looking ahead one or a few steps. In all cases, one maximizes the utility, taking into account posterior uncertainty. Kadane (22) presents an interesting example of a clinical trial, describing the background and development of the study. The literature includes other examples of formal Bayesian designs (23–26). Rossell et al. (27) and Ding et al. (28) present decision-theoretic designs for phase II studies that screen a sequence of new treatments for activity.
Realistic Priors

Historical Priors

Often, there have been earlier studies with one or more of the agents under investigation in the current study. These data usually inform the study's design, either informally (as in determining the null and alternative hypotheses in frequentist designs) or formally via a prior distribution. One may find, however, that if one assumes that the current study's patients will be exchangeable with the historical information, the historical information will be extremely informative with respect to inference during the current study. In fact, in some cases, it may well be that there is little reason to embark on the current study, given the evidence in the historical information. (In many situations, it may well be appropriate to consider whether there really is a need for the current study, given the strength of historical evidence. That is a topic for another discussion, however.) Since the current study will go forward, one has to find a way to discount the historical information
or choose not to assume that the patients in the current study are exchangeable with the earlier studies. If we consider a binary outcome, such as treatment success or failure (however defined), then we might characterize the historical data by means of a beta distribution. For example, if an early study enrolled 50 patients, and 30 patients experienced a treatment success, we might characterize the uncertainty about the treatment’s underlying success probability by a beta distribution with parameters equal to 30 and 20. One might think of this prior as the posterior distribution arising from an experiment that gave rise to these data and a fully noninformative beta[0,0] prior. (Alternatively, one could consider an initial uniform[0,1] prior or a Jeffreys beta[0.5, 0.5] prior and determine a posterior beta distribution with slightly different parameters[1].) Now, one might feel that the beta[30,20] prior is too informative for this study. For example, this distribution has 95% of the central mass between 0.46 and 0.73. If one wants to entertain the possibility of smaller success probabilities than 0.4, then one may want to discount this prior data in some way. A natural way to keep the prior mean 0.6 but increase the uncertainty is to decrease the prior sample size. For example, one might choose to reduce prior information to the equivalent of a prior sample size of 5 by way of a beta[3, 2] distribution. Now the central 95% of the mass lies between 0.19 and 0.93. A related approach for discounting the historical information is with a power prior (29, 30). The power prior extends the notion of discounting to a general class and allows for inference with respect to the degree of discounting. Briefly, one considers a parameter in the probability model that will characterize the level of discounting for the historical information. 
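These interval statements are easy to verify numerically; for integer parameters the beta CDF can be evaluated through a binomial tail identity, so a short script suffices (a verification sketch of ours, not trial software):

```python
import math

def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for integer a, b, via the identity
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(a, n + 1))

def central_interval(a, b, mass=0.95, tol=1e-8):
    """Equal-tailed credible interval found by bisection on the CDF."""
    def quantile(q):
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if beta_cdf(mid, a, b) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    tail = (1 - mass) / 2
    return quantile(tail), quantile(1 - tail)

lo_full, hi_full = central_interval(30, 20)   # historical beta[30, 20]
lo_disc, hi_disc = central_interval(3, 2)     # discounted beta[3, 2]
```

The discounted beta[3, 2] prior keeps the mean at 0.6 while widening the central 95% interval substantially, which is exactly the effect of reducing the prior sample size.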
The basic idea of the power prior is that the more similar the prior and current data are, the less discounting that takes place and vice versa. Let L(θ | D) represent the likelihood function that will characterize the data at the end of the current study (i.e., after collecting the data represented by D). Using the same likelihood function with the historical data DH, the power prior is p(θ | DH, δ) ∝ L(θ | DH)^δ p(θ | φ), where the parameter φ is a hyperparameter for an optional initial prior. The parameter δ will serve to discount or down-weight the information content of the historical data when one will apply this prior to carry out posterior inference in the analysis of the current study. Another way people have discounted prior information is less direct: they have modified parameters in the stopping rules to make it more difficult to stop early. In other words, one uses the historical information to generate an informative prior but makes the
cutoff for early stopping more stringent than perhaps one would normally consider reasonable. For example, if one is basing the stopping rule on a criterion based on the posterior probability that some parameter or function of model parameters exceeds a threshold, one may require a very high probability (e.g., 99%) of this event before considering early stopping. Making the stopping rule more stringent basically provides a way to keep the prior from dominating early decision making and allows the current study to continue accumulating data. The process of determining the boundary criteria often proceeds iteratively. One determines the criteria for early stopping by carrying out simulations under various scenarios and then deciding which stopping rules lead to satisfactory operating characteristics. Although such devices tend to make the designs acceptable to frequentists, because of the calibrated operating characteristics, they also may tend to undermine the benefit of the underlying Bayesian model. The historical information may become almost neglected or, at most, these data enter into the design as a formality without giving full consideration of their importance to the inferential question under investigation.

Elicitation of Experts

Elicitation of priors from experts would seem a reasonable approach, especially in the absence of historical data. Carlin et al. (31) describe their experience eliciting prior information for a clinical trial. Problems may occur in a clinical trial for which the experts may have provided a prior that subsequently appears to be at odds with the data. An informative example is discussed by Carlin et al. in the context of a randomized clinical trial evaluating the benefit of prophylaxis against possible infection with toxoplasmic encephalitis (TE) (32). In this study, the five experts whose opinions went into the prior distribution turned out to have been overly optimistic. Each expert anticipated a treatment benefit.
Although there was widespread disagreement among these five individuals, none considered the possibility that the treatment would be no better than placebo, let alone worse. The key points resulting from these investigators' experience with this study are instructive. In particular, the experts may provide point estimates, but there is underlying uncertainty in each expert's opinion. Perhaps a mixture of these separate prior distributions will yield a more robust analysis than combining the experts' point estimates into a single prior. Another point brought out in this study was that different experts might find it easier to specify priors for the effect of the treatment on different end points. For
example, one expert was not able to provide a prior estimate of the effect of the treatment on the risk of death or TE, whereas the other four could and did. In our experience, it is also important that those whose opinions one seeks see the consequences of their a priori estimates. Graphical displays of uncertainty distributions or of observable quantities, given prior specification, allow the experts to gain insight into the implications of their stated beliefs (16, 33). Quite often, this feedback reveals inconsistencies and leads to revisions. Thus, one has to be careful about incorporating expert opinion into a prior distribution for a clinical trial's design.

Operating Characteristics

One of the biggest challenges to utilizing Bayesian methods when designing studies is having software available to assess the operating characteristics of a design. For any Bayesian design used in practice, the collaborating statistician must provide operating characteristics that summarize the behavior of the proposed method under a wide variety of situations (called scenarios). Because these designs typically involve complex models and decision rules, one has to carry out simulations to evaluate the operating characteristics of the proposed design. Some of the characteristics that one typically summarizes are the number of patients assigned to each treatment, the probability of selecting each dose as most efficacious, the probability of stopping a trial if all treatments are too toxic, etc. The statistician typically considers a wide variety of possible scenarios ranging from very pessimistic, such as the case when no treatment provides any benefit, to optimistic cases in which several of the treatments are effective.
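As a sketch of such a simulation, the following estimates operating characteristics for a single-arm binary-outcome design with a posterior-probability futility rule; the thresholds, maximum sample size, and scenarios are illustrative assumptions:

```python
import math, random

def post_prob_exceeds(succ, fail, p0):
    """P(p > p0 | data) under a uniform prior: the Beta(1 + succ, 1 + fail)
    tail, computed via the binomial identity (integer parameters)."""
    a, b = 1 + succ, 1 + fail
    n = a + b - 1
    return sum(math.comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(a))

def simulate(true_p, p0=0.20, nmax=40, cohort=5, futility=0.05,
             ntrial=2000, seed=7):
    """Estimate P(stopping for futility) and mean sample size
    under one scenario for the true success probability."""
    rng = random.Random(seed)
    stops = n_used = 0
    for _ in range(ntrial):
        succ = fail = 0
        while succ + fail < nmax:
            for _ in range(cohort):
                if rng.random() < true_p:
                    succ += 1
                else:
                    fail += 1
            if post_prob_exceeds(succ, fail, p0) < futility:
                stops += 1
                break
        n_used += succ + fail
    return stops / ntrial, n_used / ntrial

null_stop, null_n = simulate(true_p=0.20)   # scenario: no better than history
alt_stop, alt_n = simulate(true_p=0.40)     # scenario: clearly active drug
```

Tables of such summaries, one row per scenario, are what the statistician reviews with the clinical investigators when calibrating the decision rules.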
Purpose of Checking Operating Characteristics (Calibration)

Controversy Surrounding Evaluation of Frequentist Properties

If one has chosen to demonstrate the frequentist characteristics of the Bayesian design, then one will have to simulate the design under different scenarios. It may seem odd to want to evaluate the frequentist characteristics of a proposed Bayesian design, but some reasons are as follows. First, one may want to convince the non-Bayesian audience that the proposed design offers benefits over standard frequentist designs without incurring a loss in terms of the frequentist characteristics. For example, some sequential designs base their stopping rules on posterior probability calculations, such as Prob(treatment difference > delta | Data) > cutoff. One
can certainly view these posterior probabilities as test statistics, being functions of the data, even though they differ from more common test statistics. Thus, one can evaluate the operating characteristics. Another reason one might want to estimate the operating characteristics of the proposed design is to evaluate how robust the design is under different scenarios. If one feels that the prior distribution is based on rather limited historical information, for example, then one might want to ensure that the prior does not overly dominate inference in certain situations.

Potential Pitfalls

Potential pitfalls include not stopping when one should, stopping a study and later regretting it, and the often perceived possibility that the study's Bayesian analysis will not receive widespread acceptance. The surest way to avoid these problems is to carry out simulations under many, many different scenarios.

EXAMPLES OF BAYESIAN DESIGNS

What Worked and Why

We have seen dozens of successful Bayesian clinical trials at the M. D. Anderson Cancer Center. One characteristic that has contributed to successful implementation is a schedule of regular meetings between the statisticians and the clinical research staff during the trial's design stage. The meetings serve to educate both groups to the other's needs and perspectives. After initiation of patient enrollment, meetings between the research staff and the statistician continue for the purpose of interim review of the trial's progress. Also, the statistician should provide some data management oversight to ensure that the database accurately reflects the trial data. Clear communication between the clinical investigators and statisticians with respect to what a design can and cannot do is essential. It is also vitally important for the statistician to test the computer code and interface to ensure everything is working properly. Is the program computing the posterior probabilities correctly?
Do the results and recommendations in different hypothetical situations make sense mathematically and clinically? Is the user interface (for example, a stand-alone graphical user interface or a Web-based application) intuitive and easily navigated by the individuals who will be using it? Does the interface perform appropriately? These are important questions to address while preparing the protocol and well before the study enrolls the first patient if one wants to realize the full potential of the Bayesian design. When clinical studies with Bayesian designs work well, the
benefits of these designs are very much appreciated by the collaborating investigators. Below we give three examples of clinical studies from our institution (from a potential list of dozens).

Correlated Ordinal Toxicity Monitoring in Phase I

In this example, investigators used a Bayesian design within a new statistical framework for dose-finding based on a set of qualitatively different, ordinal-valued toxicities (34). The objective of this trial was to assess the toxicity profile associated with the anticancer drug gemcitabine when combined with external beam radiation to treat patients with soft-tissue sarcoma. The study's design allowed for possible evaluation of a total of 10 gemcitabine doses, combined with a fixed dose of radiation. Traditionally, phase I studies in oncology consider a binary end point as the primary outcome. This binary end point is an indicator of whether or not each patient experienced a dose-limiting toxicity, as defined in the protocol. This single end point reduces all toxicity information across grade or severity of the toxicity and across organ systems into a single yes-or-no outcome. (Berry et al. discuss the use of a hierarchical model to borrow strength across types of toxicities within organ systems in the context of drug safety monitoring [35].) In most phase I oncology settings, however, the patient is at risk of several qualitatively different toxicities, each occurring at several possible levels of severity. Moreover, the different toxicities often are not of equal clinical importance. The design of this soft-tissue sarcoma phase I study represented a radical departure from conventional phase I study design in oncology. It was based on an underlying probability model that characterized the relationship between dose and the severity of each type of toxicity. The model included a set of correlated normally distributed latent variables to induce associations among the risks associated with the different toxicities.
Additionally, there were weights or numerical scores to characterize the importance of each level of each type of toxicity. The statistician met with the physicians prior to initiation of the trial to elicit from them these scores. An algorithm combined the scores associated with each type and level of toxicity with the probability of observing each particular type and level of toxicity. This algorithm produced a weighted average toxicity score. This weighted average toxicity score informed decisions about doses for successive cohorts of patients in this phase I study. Concerns expressed by the oncologists motivated the development of this design. The clinicians wanted a dose-finding method that would account for the fact
that, clinically, the toxicities that they had identified are not equally important. Additionally, the different toxicities do not occur independently. The investigators also requested that the dose-finding method utilize the information contained in the grade or severity of an observed toxicity. That is, if patients experience a low-grade toxicity at a given dose, while not dose limiting, this event suggests that higher doses may be more likely to lead to a higher grade of that toxicity. The Bayesian framework of this study's design was capable of addressing all of the investigators' concerns regarding characterization of toxicity while also incorporating key design aspects required for institutional approval of the protocol, such as early trial termination for excessive toxicity at the lowest dose. At the end of the study, the model recommended a dose to take forward into phase II, and the investigators were in complete agreement with this choice as the appropriate dose.

Joint Modeling of Toxicity and Biomarker Expression in a Phase I/II Dose-Finding Trial

In this example, the investigators used a Bayesian framework to model jointly a binary toxicity outcome and a continuous biomarker expression outcome in a phase I/II dose-finding study of an intravesical gene therapy for treating superficial bladder cancer (36). Since the toxicity and efficacy profiles of the gene therapy were unknown, the investigators proposed a phase I/II dose-finding study with four possible doses. This trial's motivation was partially attributable to the increasing use of biomarkers as indicators of risk or as surrogate outcomes for activity and efficacy. In many contexts, the biomarker is observable immediately after treatment, allowing the investigators to learn about the therapeutic potential of the compound without having to wait months or even years as survival data mature.
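One common way to link a binary toxicity end point and a continuous biomarker outcome is through a correlated latent Gaussian pair. The following sketch simulates patient outcomes under a hypothetical model of that general type; every parameter value here is an illustrative choice of ours, not the published trial's model:

```python
import math, random

def simulate_patient(dose, rho=0.6, rng=random):
    """Draw one patient's (toxicity, biomarker) pair from a hypothetical
    joint model: a bivariate standard normal latent pair with correlation
    rho links a probit toxicity indicator to a continuous biomarker
    change. All parameter values are illustrative."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
    toxicity = z1 > (1.5 - 0.4 * dose)   # higher dose -> more toxicity
    biomarker = 0.8 * dose + z2          # higher dose -> more modulation
    return toxicity, biomarker

rng = random.Random(11)
sample = [simulate_patient(dose=2, rng=rng) for _ in range(1000)]
tox_rate = sum(t for t, _ in sample) / len(sample)
mean_marker = sum(m for _, m in sample) / len(sample)
```

A dose-finding design would place posterior distributions on the dose-effect and correlation parameters of such a model and base escalation and stopping decisions on both outcomes jointly.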
Unlike conventional phase I studies, this study’s objective was to determine the best dose based on both biomarker expression and toxicity. This dual outcome required a joint model for the two end points. For ethical reasons, the study escalated doses between patients sequentially. An algorithm based on the joint model chose the dose for each successive patient using both toxicity and activity data from patients previously treated in the trial. The modeling framework incorporated a correlation between the binary toxicity end point and the continuous activity outcome via a latent Gaussian random variable. The dose-escalation/de-escalation decision rules were based on the posterior distributions of model parameters relating to toxicity and to activity. The study’s
stopping rule called for it to stop if the estimated risk of toxicity appeared excessive or if there was clear evidence that the treatment was not modulating the biologic marker. The Bayesian framework used in this study allowed for flexible modeling of some rather complicated outcomes. In addition, this framework provided a coherent mechanism for incorporating prior information into the modeling process. The study ended, in fact, when it became evident that the drug was not modulating the biologic marker.

Adaptive Randomization

Investigators wished to evaluate the effectiveness of combinations of three drugs (an immunosuppressive agent, a purine analog antimetabolite, and an antifolate) to prevent graft-versus-host disease (GVHD) after transplantation (37). The study used adaptive randomization and was to enroll a maximum of 150 patients. A success was defined in this study as “alive with successful engraftment, without relapse, and without GVHD 100 days after the transplant.” The design called for comparing each treatment to the control arm (i.e., the combination treatment with the immunosuppressive agent and antifolate) in terms of the probability of success in the following manner. Let p0 be the success probability in the control arm. Similarly, let p1, p2, p3, and p4 be the success rates in the four other treatment arms (three-drug combination treatments with varying doses of the purine analog antimetabolite). As information accrued about the treatments, the investigators altered the randomization probabilities from equal randomization to biased randomization based on the posterior probability that each treatment-specific success probability exceeded that of the control arm. That is, the randomization would adapt to favor treatments associated with success probabilities that were greater than that of the control via P(pk > p0 | data) (for k = 1, 2, 3, 4) after appropriate scaling.
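As an illustration of this kind of posterior-based weighting, the sketch below estimates P(pk > p0 | data) by Monte Carlo under independent Beta(1, 1) priors with binomial outcomes, then scales the results into randomization probabilities for the experimental arms. The interim counts, the priors, and the scaling rule shown here are illustrative assumptions only; they are not the model or tuning actually used in the trial (37), and in practice the control arm would also continue to receive patients.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_better_than_control(s_ctrl, n_ctrl, s_arm, n_arm, draws=100_000):
    """Monte Carlo estimate of P(p_k > p_0 | data) under independent
    Beta(1, 1) priors with binomial likelihoods (conjugate update)."""
    p0 = rng.beta(1 + s_ctrl, 1 + n_ctrl - s_ctrl, draws)
    pk = rng.beta(1 + s_arm, 1 + n_arm - s_arm, draws)
    return float(np.mean(pk > p0))

# Hypothetical interim data: (successes, patients) per arm.
control = (12, 20)
arms = [(10, 20), (14, 20), (15, 20), (11, 20)]

post = np.array([prob_better_than_control(*control, s, n) for s, n in arms])
rand_probs = post / post.sum()   # "appropriate scaling" so weights sum to 1
print(np.round(rand_probs, 3))
```

With these hypothetical counts, the arm with 15/20 successes draws the largest share of future patients, which is exactly the biased-randomization behavior described above.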
In addition, the study’s design allowed for early stopping based on predictive probabilities. Specifically, the investigators dropped a treatment arm if the predictive probability that its success probability would be greater than p0 was less than 0.05, given the data at hand and the data yet to accrue. The design was successful in that it limited the number of patients who received the inferior treatments to 18.2% of all of the 110 patients randomized to one of the four experimental arms. By contrast, a design that randomized patients equally to the treatments and did not allow for early stopping would have exposed 50% of patients to these ineffective therapies.
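An arm-dropping rule of this kind can be sketched as follows. Assuming Beta(1, 1) priors and binomial outcomes, the function simulates the data yet to accrue from the posterior predictive distribution and estimates the chance that the final posterior would still favor the arm over control. The interim counts, the superiority threshold `theta`, and the remaining sample size are all hypothetical; the actual trial (37) used its own model and cutoffs.

```python
import numpy as np

rng = np.random.default_rng(1)

def post_prob_superior(a_k, b_k, a_0, b_0, draws=20_000):
    """P(p_k > p_0) for independent Beta posteriors, by Monte Carlo."""
    return float(np.mean(rng.beta(a_k, b_k, draws) > rng.beta(a_0, b_0, draws)))

def predictive_prob(s_arm, n_arm, s_ctrl, n_ctrl, n_future,
                    theta=0.90, sims=500):
    """Predictive probability that, once n_future more patients accrue on
    each arm, the final posterior will satisfy P(p_k > p_0 | data) > theta.
    Beta(1, 1) priors; future outcomes drawn from the posterior predictive."""
    wins = 0
    for _ in range(sims):
        pk = rng.beta(1 + s_arm, 1 + n_arm - s_arm)
        p0 = rng.beta(1 + s_ctrl, 1 + n_ctrl - s_ctrl)
        fut_k = rng.binomial(n_future, pk)        # future successes, arm k
        fut_0 = rng.binomial(n_future, p0)        # future successes, control
        final = post_prob_superior(1 + s_arm + fut_k,
                                   1 + n_arm + n_future - s_arm - fut_k,
                                   1 + s_ctrl + fut_0,
                                   1 + n_ctrl + n_future - s_ctrl - fut_0)
        if final > theta:
            wins += 1
    return wins / sims

# Hypothetical interim data: a struggling arm (4/20) vs. control (12/20).
pp = predictive_prob(s_arm=4, n_arm=20, s_ctrl=12, n_ctrl=20, n_future=15)
print(pp)   # drop the arm if this falls below 0.05
```

For an arm this far behind control, the predictive probability is essentially zero, so the 0.05 rule would drop it, sparing future patients from an apparently inferior treatment.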
What Did Not Work and Why

When designing clinical studies, the collaborating statistician should be aware of potential pitfalls associated with the design or designs of choice. This is true of Bayesian designs, which may have some unique issues to consider. The most common difficulties include problems with the computer code, such as bugs that lead to incorrect posterior probability calculations; human error in data entry and management; and reconciling differences in how statisticians (or statistical models) define adequate evidence of treatment effects and how physicians define these effects. Below we give examples of three of these potential problems and discuss steps one can take to avoid them. Over time, Bayesian designs have found more application and become more complicated. While most of the designs developed in the early 1990s focused on binary end points, current implementations include models for time-to-event end points that include parameter effects for treatment, patient-specific covariates (e.g., patient’s risk of death), and covariate-by-treatment interactions (e.g., Xian et al. [38]). For very simple designs based on a binary end point, the data management requirements for posterior updating were relatively straightforward. These types of models only require keeping track of the number of patients in the trial and the total number of patients who have experienced the event of interest. In contrast, as the models have become increasingly complex, more data (and more data management) are required for calculation of posterior probabilities. As a consequence, this increase in data management can introduce data entry errors. For example, Maki et al. (39) describe a two-arm open-label phase II clinical study in sarcoma with tumor response as the primary end point. The study employed a Bayesian adaptive randomization procedure that accounted for treatment-by-sarcoma-subgroup interactions.
Specifically, the adaptive randomization scheme incorporated information on the type of sarcoma. After randomizing the first 30 patients equally to the two treatment regimens, the design called for adapting the randomization probabilities for subsequent patients to favor the better performing treatment, according to the accrued data. The investigators subsequently found that the initial recorded sarcoma subtypes for some patients were incorrect. The consequence of this incorrect labeling was that, for one sarcoma subtype, the probability of randomization to the top performing arm was less than it should have been, relative to the other treatment arms. While in this example all patients continued to have higher probability of randomization to the better performing treatment arm, it is conceivable that if such an error were
not discovered early, patients could have been randomized to inferior treatments. Therefore, it is extremely important that the statistician be involved with data-management oversight to ensure that such errors do not occur. One of the key considerations in designing Bayesian clinical trials involves navigating the relationship between the proposed Bayesian model and the realities of medical research. A model may indicate that one treatment confers benefit over another (calculated via posterior probabilities), but if one is claiming this benefit based on a very small number of patients, one is going to have a hard time convincing a medical audience that the results are robust (robust in the everyday English sense, not the statistical one). For example, Giles et al. (40) reported a phase II trial that randomized patients to receive one of three treatment regimens: idarubicin and ara-C (IA); troxacitabine and ara-C (TA); and troxacitabine and idarubicin (TI). The study’s Bayesian design adaptively randomized patients to the treatments. Initially, there was an equal chance for randomization to IA, TA, or TI, but treatment arms with higher success proportions progressively received a larger fraction of patients. The adaptive randomization led to a total of 18 patients randomized to the IA arm; 11 patients randomized to the TA arm; and just 5 patients randomized to the TI arm. The small sample size associated with the TI arm left this trial open to concerns that the results were not conclusive. This story is reminiscent of the controversy surrounding the early randomized trials of extracorporeal membrane oxygenation (ECMO) for neonates in respiratory failure. Two early ECMO trials (41, 42) included adaptive randomization algorithms that led to very few babies receiving the non-ECMO treatment.
In the end, a vocal part of the medical community seemed to think that these trials included too few patients treated conventionally (i.e., without ECMO) to justify making ECMO the standard treatment for neonates in respiratory distress. (See Ware and related discussion for more information about the ECMO trials [43].) Eventually, a randomized clinical trial without adaptive randomization in the United Kingdom demonstrated the benefit of ECMO (44). The lesson to learn is that one should ensure that the trial will include some minimum number of patients in all treatments (subject to safety assurances) before it begins to adapt the randomization in light of the accruing evidence. A common criticism voiced by some investigators with whom we have collaborated relates to recommendations based on Bayesian models that do not match the investigators’ expectations based on experiences with other designs. This tension is
exemplified in the context of dose escalation decisions in phase I studies in oncology. Although we described and illustrated Bayesian phase I oncology studies earlier in this chapter, most of these phase I studies use non-Bayesian algorithms for dose-finding, such as the 3 + 3 design (45). Their popularity is driven by the fact that clinicians can easily understand these trial designs, and the decision rules employed make intuitive sense. Yet, much is left unspecified in the implementation of these methods. For example, algorithmic designs implicitly target toxicity risks smaller than 33% (1 in 3) as being acceptable. In contrast, while Bayesian phase I designs may seem (to some clinicians) to be black boxes, these models make explicit the outcomes being targeted. In particular, all Bayesian designs explicitly specify a target probability of toxicity (usually between 25% and 33%). We believe that one of the main reasons this criticism occurs is a lack of communication between the statistician and the clinical investigator. This lack of communication may result, in part, from difficulty explaining these methods to non-statisticians (46). One way to overcome these difficulties is by making the underlying assumptions of the Bayesian model clear to the investigator. One can illustrate these assumptions by providing the investigator with sample trajectories of virtual trials simulated under different scenarios, in addition to providing the operating characteristics of the trial’s average behavior (as discussed earlier in this chapter). While potentially time consuming, this type of upfront examination and assessment before the study begins will help the clinician understand both the merits and limitations of the design and underlying model contained in the protocol.
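To make the idea of sample trajectories concrete, the sketch below simulates virtual runs of a simple two-arm adaptively randomized study under a chosen scenario. Everything here (Beta(1, 1) priors, the burn-in of equally randomized patients, the true success rates) is an illustrative assumption rather than any published design; the point is only to show the investigator what individual trial trajectories can look like.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_trial(true_rates, n_patients=60, burn_in=15, draws=5_000):
    """One virtual trajectory of a two-arm adaptively randomized trial:
    equal randomization for the first `burn_in` patients, after which the
    chance of assignment to arm 1 tracks P(p_1 > p_0 | data) under
    Beta(1, 1) priors. Returns per-arm sample sizes and success counts."""
    n = [0, 0]          # patients per arm
    s = [0, 0]          # successes per arm
    for i in range(n_patients):
        if i < burn_in:
            w1 = 0.5    # equal randomization during burn-in
        else:
            p0 = rng.beta(1 + s[0], 1 + n[0] - s[0], draws)
            p1 = rng.beta(1 + s[1], 1 + n[1] - s[1], draws)
            w1 = np.mean(p1 > p0)
        arm = int(rng.random() < w1)
        n[arm] += 1
        s[arm] += rng.random() < true_rates[arm]
    return n, s

# One scenario: arm 1 is truly better (40% vs. 20% success).
for rep in range(3):                      # three sample trajectories
    n, s = simulate_trial(true_rates=(0.20, 0.40))
    print(f"trajectory {rep}: allocation {n}, successes {s}")
```

Note that the burn-in period guarantees a minimum number of patients on each arm before adaptation begins, which addresses the lesson drawn from the ECMO experience above.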
SUMMARY OF RECOMMENDATIONS

In this chapter, we have illustrated the use of Bayesian methods in the design of clinical studies. Although we work with investigators interested in treating cancer, the examples illustrate ideas that are applicable in all disease areas. The main advantages of Bayesian ideas in the design of clinical trials are the inherent flexibility of Bayesian inference; the ease with which one can incorporate information from outside of the study, including measured outcomes of mixed types (e.g., continuous and discrete); the natural notion of evolving knowledge evinced by the transformation from prior uncertainty to posterior uncertainty based on observations; and the way the Bayesian methodology allows one to make decisions and maximize utility, taking into account all uncertainty captured in the
basic probability model. Although our examples concerned novel designs and new methodology, Bayesian ideas are applicable when designing any clinical study.
References

1. Carlin BP, Louis TA. Bayesian Methods for Data Analysis, 3rd ed. Boca Raton: Chapman & Hall/CRC; 2008.
2. Little RJ. Calibrated Bayes: A Bayes/Frequentist roadmap. Am Stat. 2006;60(3):213–223.
3. Berry DA. A case for Bayesianism in clinical trials. Stat Med. 1993;12(15–16):1377–1393; discussion 1395–1404.
4. Berry DA. Decision analysis and Bayesian methods in clinical trials. In: Thall PF, ed. Recent Advances in Clinical Trial Design and Analysis. Boston: Kluwer Academic Publishers; 1995:125–154.
5. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 1999.
6. Berger JO, Wolpert RL. The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics; 1984.
7. Thall PF, Wathen JK. Practical Bayesian adaptive randomisation in clinical trials. Eur J Cancer. 2007;43(5):859–866.
8. Anscombe FJ. Sequential medical trials (Com: p384–387). J Am Stat Assoc. 1963;58:365–383.
9. Armitage P. Sequential medical trials: some comments on F. J. Anscombe’s paper. J Am Stat Assoc. 1963;58(302):384–387.
10. Armitage P. The search for optimality in clinical trials. Int Stat Rev. 1985;53(1):15–24.
11. Bather JA. On the allocation of treatments in sequential medical trials. Int Stat Rev. 1985;53(1):1–13.
12. Royall RM. Ethics and statistics in randomized clinical trials. Stat Sci. 1991;6(1):52–62.
13. Thall PF, Sung HG. Some extensions and applications of a Bayesian strategy for monitoring multiple outcomes in clinical trials. Stat Med. 1998;17(14):1563–1580.
14. Thall PF, Simon RM, Estey EH. New statistical strategy for monitoring safety and efficacy in single-arm clinical trials. J Clin Oncol. 1996;14(1):296–303.
15. Thall PF, Simon RM, Estey EH. Bayesian sequential monitoring designs for single-arm clinical trials with multiple outcomes. Stat Med. 1995;14(4):357–379.
16. Thall PF, Cook JD. Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004;60(3):684–693.
17. Inoue LYT, Thall PF, Berry DA. Seamlessly expanding a randomized phase II trial to phase III. Biometrics. 2002;58(4):823–831.
18. Thall PF. A review of phase II/III clinical trial designs. Lifetime Data Anal. 2008;14(1):37–53.
19. Berry DA, Müller P, Grieve AP, et al. Adaptive Bayesian designs for dose-ranging drug trials. In: Gatsonis C, Carlin B, Carriquiry A, eds. Case Studies in Bayesian Statistics V. New York: Springer-Verlag; 2001:99–181.
20. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester, UK: Wiley & Sons; 2004.
21. DeGroot MH. Optimal Statistical Decisions. New York: McGraw-Hill; 1970.
22. Kadane JB, ed. Bayesian Methods and Ethics in a Clinical Trial Design. New York: Wiley & Sons; 1996.
23. Berry DA, Wolff MC, Sack D. Decision making during a phase III randomized controlled trial. Cont Clin Trials. 1994;15(5):360–378.
24. Carlin BP, Kadane JB, Gelfand AE. Approaches for optimal sequential decision analysis in clinical trials. Biometrics. 1998;54(3):964–975.
25. Stallard N, Thall PF. Decision-theoretic designs for pre-phase II screening trials in oncology. Biometrics. 2001;57(4):1089–1095.
26. Stallard N, Thall PF, Whitehead J. Decision theoretic designs for phase II clinical trials with multiple outcomes. Biometrics. 1999;55(3):971–977.
27. Rossell D, Müller P, Rosner GL. Screening designs for drug development. Biostatistics. 2007;8(3):595–608.
28. Ding M, Rosner GL, Müller P. Bayesian optimal design for phase II screening trials. Biometrics. 2008;64(3):886–894.
29. Chen M-H, Ibrahim JG. The relationship between the power prior and hierarchical models. Bayesian Anal. 2006;1(3):551–574.
30. Ibrahim JG, Chen MH. Power prior distributions for regression models. Stat Sci. 2000;15(1):46–60.
31. Carlin BP, Chaloner K, Church T, Louis TA, Matts JP. Bayesian approaches for monitoring clinical trials with an application to toxoplasmic encephalitis prophylaxis. Statistician. 1993;42(4):355–367.
32. Carlin BP, Chaloner KM, Louis TA, Rhame FS. Elicitation, monitoring, and analysis for an AIDS clinical trial (with discussion). In: Gatsonis C, Hodges JS, Kass RE, Singpurwalla ND, eds. Case Studies in Bayesian Statistics, Vol. II. New York: Springer-Verlag; 1995:48–89.
33. Chaloner K, Church T, Louis TA, Matts JP. Graphical elicitation of a prior distribution for a clinical trial. Statistician. 1993;42(4):341–353.
34. Bekele BN, Thall PF. Dose-finding based on multiple toxicities in a soft tissue sarcoma trial. J Am Stat Assoc. 2004;99(465):26–35.
35. Berry SM, Berry DA. Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model. Biometrics. 2004;60(2):418–426.
36. Bekele BN, Shen Y. A Bayesian approach to jointly modeling toxicity and biomarker expression in a phase I/II dose-finding trial. Biometrics. 2005;61(2):343–354.
37. de Lima M, Couriel D, Munsell M, et al. Pentostatin, tacrolimus, and “mini”-methotrexate for graft-versus-host disease (GVHD) prophylaxis: a phase I/II controlled, randomized study. Blood (ASH Annual Meeting Abstracts). 2004;104:727.
38. Xian Z, Suyu L, Kim ES, Herbst RS, Lee JJ. Bayesian adaptive design for targeted therapy development in lung cancer—a step toward personalized medicine. Clin Trials. 2008;5(3):181–193.
39. Maki RG, Wathen JK, Patel SR, et al. Randomized phase II study of gemcitabine and docetaxel compared with gemcitabine alone in patients with metastatic soft tissue sarcomas: results of Sarcoma Alliance for Research through Collaboration study 002 [corrected]. J Clin Oncol. 2007;25(19):2755–2763.
40. Giles FJ, Kantarjian HM, Cortes JE, et al. Adaptive randomized study of idarubicin and cytarabine versus troxacitabine and cytarabine versus troxacitabine and idarubicin in untreated patients 50 years or older with adverse karyotype acute myeloid leukemia. J Clin Oncol. 2003;21(9):1722–1727.
41. Bartlett RH, Roloff DW, Cornell RG, Andrews AF, Dillon PW, Zwischenberger JB. Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics. 1985;76(4):479–487.
42. O’Rourke PP, Crone RK, Vacanti JP, et al. Extracorporeal membrane oxygenation and conventional medical therapy in neonates with persistent pulmonary hypertension of the newborn: a prospective randomized study. Pediatrics. 1989;84(6):957–963.
43. Ware JH. Investigating therapies of potentially great benefit: ECMO (with discussion). Stat Sci. 1989;4(4):298–340.
44. UK Collaborative ECMO Trial Group. UK collaborative randomised trial of neonatal extracorporeal membrane oxygenation. Lancet. 1996;348(9020):75–82.
45. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon R. A comparison of two phase I trial designs. Stat Med. 1994;13:1799–1806.
46. Rosenberger WF, Haines LM. Competing designs for phase I clinical trials: a review. Stat Med. 2002;21(18):2757–2770.
15
The Trials and Tribulations of Writing an Investigator Initiated Clinical Study
Nicole P. Grant
Melody J. Sacatos
Wm. Kevin Kelly
Most clinical trials sponsored by industry or by specialized groups such as the National Institutes of Health cooperative groups come to the investigator through a well-defined route, with accompanying documentation structuring the conduct of the trial. Not so with the investigator-initiated clinical trial. Hence, trial and tribulation may be inevitable when you initiate your own clinical trial if you do not know what you need to do and how to plan for the trial. A clinical trial in health care may be defined as “a research study involving human subjects, designed to evaluate the safety and effectiveness of new therapeutic and diagnostic treatments” (www.hss.energy.gov/healthsafety/ohre/roadmap/achre/glossary.html). Others have defined a clinical trial as “a systematic investigation of the specific treatments according to a formal research plan in patients with a particular disease or class of diseases” (www.jhu.edu/wctb/coms/booklet/book5.htm). The Food and Drug Administration (FDA) defines a clinical trial as “any experiment in which a drug (or biologic) is administered or dispensed to, or used involving, one or more human subjects” (www.accessdata.fda.gov, CFR, Code of Federal Regulations, Title 21). Whether we are conducting an early phase I or a phase III study, all trials typically have in common: a hypothesis supported by scientific data, a method to conduct safe research, and a plan to evaluate the outcomes. A well-written clinical study should be a
free-standing document that can support the study rationale and be a complete roadmap for conducting the research. Writing a good clinical trial is both a science and an art that has evolved over the years. Due to increased regulatory concerns, along with more sophisticated and complicated therapies, clinical trials are becoming more complex. To aid in writing clinical trials, multiple templates have been developed for phase I, phase II, or phase III studies, and the details of the trial vary depending on whether the study is an institutional investigator-initiated trial, a cooperative group trial, or an industry-based trial. The basic principles and sections of a study remain constant, although the order and depth of involvement may vary between the type, phase, and sponsor of the study. For individuals who want to write a trial, the National Cancer Institute’s (NCI) Cancer Therapy Evaluation Program (CTEP) has made significant investments in the development of standardized protocol templates that cover a wide range of types of studies to facilitate the development of novel therapies by individual investigators. These templates can be accessed at http://ctep.cancer.gov/protocolDevelopment/templates_applications.htm, and all investigators interested in writing protocols should review this site (1). While the development of a protocol is driven by the principal investigator, this is a team effort made up of coinvestigators, statisticians, clinical research nurses, data managers, regulatory experts, and editors.
Coordinating all the required inputs to ensure a scientifically sound protocol that is feasible and compliant with all required external and internal regulations, policies, and best practices can be a time-consuming and sometimes tedious task. A designated protocol coordinator can be indispensable in collecting all of the essential perspectives and incorporating them into the protocol for the investigator to review so that the study may move forward in a timely and well-integrated manner. The protocol coordinator may be the research nurse, or a protocol developer, regulatory expert, or editor with a basic understanding and literacy of the science involved. Some academic medical centers offer specific expertise and assistance through their clinical trials office. More recently, computerized collaborative clinical trial writing systems have been developed that have facilitated the protocol writing process (2). Regardless, it is essential that there be a capable individual designated as the coordinator of the development effort and that all of the critical elements of a protocol are considered. This chapter will provide some basic principles on writing a therapeutic investigator-initiated study and will highlight some of the pitfalls and obscurities that may be encountered in writing a protocol. These principles can be applied to most clinical therapeutic trials regardless of the type, phase, or sponsor of the study. Table 15.1 is a list of the basic sections
that are required in most protocols, and we will review these sections in more detail below.

TABLE 15.1
Protocol Requirements.

· Title page
· Protocol schema or synopsis
· Table of contents
· Study objectives
· Background
· Patient selection (eligibility criteria)
  · Inclusion criteria
  · Exclusion criteria
· Registration procedures
· Treatment plan
· Dosing delays and dose modifications
· Adverse event reporting, safety monitoring, and quality assurance
· Pharmaceutical information
· Correlative and special studies
· Study calendar
· Measurement of effect or outcome measures
· Data reporting
· Statistical considerations
· References
· Informed consent
· Appendices
PREPARING TO WRITE A CLINICAL STUDY

Prior to actually writing the protocol, it is useful to prepare a brief protocol concept sheet or a letter of intent that can be presented to the research team for review. The concept sheet should give a brief rationale for the study, the primary objectives, study population, outcome parameters, sample size, and whether correlative studies are to be included. Most importantly, the concept sheet should explain why this study or concept is important and whether the risk/benefit ratio for the subject population involved is favorable. How is this research going to change current treatment practices? How do you build on the results of the trial? What is the next clinical trial if the results are positive? Is this study ethically sound? Does the study’s potential benefit(s) outweigh the potential risk(s) to subjects? It is always important to ask what the next step is and whether this study brings benefit to individual subjects or a population of subjects, or meaningfully increases scientific knowledge. If the study does not answer these questions, then you need to seriously consider why you are doing it. The research group should evaluate the concept critically and pose the following questions:

1. Are the objectives well defined and obtainable?
2. Is the study population well defined to answer the objectives?
3. Is the study population available to study this question? Include the estimated drop-out rate of those screened versus those enrolled and those expected to complete the study.
4. Do we have the appropriate statistical design and power to give confidence in the results? Do the objectives correlate with the outcome measures? Are the outcome measures well defined and measurable?
5. Are the correlative studies appropriate and doable?
6. What is the funding source of the trial, and is the funding sufficient to complete the study?
It is also important at this stage to consider whether the study will be investigational new drug (IND)-exempt or will require an application to the FDA to obtain approval to conduct the trial. When in doubt, a regulatory expert should be consulted, as investigator-sponsors have all the responsibilities of an industry sponsor, and this means that resources and systems for regulatory compliance must be available.
15 THE TRIALS AND TRIBULATIONS OF WRITING AN INVESTIGATOR INITIATED CLINICAL STUDY
Once you have addressed your questions internally, it is helpful to get opinions on the concept outside of your research group since outsiders may have different perspectives on the research and often will improve the study.
CRITICAL COMPONENTS OF AN INVESTIGATIONAL STUDY

Title Page

The title page is typically the last page written; however, it is one of the more crucial components of the document. The title page lists the title of the study, the local or group protocol number, and the coordinating center. Typically, the title includes the clinical trial phase; the overall design (such as randomized, double-blind); whether the study is a single-site or multicenter study; the name of the investigational product; the class of subject population; and the disease, condition, or disorder under study. Usually the long title is followed by a short protocol title that is used as the everyday identifier. The critical personnel for the study should be listed on the title page. For each study, there can be only one principal investigator, and in therapeutic studies this is typically a physician. The principal investigator has ultimate responsibility for the safe conduct of the study and the integrity of the data, and has oversight over all other research personnel in the study. Other individuals that need to be identified on the title page include coinvestigators, the statistician, the responsible research nurse, and the responsible data manager. It is often helpful to list the institutional contractual and financial point-of-contact. All contact information (address, telephone, fax, and e-mail) needs to be listed for each individual on the study. A 24-hour emergency number(s) and points-of-contact who can be reached any day of the week (including weekends), if such coverage is required by the nature of the study, should also be included. It is also recommended, but sometimes not required, to identify who may consent patients on the clinical trial. This is more critical in a single-institution, investigator-initiated trial. It is most important to have a very organized system to keep track of the versions of the protocol as you write and conduct the trial. It is very typical that you may have 5 to 10 versions of the study before the final document is produced, and during the trial you may have a half dozen or more amendments. In the footnote or the header of the protocol title page, the version number and version date should be listed and updated as protocols are changed or amended. The principal investigator or designated protocol coordinator should be the sole individual who has control over the protocol version and the version number. This will eliminate many difficulties as your team provides comments for the study or the study is amended in the future.

Abbreviations and Definitions
It is helpful to have a listing of some abbreviations and definitions following the title page. Common and basic terms do not have to be included, but esoteric terms relating to specific diseases or procedures should be. This will help expedite the review of the protocol through internal and external review and approval authorities, and help the members of your research team better understand the protocol.

Protocol Schema or Synopsis

Following the protocol title page, a study schema or synopsis is used to provide a quick overview of the study. Sometimes just a study schema is provided, which outlines the treatment regimen or treatment arms of the study. More commonly, a synopsis is provided, which presents a quick reference to the study. This should not be a total duplication of the protocol, but an abbreviated version highlighting the salient points of the study. It should be kept to a minimum, and most studies should be summarized in one or two pages. It is important to remember that the synopsis and the protocol body need to be congruent and not contradictory. Table 15.2 lists an example of a template for a protocol synopsis.

TABLE 15.2
Example of a Protocol Synopsis Template.

TITLE OF STUDY
Protocol Number
Sponsor/Investigator
Project Phase
Indication
Number of Study Center(s)
Study Design
Objectives
  Primary:
  Secondary:
  Correlative Studies:
Methodology
Subject Selection Criteria
Duration of Treatment
Number of Patients
Safety
Efficacy
Table of Contents

The table of contents should be a quick reference to locate the place in the protocol for specific and critical information, such as the eligibility criteria or treatment plan. The table of contents should include at least the major categories in the study, but may also include subsections for more specific details. Unless you have programs that link pages to the table of contents, page shifts often occur when you add or delete sections of the study, requiring changes in the page numbering in the table of contents. This is a time-consuming process, and some investigators advocate keeping the number of categories in the table of contents to a minimum. References and appendices should also be included in the table of contents.

Study Objectives

The study objectives in a protocol are similar to the specific aims in a research grant and should be very specific and concise. These should be “clearly defined and stated in a manner that will allow the objectives to be investigated by a quantitative assessment of appropriate outcomes” (3). They are typically divided into the primary objective, which the study is designed to answer, and secondary and tertiary objectives, which are considered to be more exploratory in nature. The following is an example of a primary objective that is vaguely written: “To determine if the administration of zoledronic acid to patients with newly diagnosed prostate cancer will improve the patient’s outcome.” This primary objective needs to be more specific about the primary outcome parameter that will be assessed in this study. A more concise objective would be: “To determine whether treatment with zoledronic acid at the time of initiation of androgen deprivation therapy for metastatic prostate cancer will delay the time to first skeletal-related event.” Some studies will also have correlative science objectives embedded within the study, and these should also be concisely defined.
Background

This is equivalent to the introduction in a research paper and puts the proposal in context. It should answer the questions of why and what: “Why does the research need to be done?” and “What will be its relevance?” A brief description of the most relevant studies published on the subject should be provided to support the rationale for the study. This section should include background information on the study drug and the disease being studied.
Depending on the type of study (e.g., a phase I study administering an IND), detailed prior laboratory and animal testing may need to be included, as well as justification for the proposed dosage. If the study involves the administration of an FDA-approved drug, whether the proposed use is within the approved labeling or is a new intended use, it should be described and justified. There should also be background information on the correlative studies, if applicable. Key components of this section include:
1. Description of the disease being studied and the problem being addressed
2. Description of current therapy (treatment regimen or medical management currently in wide use or prescribed by professional practice standards and/or approved by the FDA) and any shortcomings
3. Description of the drug and its activity
4. Description of any comparator drugs and justification for their use as the control
5. Justification for use of a placebo, if applicable
6. Summaries of studies conducted to date
7. Summary statement
The background section should be written in a concise manner; it is useful to summarize large amounts of data in tables. The background section should give a summary of the supportive literature; it does not have to include all the literature. This section should be limited to 3 to 5 pages. Be sure to identify any nonstandard assays, procedures, or materials proposed to be used and indicate the basis for their safety and reliability.
PATIENT SELECTION
Patient selection is one of the most critical components of the research plan, and each specific eligibility and exclusion criterion needs to be scrutinized. The conditions under which a patient is eligible to join the study need to be specifically stated. These would include references to pathologic diagnosis, prior therapies, age, performance status, and organ and marrow function. This section will also include criteria that make a patient ineligible for the study, such as treatment with other agents, allergies to the class of agent under study, pregnancy, brain metastases, or HIV-positive status. CTEP has developed guidelines that can be used during the protocol authoring process and that outline the inclusion of various populations (1). Note that if the patient selection criteria are too stringent, it is often
difficult to find such ideal patients; however, if the criteria are not stringent enough, the heterogeneity of your study population increases, compromising the endpoints of your study. Many groups have established acceptable inclusion and exclusion criteria for certain malignancies. However, these inclusion and exclusion criteria may vary depending on the study agents being used in the trial. For instance, studies using an antiangiogenesis agent such as bevacizumab may exclude patients with prior thromboembolic events within 1 year, because of the known risk for recurrent and serious thromboembolic adverse events in these patients. If an investigational drug brochure is available, important data relating to inclusion and exclusion criteria may be included. This section should also state under what circumstances subject participation in the study may be prematurely terminated; if a subject is prematurely terminated or withdraws voluntarily, describe the plan for documenting the occurrence, individual subject follow-up, and use of data or specimens obtained. If there are any safety considerations associated with premature termination or subject withdrawal before the completion of the study, state them under the Risks section of the protocol.
Centralized Institutional Registration Procedures
Procedures for patient registration vary between institutions and groups; you should contact your clinical trials office to understand the local process. Most will have a template of the procedure that can be used in your protocol. If more than one center is involved in the study, registration becomes a little more complex, and this may have already been addressed by your local institution. The basic registration procedure should be specifically outlined, step by step from beginning to end, with the appropriate delegation of responsibilities identified in the process.
One example of a registration procedure for an investigator-initiated multicenter study is outlined in Table 15.3.
TABLE 15.3
Institutional Centralized Registration Procedure Example.
To register a patient, the following documents should be completed by the research nurse or data manager and faxed (fax #) or e-mailed (e-mail address) to the study coordinator:
· Copy of required laboratory tests
· Signed patient consent form
· HIPAA authorization form
· Other appropriate forms (e.g., Eligibility Screening Worksheet, Registration Form)
The research nurse or data manager at the participating site will then call (phone #) or e-mail (e-mail address) the study coordinator to verify eligibility. To complete the registration process, the study coordinator will:
· Assign a patient study number
· Register the patient on the study
· Fax or e-mail the patient study number and dose to the participating site
· Call the research nurse or data manager at the participating site and verbally confirm registration

TREATMENT PLAN
The treatment plan should be complete, clearly written, and simple to follow. This section needs to include the eligibility screening procedures for study entry as well as the administration schedule for the study drug(s), outlining the specific dose, how it is administered, how long it is administered if an intravenous medication, the sequence of administration, the route of administration, and concomitant medications for safe administration of the drugs. Notations and nomenclature in the treatment plan should be consistent. Note that the NCI provides detailed instructions and a common standard in its Guidelines for Treatment Regimen Expression and Nomenclature (http://ctep.cancer.gov/protocolDevelopment/policies_nomenclature.htm). This standard advises:
· Drug names should not be abbreviated; use complete generic drug names.
· All details of the treatment regimen should be provided, but do not duplicate these details throughout the protocol.
· All units and schedule abbreviations, such as mg/m2, q8h, or TID, should be used consistently throughout the protocol. When expressing units, spell out the word "units"; do not use abbreviations, such as U.
· Do not trail a whole number with a decimal point followed by a zero (e.g., 5 mg, not 5.0 mg); similarly, in expressing quantities less than 1, the dosage should be written with the decimal preceded by a zero (e.g., 0.125 mg, not .125 mg).
Additionally, the following should be considered when writing the Treatment Plan:
· The protocol should specify if the medications are dosed based on actual, ideal, or lean body weight. In addition, the study should indicate if the dose of
the study medication(s) needs to be adjusted each treatment or cycle based on change in weight, if applicable.
· The number of days of treatment, the day(s) of treatment, and the duration of treatment should be clearly identified. An example is illustrated in Table 15.4.
· The route of administration should be clarified. For an oral medication, there need to be clear instructions on whether it is to be taken under fasting or nonfasting conditions. With oral medications, it also is important to indicate whether any other medications or foods need to be avoided while taking the study medication. For instance, the consumption of grapefruit juice can interfere with the clearance of certain medications (e.g., ketoconazole) that are metabolized by the CYP3A4 enzymes. The medications to be avoided while on the study medication(s) should be listed; if the list is lengthy, reference it in the protocol and provide it as an appendix.
· If an intravenous medication is involved, indicate whether it can be administered through a peripheral vein or needs to be given through a central catheter; whether the medication needs to be shielded from light; whether there are incompatible fluids or drugs that cannot be administered through the same intravenous line; and any other potential problems or conditions that optimize the drug administration.
· More details on writing the treatment regimen can be found at the NCI Web site cited above (4).
If part or all of the study takes place on an outpatient basis, indicate methods to check on subject compliance (e.g., pill counts, plasma concentration monitoring, questionnaires). When randomization is employed, indicate who will generate the code and how the code will be generated, who will have access to the identifiers in the code, and the conditions and methods for breaking the code in case of an emergency. Subjects participating in a blinded study are often provided with wallet cards describing how their status may be disclosed if an emergency occurs. Phase I studies will also include a section on the dose escalation rules. The maximum tolerated dose (MTD) and dose-limiting toxicity (DLT) should be well defined for the study. This should define the types and grades of adverse events that will be considered intolerable or dose limiting. Below are examples of how the MTD and DLT can be defined:
· The Maximum Tolerated Dose (MTD) is defined as the highest dose level with an observed incidence of DLT in no more than 1 out of 6 patients.
TABLE 15.4
Typical Drug Administration Schedule.

                          Day 1   Day 2   Day 3   Day 4   Day 5   Days 6–21
Dexamethasone               X       X       X                       Rest
Docetaxel                           X                               Rest
Bevacizumab or Placebo              X                               Rest
Prednisone 10 mg            X       X       X       X       X       Daily

Continue treatment until disease progression or unacceptable toxicity. 1 cycle = 21 days.
Days 1–3 of each cycle: Premedicate patients with dexamethasone 8 mg orally twice per day, starting the day prior to docetaxel administration and continuing until the day after the docetaxel dose (i.e., 6 doses over 3 days).
Day 2 of each cycle (every 21 days): Docetaxel 70 mg/m2 intravenously over 1 hour. Placebo (Arm A): IV every 21 days, administered after docetaxel; initial dose given over 90 minutes, second dose over 60 minutes, and all subsequent doses over 30 minutes if prior infusions are tolerated without infusion-associated adverse events. Bevacizumab (Arm B): 15 mg/kg IV every 21 days, administered after docetaxel; initial dose given over 90 minutes, second dose over 60 minutes, and all subsequent doses over 30 minutes if prior infusions are tolerated without infusion-associated adverse events.
Day 1 until completion of the study: Prednisone 10 mg po daily.
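As an illustration of how the body-size-based doses in a schedule like Table 15.4 are computed, the following Python sketch uses the Mosteller formula, one of several common body surface area (BSA) formulas; a real protocol should state which formula applies and any dose-rounding or capping rules. The function names are hypothetical.

```python
import math

def bsa_mosteller(height_cm: float, weight_kg: float) -> float:
    """Body surface area (m^2) by the Mosteller formula."""
    return math.sqrt(height_cm * weight_kg / 3600.0)

def docetaxel_dose_mg(height_cm: float, weight_kg: float,
                      dose_per_m2: float = 70.0) -> float:
    """Per-cycle docetaxel dose in mg at a level of 70 mg/m2 (BSA-based)."""
    return dose_per_m2 * bsa_mosteller(height_cm, weight_kg)

def bevacizumab_dose_mg(weight_kg: float, dose_per_kg: float = 15.0) -> float:
    """Per-cycle bevacizumab dose in mg at 15 mg/kg (weight-based)."""
    return dose_per_kg * weight_kg

# Example: a patient 170 cm tall weighing 70 kg
print(round(docetaxel_dose_mg(170, 70), 1))   # BSA ~1.82 m^2 -> 127.3
print(bevacizumab_dose_mg(70))                # 1050.0
```

Note how the protocol rule on dose adjustment applies here: because both doses depend on weight, the protocol must state whether they are recalculated each cycle as weight changes.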
· Dose-Limiting Toxicity (DLT) in a patient is defined as grade 3 or greater toxicity using the NCI Common Toxicity Criteria (Appendix B) during the initial cycle of treatment.
There continues to be debate about the most appropriate dosing schedules for phase I trials, but historically a "3 + 3" dose escalation design has been used in phase I studies (Table 15.5) (5). More recently, other dose escalation schemas, such as the continual reassessment method, have gained popularity (6). It is beyond the scope of this chapter to go into the details of these trial designs, but there have been multiple published reviews on the subject. The primary endpoint of a phase I study is not always to push the drug(s) to intolerable toxicity; rather, it may be to define a biologic endpoint. If this is the primary objective, these endpoints need to be specifically identified in the protocol. For more details the reader is referred to Chapter 8 on the design of phase I studies.
DOSING DELAYS AND DOSING MODIFICATION
The investigational study should be very explicit about when and how a study medication should be held or modified. While not every complication of the study medications can be covered, the major complications should have a detailed schema outlined for the study staff. For example: "In patients with NCI grade 3 or 4 neutropenia, the study medication will be held until the neutropenia is grade 1 or less. The dose is then reduced by 25% for the subsequent cycle." Investigators should provide explicit definitions of the type(s), grade(s), and duration(s) of adverse events that will be considered dose-limiting toxicity(ies), or provide definitions of other endpoints that will be used to determine dose modifications in this section. The NCI protocol templates have well-written samples, and more complex examples may be found at the following Web site: http://linus.nci.nih.gov/~brb/Methodologic.ht.
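A rule like the neutropenia example above is precise enough to express as executable logic, which can be a useful check that a dose-modification schema is unambiguous. The following Python sketch implements only that one example rule; the function name and return conventions are hypothetical.

```python
from typing import Optional

def dose_for_next_cycle(prior_dose_mg: float,
                        worst_neutropenia_grade: int,
                        current_grade: int) -> Optional[float]:
    """Next-cycle dose under the example rule: grade 3-4 neutropenia
    holds the drug until recovery to grade <= 1, after which the dose
    is reduced by 25%. Returns None while the drug must remain held."""
    if worst_neutropenia_grade < 3:
        return prior_dose_mg          # no dose-limiting neutropenia: no change
    if current_grade <= 1:
        return prior_dose_mg * 0.75   # recovered: resume at a 25% reduction
    return None                       # grade still > 1: keep holding

print(dose_for_next_cycle(100.0, 4, 1))   # 75.0
print(dose_for_next_cycle(100.0, 2, 2))   # 100.0
print(dose_for_next_cycle(100.0, 3, 2))   # None
```

Writing the rule this way exposes the questions a protocol must answer explicitly: what happens after a second episode, whether re-escalation is permitted, and how many reductions are allowed before the patient comes off study.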
DATA AND SAFETY MONITORING PLAN
The plan should reflect the unique nature of the study and should be commensurate with its risk, size, and complexity. It should describe how the principal investigator will ensure that the study is monitored in terms of safety and the generation and analysis of quality data, and how he/she intends to provide ongoing supervision of the study, including assessing whether appropriate progress is being made. Assuring the safety of every individual in the study is the primary responsibility of the principal investigator, who is also responsible for keeping the appropriate
TABLE 15.5
Dose Escalation Schema for Phase I Studies.

Number of Patients with DLT at a Given Dose Level → Escalation Decision Rule
· 0 out of 3: Enter 3 patients at the next dose level.
· ≥2: Dose escalation will be stopped. This dose level will be declared the maximally administered dose (highest dose administered). Three (3) additional patients will be entered at the next lowest dose level if only 3 patients were treated previously at that dose.
· 1 out of 3: Enter at least 3 more patients at this dose level. If 0 of these 3 patients experience DLT, proceed to the next dose level. If 1 or more of this group suffer DLT, then dose escalation is stopped, and this dose is declared the maximally administered dose. Three (3) additional patients will be entered at the next lowest dose level if only 3 patients were treated previously at that dose.
· ≤1 out of 6 at the highest dose level below the maximally administered dose: This is generally the recommended phase II dose. At least 6 patients must be entered at the recommended phase II dose.
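The per-cohort decision rules of Table 15.5 can be sketched as a small function; this is a minimal illustration of the classic 3 + 3 logic only (the function name is hypothetical, and a real design adds constraints such as a prespecified set of dose levels and the 6-patient confirmation at the recommended phase II dose).

```python
def escalation_decision(n_treated: int, n_dlt: int) -> str:
    """Decision for one dose-level cohort under the 3 + 3 rules of Table 15.5."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"   # enter 3 patients at the next dose level
        if n_dlt == 1:
            return "expand"     # enter at least 3 more patients at this level
        return "stop"           # maximally administered dose; de-escalate
    if n_treated == 6:
        # after expansion: <= 1/6 with DLT tolerates the dose; >= 2/6 stops
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("3 + 3 cohorts are evaluated after 3 or 6 patients")

print(escalation_decision(3, 0))   # escalate
print(escalation_decision(3, 1))   # expand
print(escalation_decision(6, 2))   # stop
```

Walking the table through code like this makes the implicit rule visible: a dose is carried forward only while the observed DLT rate stays at or below 1 in 6.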
regulatory review boards cognizant of any increased risk to the patient. The principal investigator is also responsible for ensuring the integrity of the data. Data and Safety Monitoring Plans typically address adverse events, safety monitoring, and quality assurance. Detailed guidance is available from the NCI's Web site (http://www.cancer.gov/clinicaltrials/conducting/dsm-guidelines). The Adverse Events section would include the recording and reporting requirements of your local institutional review board (IRB), along with any required reporting to regulatory agencies or any other organization involved, such as a pharmaceutical company supplying the drug. What is most important is a plan demonstrating that applicable government regulatory and local institutional requirements will be fulfilled, that the plan will assure the best possible protection for human subjects that is reasonable and feasible, and that it is likely to preserve public trust in the conduct of clinical trials. The adverse events section should address how to identify and then collect, record, and report adverse events, as well as describe the expected follow-up. Additionally, the Adverse Events section should address:
· The expected adverse events for each drug, with reference to supporting documentation, such as the FDA labeling or the investigational drug brochure.
· The quality control (within the research team) and quality assurance (external to the functioning of the research team) program and procedures for the conduct of the study, including demonstrated incorporation of good clinical practice (GCP) (7), monitoring by external monitors, use of a Data Safety and Monitoring Board (if applicable), and procedures for ensuring the timeliness and quality of the data (8).
· Definitions of what constitutes a nonserious adverse event versus a serious adverse event.
· The procedures by which all adverse events, regardless of severity, attribution, and reporting requirements, will be identified, tracked, monitored, and addressed.
PHARMACEUTICAL INFORMATION
The pharmaceutical section should include information about all investigational drugs and ancillary medications used in the study. A section for each agent to be used in the trial should be included with the following information:
1. Product description: Include the available dosage forms, ingredients, packaging, and labeling, as appropriate. Also state the agent's supplier.
2. Solution preparation (how the dose is to be prepared): Include reconstitution directions and directions for further dilution, if appropriate.
3. Storage requirements: Include the requirements for the original dosage form, reconstituted solution, and final diluted product, as applicable.
4. Investigational agents: Address stability, purity, and pyrogenicity and adherence to FDA quality assurance manufacturing requirements.
5. FDA-approved agents: Include the stability of the original dosage form, reconstituted solution, and final diluted product, if applicable.
6. Route of administration: Include a description of the method to be used and the rate of administration, if applicable. Describe any precautions required for safe administration.
Information on availability, ordering, and accountability should be included, along with procedures for receipt, use, and eventual disposition.
Correlative and Special Research Studies
Not all studies will have a Correlative Studies section. This section can describe a research activity as simple as evaluating a biomarker, such as obtaining a tube of blood to measure baseline vascular endothelial growth factor or a drug level to correlate with clinical outcome. In other instances it can be a very complicated set of tests, such as serial positron emission tomography (PET) scans or pharmacokinetic studies. Regardless of the studies being performed, this section needs to provide the specific aims and rationale for the studies and detail the specifics regarding the handling, preservation, and storage of the tissue or blood to be used; the procedures and addresses for shipping bio-specimens; and the methods, timing, and procedures for other tests involving the patient. It is particularly critical to understand the biology of the bio-specimen that you are obtaining.
For instance, certain proteins in the blood may break down if left at room temperature; to preserve such a protein, the plasma needs to be separated immediately and then frozen at −80ºC within 30 minutes. This will require onsite processing, which needs to be planned and budgeted for in the protocol; if not, alternative procedures need to be developed. Information on endpoint validation, including additional background (as needed), description of the assay(s) used, materials and methods, and assay validation, should be provided in an appendix. If the sample is going to be shipped to have the test performed at another location, the protocol needs to include packing instructions (i.e., shipped with dry ice), the shipping address, and the times when the sample can be shipped. Many bio-specimens have been shipped
overnight on dry ice on a Friday afternoon, only to find that the recipient does not receive weekend deliveries, thus degrading the integrity of the specimen.
STUDY CALENDAR
This is one of the most useful and referenced sections of the research protocol, and it is prudent to ensure it is complete and reflects your study accurately. The study calendar is a quick reference that includes all patient activities, tests (including correlative studies), and outcome measures that will occur before, during, and after the study. Table 15.6 is an example of a study calendar. It is most useful to complete the study calendar first, early in the protocol writing process, since it will help organize your thoughts and allow you to refer back to it as you write the other sections to ensure consistency throughout the protocol. The study calendar will also help to generate the study budget and highlight data collection requirements. If it is completed and verified early in the protocol process, this can help expedite the budget, the contract, the composition of the data collection instruments and case report forms, and ultimately the opening of the study.
MEASUREMENT OF EFFECT OR OUTCOME MEASURES
All trial endpoints should be defined in this section and correlate with the specific aims of the study. For example, in phase II studies, changes in tumor volume or a delay in progression of the cancer could be typical endpoints. For studies evaluating changes in tumor masses seen on radiographs, the Response Evaluation Criteria in Solid Tumors (RECIST) are the response criteria typically utilized in solid tumors (9). RECIST has been the standard for patients with objectively measurable disease; the complete definition of the response criteria can be found at http://ctep.cancer.gov/protocolDevelopment/docs/quickrcst.doc.
However, other criteria may be used, such as the Cheson criteria for lymphomas (10) or the International Uniform Response Criteria for Multiple Myeloma (11). Other studies may use outcome measures that include progression-free survival (PFS) or overall survival; however, the definition of each of these needs to be detailed.
DATA MANAGEMENT
The protocol should provide information on how the data will be managed, including data handling and coding for computer analysis, monitoring, and verification.
Multicenter guidelines would also be included in this section, if applicable. Case histories must be prepared and maintained in such a manner that there is a systematic and accurate record of all observations and other data pertinent to the study for each individual subject. For the investigator initiating a study this poses a challenge, as he/she must anticipate the data that should be collected and maintained in data collection forms and case report forms. An "include it all" approach unnecessarily burdens resources and may obscure critical data, although an overly minimalist approach may preclude the collection and analysis of essential data. A good reference and collection of templates for case report forms may be found in the Manual for the Completion of the NCI/DCTD/CTEP Clinical Trials Monitoring Service Case Report Forms, prepared by Theradex® (http://www.theradex.com/CTMS/CTMS_CRF_Manual_313.pdf). This section should describe the procedures for data collection and the data collection forms that will be used for collecting, storing, and processing study data, including outlining specific case report forms. The protocol should clearly identify the timing of when the data will be recorded, revised, and stored; which category of staff member(s) will be entering the data; the proposed validation and quality control methods (e.g., cross validation) that are going to be used in the study; and the source documentation required. All case report forms, data collection forms, and checklists (such as eligibility checklists and checklists for specific visits) should be attached as appendices. Whenever possible, the documents should be formatted in a standard manner and must include identifying headers, dates, and the initials or signatures of those individuals recording or revising data on them. Remember that federal legislation and the publication policies of many professional journals require online registration of the study in ClinicalTrials.gov. Posting of clinical trial results is also required by federal legislation. Investigators must be familiar with these requirements and plan for the collection and maintenance of appropriate data for reporting purposes.
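The validation and quality control methods a data management plan specifies are, in practice, automated edit checks run against each record. The Python sketch below shows what such checks might look like; the field names and rules are hypothetical and not drawn from any NCI template.

```python
def check_crf_record(record: dict) -> list:
    """Run simple completeness, range, and cross-field checks on one record."""
    errors = []

    # Completeness check: every record must identify its subject
    if not record.get("patient_id"):
        errors.append("patient_id missing")

    # Range check: adverse event grades run 1-5 in common toxicity grading
    grade = record.get("ae_grade")
    if grade is not None and grade not in (1, 2, 3, 4, 5):
        errors.append("ae_grade out of range 1-5")

    # Cross validation: ISO-format dates compare correctly as strings
    visit, entered = record.get("visit_date"), record.get("entry_date")
    if visit and entered and visit > entered:
        errors.append("visit_date later than entry_date")

    return errors

print(check_crf_record({"patient_id": "001", "ae_grade": 3,
                        "visit_date": "2009-05-01",
                        "entry_date": "2009-05-02"}))   # []
```

Checks like these are most valuable when defined at the same time as the case report forms, so that every collected field has an agreed valid range before the first patient is entered.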
CONDUCTING AN INVESTIGATOR-INITIATED MULTISITE STUDY
When serving as lead principal investigator for a multisite study, a plan documenting how management and coordination responsibilities will be carried out should be provided. For relatively simple and minimal-risk studies, this plan may be presented within the
TABLE 15.6
Study Calendar Example. Schedules shown in the study calendar below are provided as an example and should be modified as appropriate. Baseline evaluations are to be conducted within 1 week prior to the start of protocol therapy. Scans and X-rays must be done ≤4 weeks prior to the start of therapy. In the event that the participant's condition is deteriorating, laboratory evaluations should be repeated within 48 hours prior to initiation of the next cycle of therapy.

Procedure                      Schedule
Study Agent (A)                Administered during Wks 1–12 per the assigned administration schedule
Other Agent(s) (B)             Administered during Wks 1–12 per the assigned administration schedule
Informed consent               Pre-study
Demographics                   Pre-study
Medical history                Pre-study
Concurrent meds                Pre-study; recorded continuously through Wk 12
Physical exam                  Pre-study and at intervals during treatment
Vital signs                    Pre-study and at intervals during treatment
Height                         Pre-study
Weight                         Pre-study and at intervals during treatment
Performance status             Pre-study and at intervals during treatment
CBC w/diff, plts               Pre-study and at intervals during treatment
Serum chemistrya               Pre-study and at intervals during treatment
EKG (as indicated)             Pre-study
Adverse event evaluation       Monitored continuously from Wk 1 through Wk 12
Tumor measurements             Pre-study; repeated every [# weeks] weeks; at off-study evaluationc. Documentation (radiologic) must be provided for patients removed from study for progressive disease.
Radiologic evaluation          Pre-study; performed every [# weeks] weeks; at off-study evaluationc
B-HCGb                         Pre-study
Other tests, as appropriate    As clinically indicated
Other correlative studies      Per protocol

A: Study Agent: Dose as assigned; administration schedule
B: Other Agent(s): Dose as assigned; administration schedule
a: Albumin, alkaline phosphatase, total bilirubin, bicarbonate, BUN, calcium, chloride, creatinine, glucose, LDH, phosphorus, potassium, total protein, SGOT [AST], SGPT [ALT], sodium
b: Serum pregnancy test (women of childbearing potential)
c: Off-study evaluation
protocol. Complicated and/or moderate- to high-risk studies should most likely have a standalone manual of procedures (MOP) included as an appendix. The MOP should detail the programmatic structure and procedural requirements across institutional lines of authority and responsibilities
to ensure optimal safety of subjects, regulatory compliance, and protocol-specific consistency. Minimal elements include: · Means by which to document approval for start-up of research activity at the participating site
· Confirmation of applicable regulatory requirements, such as having current federalwide assurances and local IRB approvals (as applicable)
· Provisions to ensure the qualification and training of participating site personnel and documented agreements for collecting, maintaining, and distributing research data and timelines for such, including procedures for central registration at the principal investigator's site and, especially important, reporting of serious adverse events
· Monitoring procedures to verify compliance and consistency in all activities of the study
· Documented acceptance of the protocol and all updates by the participating site
STATISTICAL CONSIDERATIONS
Prior to writing the study, the investigators should meet with the statistician to discuss the study, its primary endpoints, the appropriate statistical design, and the methods used for evaluating the outcomes of the study. This section should clearly provide the rationale for the statistical design; justification for the sample size selected; the power of the study and level of significance to be used; the procedures for accounting for any missing or false data; and the specific methods to evaluate the outcome measures. A statistical section for any correlative studies should also be included. If the correlative studies are only exploratory, then this should be stated. Other chapters in this book outline the statistical designs of studies in more detail, and the reader should refer to those sections.
REFERENCES
A protocol is a scientific document, and appropriate references should support the research plan and other factual statements in the protocol.
SECTION ON PROTECTION OF HUMAN SUBJECTS
Means to minimize risk and enhance protections for participants in clinical trials should be included in the protocol. Minimal issues that should be addressed include:
· Recruitment methods
· Continuation of benefit (if the investigational treatment proves beneficial)
· Consent process (including parental permission for minors, surrogate permission for unable-to-consent adult subjects, and adolescent and child assent, as appropriate)
· Subject payments (if any)
· Cost to subject for participation
· In-case-of-injury procedures
· Subject confidentiality and privacy
Note that the collection and use of human biomaterial (parts of a subject) deserves attention in the protocol. Aside from retrieval methods, whether specimens are associated with individually identifiable information and the potential risks associated with the data accompanying the specimen primarily drive the level of concern within a subject protection context. Genetic germ-line testing that may reveal a proclivity for disease or illness, for example, when associated with individual subject identifiers, requires diligence in minimizing the risks of health insurance discrimination, loss of employability, or family/subject distress. Means to minimize such risks should be included.
INFORMED CONSENT
The informed consent is a critical component of any investigational protocol and describes the risks and benefits of the research study. This is covered in detail in Chapter 35, and one should refer to that chapter for more details on writing an informed consent.
APPENDICES
The appendices should contain any additional information that will help conduct the study. For instance, the Eastern Cooperative Oncology Group (ECOG) performance status scale may be included to ensure that all investigators consistently report the performance status, as well as other information, such as the detailed process for collecting correlative blood samples or an oral medication log. Typical appendices may include:
· Data collection sheets
· Case report forms
· Screening and enrollment log
· Checklists to ensure consistency and compliance
· Drug accountability log
· Manual of procedures for complex and/or multisite studies
· Signature and delegation of duty/authority log
· Performance status criteria
· Consent form template
· NCI CTC 3.0
· Drug-drug interaction lists
· Quality of Life or other questionnaires
· Study diaries or patient drug accountability logs · Eligibility checklists
CONCLUSION
Writing a clinical study is a dynamic process that continues from the inception of the study to the final analysis of the data. One cannot expect to write a perfect study that addresses all the unforeseen obstacles and scenarios that present themselves to the investigators, so the research team needs to meet on a regular basis to review and update the protocol to address the issues that arise. This takes considerable time and effort from the principal investigator to ensure it occurs on a timely basis, so that it does not hinder study accrual or patient safety. It is always important that patients' safety comes first in any study, and it is the study team's responsibility to ensure that it does.
References
1. Protocol templates and guidelines. Cancer Therapy Evaluation Program. 2008. (Accessed May 2009 from http://ctep.cancer.gov/protocolDevelopment/templates_applications.htm#policiesAndGuidelines.)
2. Weng C, Gennari JH, McDonald DW. A collaborative clinical trial protocol writing system. Stud Health Technol Inform. 2004;107(Pt 2):1481–1486.
3. Gebski V, Marschner I, Keech AC. Specifying objectives and outcomes for clinical trials. Med J Aust. 2002;176(10):491–492.
4. Guidelines for Treatment Regimen Expression and Nomenclature. (Accessed May 2009 from http://ctep.cancer.gov/protocolDevelopment/policies_nomenclature.htm.)
5. Eisenhauer EA, O'Dwyer PJ, Christian M, Humphrey JS. Phase I clinical trial design in cancer drug development. J Clin Oncol. 2000;18(3):684–692.
6. O'Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990;46(1):33–48.
7. International conference on harmonization of technical requirements for registration of pharmaceuticals for human use. 1996. (Accessed May 2009 from http://www.ich.org/LOB/media/MEDIA482.pdf.)
8. Guide for writing a research protocol for research involving human participation. World Health Organization. 2008. (Accessed October 11, 2008, from http://www.who.int/rpc/research_ethics/guide_rp/en/index.html.)
9. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors. European Organization for Research and Treatment of Cancer, National Cancer Institute of the United States, National Cancer Institute of Canada. J Natl Cancer Inst. 2000;92(3):205–216.
10. Cheson BD, Pfistner B, Juweid ME, et al.; International Harmonization Project on Lymphoma. Revised response criteria for malignant lymphoma. J Clin Oncol. 2007;25(5):579–586.
11. Durie BG, Harousseau JL, Miguel JS, et al.; International Myeloma Working Group. International uniform response criteria for multiple myeloma. Leukemia. 2006;20(9):1467–1473.
16

Data Collection

Eleanor H. Leung
The data collection process in clinical trials has been extensively described from the perspectives of statisticians and data managers (1–9). As a result, a common but erroneous assumption is that once a protocol is written, the statisticians and data managers assigned to a study will take care of the data collection. In reality, the data collection process is a complex team effort, and the primary purpose of this chapter is to show that the principal investigator also plays a critical role—during the forms design stage in particular. The principal investigator will be asked to:
· Verify that all the key data elements needed to meet study objectives will be captured in case report forms.
· Specify how often adverse events, laboratory, imaging, and clinical assessments have to be performed to obtain valid outcome measures.
· Identify the critical variables that will require verification with source documentation.
The principal investigator will be performing these tasks early during the forms development process, when the case report forms (CRFs) are being designed. Since these forms must reflect the entry, treatment, and follow-up requirements set by the protocol, it may be tempting to postpone forms review until a protocol is finalized and approved. However, because the forms development team involves so many parties working
behind the scenes, preliminary forms review is crucial. Fortunately, the initial set of CRFs the principal investigator is asked to review will already include many of the key data elements that need to be captured. These preliminary drafts will have been gleaned from actual forms used in previous studies investigating similar agents and treatment regimens, as well as from National Cancer Institute (NCI) form templates developed as part of the NCI cancer Biomedical Informatics Grid (caBIG) and Common Data Elements (CDE) initiatives, which seek to make datasets across phase III cancer clinical trials comparable and thus facilitate multidisciplinary data sharing across institutions. This chapter is divided into two sections. The first section, the more extensive of the two, will focus on the role the principal investigator plays in determining the content of CRFs. The second section will discuss quality control procedures used by data management and statistical teams to ensure the completeness and accuracy of collected data.
COLLECTING KEY DATA VARIABLES

Because only the data captured in CRFs will be entered into the database and form the basis of the final report, the responsibility of ensuring that all the key data variables needed to meet study objectives are collected prospectively in the CRFs will rest with the principal
investigator. Let’s examine how this responsibility is carried out with four categories of data in clinical trials: patient characteristics, treatment compliance, adverse events, and outcome variables.

Screening and Baseline Patient Information

Patient characteristics that could serve as prognostic and predictive variables of outcome are collected at study entry using two types of case report forms: the registration form and the on-study form. Patient demographics (i.e., height, body surface area, age, gender, race, ethnicity, zip code and country of residence, performance status, and medical payment method) are captured in a standard registration form used by all NCI cooperative groups. The registration form also collects the following information:
· Patient identifiers—initials, treating institution and physician, medical record number, and social security number. The latter is optional but recommended to help track survival status via the Social Security Death Index or local tumor registries if patients don’t return for follow-up visits. Patient identifiers are always stored in a secure, limited-access database separate from the clinical research database.
· Regulatory information, such as institutional review board (IRB) approval and patient informed consent dates.
· Enrollment information, such as date(s) of registration and randomization, projected treatment start date, stratification factors, and treatment assignment.
· The assigned unique patient identification number, linking all data collected for a patient from baseline through final outcome.
Although not a CRF in the strictest sense because it is not entered into the database, the study eligibility checklist deserves a brief discussion.
Using an easy-to-follow format, it lists inclusion criteria (i.e., disease stage and histologic type, prior treatments allowed, performance status, age range, gender, and acceptable range of laboratory values) and exclusion criteria (i.e., banned prior therapies, treatments and medications to be discontinued, and preexisting medical conditions). Because the eligibility checklist is used as a screening tool, it should also provide thorough explanations of any eligibility prerequisites that are difficult to understand. For instance, now that electronic nomogram calculators are increasingly being used to stratify patients prior to randomization, the eligibility checklist should include instructions for accessing these nomogram calculators and
stipulate the values required for the nomogram calculation. The checklist should emphasize that unavailability of any of these values would preclude patient enrollment. Unlike the eligibility checklist, which establishes minimum entry criteria, the on-study form captures information about a patient’s actual disease status and disease extent at baseline, prior treatments, preexisting comorbidities, and current medications. The on-study form deserves close scrutiny; it should capture baseline values of every variable that might affect whether or not a patient would benefit from the experimental treatment under investigation. The fact that the NCI has developed over 50 disease-specific on-study form templates for use in patients with early-stage (previously untreated/adjuvant/localized) vs. advanced (metastatic/recurrent/nonlocalized) disease is further evidence that the on-study form must be carefully customized to meet study requirements. To illustrate this point, let’s compare the on-study information collected in CALGB 90401 to that in the NCI on-study form template. CALGB 90401 is a randomized phase III closed trial comparing docetaxel and prednisone treatment with and without bevacizumab in men with hormone-refractory prostate cancer. Both on-study forms collect Gleason grades and sum from the initial diagnostic biopsy specimen, start and end dates of all prior therapies, metastatic/recurrent sites, presence/absence of measurable and non-measurable disease, and baseline prostate-specific antigen (PSA) value at study entry.
However, the CALGB 90401 study team requested that the following supplementary information also be captured at study entry:
· Measures of bone disease extent in terms of the number of bone lesions present in the baseline bone scan (i.e., no bone disease, <6, 6–20, >20 bone lesions, or superscan) (10)
· Baseline laboratory values (absolute neutrophil count, platelet count, bilirubin, creatinine, aspartate transaminase, and urine protein to creatinine ratio) confirming the patient could proceed safely to treatment, plus those (hemoglobin, alkaline phosphatase, and lactate dehydrogenase) that have previously been found to be of prognostic significance in evaluating treatment of castration-resistant prostate cancer (11)
· Information about patient prior histories of myocardial infarction, coronary artery bypass graft, cardiac stent/angioplasty, coronary artery disease, hypertension, thrombosis, vascular disease, and cardiovascular medications, as these factors have previously been shown to increase the risk of arterial thromboembolic events (12)
· Three additional pre-study PSAs to verify that patients enrolling on the basis of biochemical progression met the disease progression criteria set by the Prostate-Specific Antigen Working Groups I/II (13–14)
Thus, the principal investigator is faced with difficult decisions early on. Achieving the proper balance between too much information and not enough will be a predicament the principal investigator will repeatedly face during the forms design process.

Treatment Compliance

To confirm that study treatment is administered per protocol, standard treatment forms have been created to collect dosages, dose delays and reductions, and dose modification reasons (due to toxicity, drug order delays, scheduling issues, etc.) for the treatment agent(s) on a cycle-by-cycle basis, often without considering whether all this information is actually necessary. To avoid this common pitfall, data collection should be governed by the information that will be reported in the final treatment summary table, namely:
· Number of patients treated with each agent (or combination of agents)
· Median number of cycles administered (and range)
· Percent of patients experiencing dose delays
· Percent of patients experiencing dose reductions
· Percent of major protocol deviations
· Percent of patients who completed the maximum treatment allowed vs. those who discontinued treatment early due to disease progression, adverse event, consent withdrawal, death, or other reason
In addition, the principal investigator will also need to:
· Be familiar with the definitions of major vs. minor deficiencies used by the Clinical Trials Monitoring Branch (CTMB) of the Cancer Therapy Evaluation Program (CTEP) of the NCI during institutional audits (15). When classifying treatment deviations as major or minor, investigators should be consistent with NCI CTMB guidelines.
NCI CTMB defines a major deficiency as “a variance from protocol-specified procedures that makes the resulting data questionable.” Examples of major treatment deficiencies are:
· Giving a blinded drug assigned to another patient, or a commercial agent instead of the blinded drug supplied by the NCI, or omitting one of the agents in a particular cycle
· Continuing to administer study treatment when the protocol requires discontinuation, as when progression criteria have been met or unacceptable toxicity occurs
· Making dose deviations or modifications leading to dosing errors greater than +/− 10%
· Modifying doses in ways not prescribed in the protocol
· Delaying treatment to provide drug holidays not allowed per protocol
A minor or lesser deviation is “a deficiency that is judged to not have a significant impact on the outcome or interpretation of the study and is not described above as a major deficiency. An unacceptable frequency of lesser deficiencies should be treated as a major deficiency in determining the final assessment of a component” (15). Examples of minor deficiencies are treatment delays due to holidays, extended vacation, unexpected scheduling problems, or delayed delivery of an NCI-supplied drug. Since the number of major deviations will be reported in the final publication, steps should be taken to track deviations and classify them as major or minor on an ongoing basis during the course of treatment.
· Decide how adherence to oral antineoplastic or preventive agents will be tracked. Self-reports of adherence, medication calendar diaries, pill counts, records of prescription refills, and use of microelectronic monitoring systems (MEMS) have all been tested and used, but none has yet been adopted as the definitive gold-standard method for measuring adherence to oral agents (16).
· Identify the relevant concomitant medications that need to be tracked and whether their dosage and usage frequency will also be needed.
· Determine whether an off-treatment form should be used in addition to treatment-by-cycle forms. Such a form would capture the total number of cycles completed for each agent, dates of first and last cycles, and reason for ending (or never starting) protocol treatment.
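As a concrete sketch, such an off-treatment record might be modeled as follows (field names and reason codes are hypothetical, not taken from any NCI or CALGB template):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class OffTreatmentRecord:
    """One record per agent, submitted when protocol treatment ends.
    Field names and reason codes are illustrative only."""
    patient_id: str
    agent: str
    cycles_completed: int
    first_cycle_date: Optional[date]  # None if the agent was never started
    last_cycle_date: Optional[date]
    # e.g., "progression", "adverse event", "consent withdrawal",
    # "death", "completed maximum treatment", "other"
    off_treatment_reason: str

    def never_started(self) -> bool:
        # A patient randomized but never treated still needs follow-up
        # under the intent-to-treat principle.
        return self.cycles_completed == 0 and self.first_cycle_date is None
```

A record with zero completed cycles and no first-cycle date flags a patient who never began assigned treatment, which sites might otherwise fail to report promptly.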
This off-treatment form would allow sites that have fallen behind in their submission of treatment-by-cycle forms to report treatment discontinuation as soon as it happens.

Toxicity Assessment

Adverse events (AEs) in cancer clinical trials are monitored using two types of reporting: routine and expedited. Routine reporting of adverse events uses study-specific AE forms submitted at regular intervals during study treatment, while expedited reporting of serious adverse events (SAEs) is required when an
adverse drug experience is life-threatening “or results in death, inpatient hospitalization or prolongation of existing hospitalization, a persistent or significant disability/incapacity, or a congenital anomaly/birth defect.” Currently, SAEs are reported via the NCI Adverse Event Expedited Reporting System (AdEERS). Whether an SAE needs to be reported depends on the severity grade of the SAE, whether the primary SAE is expected or unexpected, the degree to which the SAE can be attributed to study treatment (i.e., unrelated, unlikely, possibly, probably, or definitely related), and the time since the last course of treatment. For more extensive discussion of AdEERS submission requirements or to access the AdEERS Web application, refer to the following CTEP Web sites:
http://ctep.info.nih.gov/protocolDevelopment/electronic_applications/docs/newadverse_2006.pdf
https://webapps.ctep.nci.nih.gov/openapps/plsql/gadeers_main$.startup
This chapter will focus instead on three issues concerning routine reporting of adverse events.
1. Which toxicities should be monitored? Because one of the goals in a clinical trial is to determine the cost/benefit ratio of a new treatment agent or regimen, the principal investigator and study team must select the AEs that will be systematically monitored during treatment, and thus specified in routine AE CRFs, so that the events can be graded on a cycle-by-cycle basis. Which, if not all, of the following adverse events should be regularly monitored?
· The most common ones (e.g., nausea, diarrhea, anorexia, pain, neuropathy, rash, fatigue) that might lead to early patient withdrawal from study treatment?
· Low-grade, persistent AEs (hypertension, confusion, dizziness) that are not serious, but that might be early warning signs of life-threatening SAEs?
· Platelet, neutrophil, and hemoglobin counts and kidney and liver function test results that monitor patient well-being and safety?
· Exam-based assessments of AEs (e.g., neuropathy, muscle weakness, dyspnea, cardiac arrhythmia) that would otherwise go unreported?
· Potentially serious AEs included in the CAEPR (Comprehensive Adverse Events and Potential Risks) list issued by the NCI for each investigational agent?
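For laboratory events like the blood counts just listed, one common approach is to capture the raw value plus its normal limits and derive the severity grade centrally. A minimal sketch, using neutrophil-count cutoffs that approximate the CTCAE neutropenia criteria (verify all thresholds, and the lower limit of normal, against the CTCAE version named in the protocol):

```python
def grade_anc(anc: float, lln: float = 2000.0) -> int:
    """Translate an absolute neutrophil count (cells/mm^3) into a
    CTCAE-style severity grade. Cutoffs approximate the CTCAE
    neutropenia criteria and must be checked against the CTCAE
    version specified in the protocol; the LLN default is illustrative."""
    if anc < 500:
        return 4  # life-threatening
    if anc < 1000:
        return 3
    if anc < 1500:
        return 2
    if anc < lln:
        return 1
    return 0      # within normal limits
```

A statistical center could apply functions like this to submitted lab panels and reconcile the derived grades against site-reported ones.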
2. What is the best way of assessing toxicities? To promote consistency in adverse event reporting in oncology trials, the NCI’s CTEP first developed the NCI Common Toxicity Criteria (CTC) in 1982 to provide the terminology and severity grading system for 49 commonly occurring AEs encountered during oncology treatment. By 2003, the table (CTCAEv3) had expanded to over 1,000 terms, and the latest version (CTCAEv4.0) now covers about 790 terms (17,18). As a result, toxicities in oncology trials have traditionally been assessed using the AE terms in the CTCAE version in effect at the time an AE form is designed. Recent studies (19–22) have questioned the usefulness of this type of AE reporting in the case of patient symptoms. When asked to assess symptoms such as fatigue, pain, nausea, vomiting, constipation, cough, and hot flushes using patient-reported outcome (PRO) questionnaires, patients can grade these symptoms as reliably as (if not more reliably than) clinical research associates (CRAs), who often abstract toxicity information from medical reports onto AE forms without any direct patient contact. Similarly, since measures of hypertension, kidney and liver function, and blood chemistries can be obtained objectively, it might be more efficient to ask CRAs to report actual levels of blood pressure, creatinine, bilirubin, and neutrophil/platelet/white blood cell counts, together with the lower and upper limits of their normal range, than to grade the events themselves. Upon receipt, statistical center programs would translate the lab values into severity grades, per CTCAE grading guidelines (http://safetyprofiler-ctep.nci.nih.gov/CTC/CTC.aspx). In summary, principal investigators play a major role in determining how toxicities will be assessed. They select the AEs that will be most closely monitored during treatment.
They decide whether AE reports will consist of CTCAE-based AE forms, PRO questionnaires, or study-specific CRFs that capture actual laboratory values.
3. The principal investigator will also need to decide how often toxicities will be assessed during treatment. Traditionally, toxicities have been assessed at the end of every treatment cycle to identify potentially serious AEs early during treatment. However, since toxicity summaries report the number of patients experiencing an AE at each grade level, and neither the total count of AEs experienced by all patients nor the number of cycles associated with each AE, principal investigators should
consider decreasing the frequency of toxicity assessments to alleviate the reporting burden on participating sites. Indeed, insisting that sites submit AE forms every 3–4 weeks when no toxicities are occurring is a waste of time, especially in prevention studies (e.g., selenium, statins, aspirin) where expected toxicities are few and rare, or in studies where treatment has been so successful that patients are being treated beyond 12 months! As this chapter is being written, a new Web-based software tool—caAERS—is being developed by the NCI-caBIG group. caAERS will be used by sites to report both CTCAE-based routine and expedited serious AEs in cancer clinical trials. The tool is scheduled for release in 2010; a Web demonstration of the application can be accessed via the NCI caBIG Web site: https://cabig.nci.nih.gov/tools/caAERS. Although implementation of caAERS will allow collection of AE data on a real-time basis and eliminate discrepancies in the reporting of SAEs associated with the current routine and expedited AE reporting systems, the principal investigator’s role in collecting adequate toxicity data in a clinical trial will remain unchanged. He/she will still have to identify the AEs requiring regular monitoring, decide whether the CTCAE-based approach of collecting toxicity information needs to be supplemented with patient-reported outcomes and objective laboratory assessments, and determine the most appropriate schedule for toxicity assessments.

Outcome Measures

Overall survival is the most common primary end point reported in randomized cancer clinical trials (23) and the easiest outcome variable to collect. NCI cooperative groups automatically send monthly delinquency notices or expectation reports to participating institutions if a patient’s survival status is not updated every 6 months during the first 2 years after enrollment and annually thereafter.
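The follow-up schedule just described (every 6 months for the first 2 years after enrollment, then annually) can be sketched as a due-date generator. This is a simplification: actual cooperative-group expectation reports also allow grace periods.

```python
from datetime import date

def survival_update_due_dates(enrollment: date, total_years: int = 5) -> list:
    """Due dates for survival-status updates: every 6 months for the
    first 2 years after enrollment, then annually (simplified sketch)."""
    def add_months(d: date, months: int) -> date:
        years, month0 = divmod(d.month - 1 + months, 12)
        # clamp the day to 28 so month arithmetic never yields an invalid date
        return date(d.year + years, month0 + 1, min(d.day, 28))
    due = [add_months(enrollment, 6 * k) for k in range(1, 5)]  # months 6-24
    due += [add_months(enrollment, 24 + 12 * k) for k in range(1, total_years - 1)]
    return due
```

A delinquency report would then compare each patient's last recorded survival update against the next due date.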
If a patient is still on study, survival dates are obtained from regularly submitted laboratory reports, treatment, AE, follow-up, and tumor measurement CRFs. The real challenge starts when protocol treatment is discontinued and the patient has not yet progressed but refuses to return for follow-up exams. An even greater challenge is posed by patients who never start assigned treatment in phase III studies. Because phase III studies adhere to the intent-to-treat principle, even these patients need to be followed until study end
points have been met. Since these two groups of patients could become lost to follow-up, CALGB does not officially declare patients as lost or remove them from the data delinquency lists used to evaluate institutional performance until patients have been lost for two consecutive years. Sites that have proactively requested contact information for relatives, employers, and primary care physicians at enrollment and obtained permission to contact them may reduce these losses. Although the Social Security Death Index and local tumor registries can be regularly checked to determine whether patients are still alive, these registries cannot be used if patients have withdrawn consent from survival follow-up or Social Security numbers are unavailable. Because of the numerous difficulties encountered with long-term follow-up and the increasing need to evaluate promising new treatments at a faster rate, surrogates of overall survival have been proposed (24–25). The validity of using progression-free survival (PFS), time-to-progression (TTP), failure-free survival (FFS), objective disease response/progression, biomarker response/progression, or other measures as surrogate end points will be addressed in other chapters. Instead, this chapter will focus on the problems associated with collecting accurate outcome measures from follow-up, tumor measurement, and biomarker forms. Standard information collected in follow-up forms consists of: patient vital status (alive or dead), date of last contact (or date of death), cause of death (if the patient is dead), and, if the patient is alive, dates of last clinical assessment, first local-regional progression (and site), first distant progression (and site), first non-protocol therapy (and type), new primary (and site), and reports of any long-term toxicities (grade ≥3) prior to recurrence or new primary.
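The standard follow-up record and the two-year lost-to-follow-up convention described above might be sketched as follows (field names are hypothetical, not drawn from an actual CALGB form):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FollowUpRecord:
    """Simplified follow-up CRF record; field names are illustrative."""
    patient_id: str
    alive: bool
    last_contact: date  # date of death if not alive
    first_local_regional_progression: Optional[date] = None
    first_distant_progression: Optional[date] = None
    first_non_protocol_therapy: Optional[date] = None

    def lost_to_follow_up(self, today: date) -> bool:
        # CALGB convention described in the text: a living patient is
        # declared lost only after two consecutive years without contact.
        if not self.alive:
            return False
        try:
            cutoff = self.last_contact.replace(year=self.last_contact.year + 2)
        except ValueError:  # Feb 29 anniversary in a non-leap year
            cutoff = self.last_contact.replace(year=self.last_contact.year + 2, day=28)
        return today > cutoff
```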
These variables usually provide all the information needed to calculate time to progression or failure in most adjuvant and neoadjuvant solid tumor studies where the patient has been diagnosed as disease-free at the end of study treatment. However, if the patient has metastatic disease, is not eligible for surgery, and is allowed to continue study treatment until progression, determining when the patient has responded or progressed becomes more complex if he/she is being followed with both tumor and biomarker measurements. If these patients are prematurely diagnosed as having progressed, treatment will be discontinued early, depriving them of the full benefit of treatment to which they are entitled. Sometimes, sites may not even realize patients have progressed and erroneously deliver extra courses of treatment. Special efforts should be made to ensure that tumor measurements are reported accurately. For
example, if target lesions are present at baseline, but omitted from baseline tumor measurement forms, they may be misrepresented as new at restaging, and consequently, misinterpreted as evidence of progression. Moreover, if treatment efficacy is being evaluated in terms of tumor response, radiologists should be trained and certified in the tumor measurement system—RECIST 1.0 (26) or RECIST 1.1 (27)—designated in the protocol so they can precisely identify, follow, and evaluate changes in target and nontarget lesions. They should be encouraged to complete study measurement forms themselves or submit the equivalent ones used at their local institution. Completion of measurement forms by untrained CRAs on the basis of measurements provided in scan reports should be discouraged, especially if different radiologists are performing the evaluations. Optimally, all previous scans (including baseline) should be available for comparison at each restaging so that all target and nontarget lesions are consistently followed at each timepoint. Timing of baseline and subsequent assessments is also important. Care should be taken to collect baseline values within the pre-treatment time frame set in the protocol. Take the case of a prostate cancer patient with rapidly rising PSAs. If the PSA assessment is not repeated before Cycle 1, and an earlier, lower PSA acts as the reference value, PSA progression criteria may be met prematurely, cutting short a treatment that might have started to take effect and resulted in PSA declines had an extra course been administered. With respect to restaging scans, investigators should determine a priori whether they should be scheduled at fixed time intervals or tied to the number of cycles completed, and decide whether restagings should be delayed when a cycle is delayed. Indeed, the timing of restaging scans affects estimates of time to progression and should be carefully considered prior to study activation and specified in the study protocol.
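The reference-value pitfall can also be guarded against programmatically. A sketch that selects the PSA reference as the most recent pre-treatment value inside a baseline window (the 28-day window is an assumption for illustration, not a protocol value):

```python
from datetime import date, timedelta
from typing import List, Tuple

def baseline_psa(psa_history: List[Tuple[date, float]],
                 cycle1_start: date,
                 window_days: int = 28) -> float:
    """Return the PSA value that should serve as the reference for
    progression: the most recent draw on or before Cycle 1 and within
    the baseline window. Raises ValueError when no draw qualifies,
    signaling that the assessment must be repeated before treatment.
    The 28-day window is illustrative, not a protocol requirement."""
    earliest = cycle1_start - timedelta(days=window_days)
    eligible = [(d, v) for d, v in psa_history if earliest <= d <= cycle1_start]
    if not eligible:
        raise ValueError("no PSA within the baseline window; repeat the assessment")
    return max(eligible)[1]  # the latest draw date wins
```

Flagging a stale or missing baseline at entry time prevents an earlier, lower PSA from prematurely triggering progression criteria.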
In studies where patients continue study treatment until progression, two important issues also need to be discussed. First, study protocol progression criteria have to be clearly defined, and if different from established convention or practice, the differences should be underscored. For example, in CALGB 90401, bone progression has not occurred if two or fewer new bone lesions are detected at restaging. Measurable disease progression criteria have not been met either if the sum of the longest diameters of target lesions has increased <20% or two new (non-bone) lesions have not yet been observed. In other solid tumor studies, the appearance of a single new lesion—measurable or non-measurable—constitutes progression. If treating physicians, radiologists, or CRAs are unaware of these differences, progression may be incorrectly diagnosed
and lead to premature discontinuation of treatment. Second, if a patient’s tumor and biomarker assessments conflict with each other with respect to progression, under what circumstances should patients be allowed to continue study treatment? For instance, should treatment be allowed to continue if PSA progression criteria have been met but measurable disease is in partial response and bone disease has been stable? Finally, investigators should review follow-up forms thoroughly to ensure that all outcome measures needed to meet study objectives have been included. For example, in the case of prostate cancer patients, the follow-up forms may need to be modified not only to capture measurable disease, bone disease, and biochemical progression dates, but also to report the progression criteria met in each case. If non-protocol therapy information is collected, will information about the first non-protocol therapy be sufficient, or does the sequence and type of all non-protocol therapies need to be captured? The sooner these decisions are made, the higher the probability that all appropriate outcome measures will be systematically and correctly reported in the CRFs.
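Under one reading of the CALGB 90401 criteria quoted earlier, and purely as an illustration of how study-specific progression rules can be encoded and shared with sites (the protocol text itself always governs), such a checker might look like:

```python
def calgb90401_progression(new_bone_lesions: int,
                           baseline_sum_ld: float,
                           current_sum_ld: float,
                           new_nonbone_lesions: int) -> bool:
    """One reading of the study-specific rules described in the text:
    bone progression requires three or more new bone lesions, and
    measurable-disease progression requires a >=20% increase in the
    sum of longest diameters of target lesions or at least two new
    non-bone lesions. Illustrative only; consult the protocol."""
    bone_progression = new_bone_lesions >= 3
    measurable_progression = (
        (baseline_sum_ld > 0 and current_sum_ld >= 1.2 * baseline_sum_ld)
        or new_nonbone_lesions >= 2
    )
    return bone_progression or measurable_progression
```

Encoding the rules this way makes the study-specific differences explicit, so a site cannot mistakenly apply the conventional single-new-lesion rule.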
DATA COLLECTION, DATA MANAGEMENT AND QUALITY CONTROL

Data quality control starts at the form design stage. Some principles of good form design are listed below (4, 7–8).
· In accordance with the parsimony rule, only those variables needed to answer study objectives should be collected.
· Headers of CRFs should include study name and number, form name and number, unique patient identification number, patient initials, treating institution name or ID, and date of visit or start date of the reporting period covered by the form to differentiate multiple submissions of the same form. An “amended” form box should also be placed at the top of the first page to identify re-submitted amended forms.
· The body of the forms should consist of questions arranged in logical order. Closed-ended questions that provide a list of preselected answers from which a respondent selects either one answer or all answers that apply are preferable because they allow responses to be easily tabulated and analyzed. If only one answer applies, choices should be mutually exclusive, and an “other” category should be available in case choices are not exhaustive. The “select one” approach (i.e., No, Yes, or Unknown) is preferable
when collecting baseline symptom information, to make sure every symptom or preexisting condition is evaluated. Open-ended questions that have to be answered with text strings, subject to misspellings and interpretation, should be avoided wherever possible.
· Footers should specify form and version number and the date each version was created in order to ensure that the latest version of a form is used. Each form should end with identification of the person completing the form and the date the form was completed.
· Questions should be kept as short as possible and employ user-friendly terms. Redundant (e.g., age and birth date) and compound questions should be avoided.
· The same coding conventions should be followed in every CRF. If a question requires a “No/Yes” answer, the choices should always appear in the same sequence, and the same codes used to represent “No” (1) or “Yes” (2). Similarly, if a distinction doesn’t need to be made between “unknown,” “not applicable,” “not available,” and “not done,” it should always be represented by the same code (e.g., 99).
· When requesting dates, their format should be labeled (e.g., as mm-dd-yyyy, or mm-yyyy if the actual day is not needed) to avoid confusion. When requesting decimal values, decimal points should be preprinted in the forms.
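The coding conventions above translate directly into machine checks at data entry. A hedged sketch using the codes given in the text (No = 1, Yes = 2, 99 for unknown; labeled mm-dd-yyyy dates); the function names are invented for illustration:

```python
from datetime import datetime

NO, YES, UNKNOWN = 1, 2, 99  # the same codes on every CRF, per the conventions above

def check_yes_no(field_name: str, value) -> list:
    """Return validation errors for a No/Yes/Unknown item."""
    if value not in (NO, YES, UNKNOWN):
        return [f"{field_name}: {value!r} is not a valid code (1=No, 2=Yes, 99=Unknown)"]
    return []

def check_date(field_name: str, value: str) -> list:
    """Validate a labeled mm-dd-yyyy date string."""
    try:
        datetime.strptime(value, "%m-%d-%Y")
        return []
    except ValueError:
        return [f"{field_name}: {value!r} is not a valid mm-dd-yyyy date"]
```

Checks like these reject bad values at entry time instead of during later database cleanup.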
Carefully designed forms also need to be accompanied by well-written instructions and be field tested by experienced and novice clinical staff members. Experienced staff can verify that the questions are presented in logical order and do not request redundant information, that the number of boxes and units available for recording laboratory values are correct, and that alternatives in multiple-choice questions are mutually exclusive and exhaustive if the respondent is to select only one choice. Testing will have been successful if beginner CRAs are able to efficiently complete the CRFs using information abstracted from medical records. If data are electronically entered, or scanned in, built-in checks at the point of data entry are best. For example, rules can be set up to reject values of an incorrect field type (i.e., numeric vs. character), out-of-range or missing values, duplicate data or data out of sequence, and data from patients with ID numbers that do not match those of patients registered to a study. But the best way to optimize implementation of a study protocol and data collection processes is the proactive way: by creating a “Manual of Operations” that describes in detail the procedures and data
submission requirements during each phase of a study (3, 7). Below is a suggested list of contents for such a manual.
· Study schema
· Procedures for clinic and physician certification with the Clinical Trials Support Unit (CTSU), including submission of institutional review board (IRB) approval documentation to the Central Regulatory Office, and of Food and Drug Administration (FDA) Form 1572 investigator registration materials to the NCI Pharmaceutical Management Branch (PMB)
· Drug ordering policies if agents are not commercially available; drug accountability requirements if agents are supplied by the PMB or a pharmaceutical company
· Handouts/letters used to recruit physicians and patients
· Screening instructions—eligibility requirements, list of required prestudy imaging and laboratory assessments and their scheduling prior to enrollment, and special instructions for calculating variables based on two lab values (e.g., urine protein:creatinine [UPC] ratio or creatinine clearance) required for eligibility purposes
· Instructions for collecting, submitting, and storing specimens, especially those needing central morphology review to determine patient eligibility
· Reminder to collect baseline questionnaires, prior to randomization, from patients who have consented to participate
· Description of information required for registration and of the actual patient registration process via the Web or telephone
· Description of companion studies (optional vs.
mandatory) and whether they require separate informed consent or registration
· Latest date by which treatment must start
· General data submission instructions for remote data entry versus mailed or faxed paper forms and originals versus amended forms
· Patient study calendar and sample checklist cataloging all procedures and tests to be done and the forms to be completed:
· At the end of each cycle of treatment
· At restaging
· At the end of each phase of treatment (if more than one)
· At discontinuation of all protocol treatment
· During posttreatment long-term follow-up for patients who have or have not yet progressed
· At the time of new primaries or secondary malignancies
· At death
ONCOLOGY CLINICAL TRIALS
· Copies of all case report forms and detailed forms instructions
· Instructions for requesting and reporting protocol exceptions or deviations (treatment delays, missing tests or samples)
· Definitions of response and progression criteria (i.e., biomarker levels, RECIST criteria of measurable/nonmeasurable disease, bone progression), highlighting those that are study-specific
· Definition of criteria for stopping protocol treatment or switching to a new treatment phase (e.g., from double-blinded to open-label agent, from induction to consolidation therapy)
· Types of source documentation (e.g., laboratory, imaging, and pathology reports; flow sheets; physical exam forms) that must accompany certain CRFs to verify eligibility, response, progression, and any other critical events to be determined by the study chair
· Description of the query tracking system and appropriate responses to queries
While the study protocol, with its supplementary appendices and handouts, currently provides some of the preceding information, the Manual of Operations would organize procedures in the sequence they are to be carried out, from study entry until study end points are attained. The manual itself would be frequently updated to reflect questions fielded by the principal investigator and the protocol and data coordinators, and would thus form the basis of a constantly evolving training program. The manual could direct institution personnel to training Web sites for recently implemented systems such as RECIST 1.1 and CTSU OPEN Web registration and, in the coming year, caAERS and the new CALGB specimen tracking system. The advantages of such a manual are irrefutable, but the manual will probably remain unwritten without encouragement and support from the principal investigator. In the long run, the manual will allow data managers in large phase III studies to spend less time fielding questions and more time reviewing incoming data and maintaining a clean database—all on a real-time basis.
After data have been entered into a central database, whether via paper forms or electronic remote data entry, the data cleanup phase will be initiated. Edit programs will be run by the statistical center to confirm that patients have been properly randomized and to identify missing or out-of-sequence forms, inconsistent data between different forms (e.g., treatment and AE forms) covering the same reporting period or the same form submitted at multiple time points, impossible values, and discrepancies arising from combined or derived data sets (9, 28). In addition, data management will:
· Manually perform eligibility checks from information provided in on-study forms and laboratory and imaging reports.
· Reconcile discrepancies in the SAEs reported in routine AE forms and AdEERS reports. (Note: These discrepancies will be automatically checked at data entry in the soon-to-be-released Web-based NCI caAERS reporting system.)
· Verify that measures of response and progression reported in follow-up and solid tumor evaluation forms agree and are supported by imaging or laboratory documentation.
· Send out data correction and clarification queries when errors and inconsistencies are discovered during edit checks and manual reviews.
After query responses are received, corrections will be made to the central database, if necessary, and the query tracker system will note whether queried issues have been resolved. In addition, site audits will be performed on a sample of patients to verify that regulatory forms have been submitted; drug supplies are accounted for; eligibility criteria have been met; and data reported in on-study, treatment, AE, and follow-up forms are confirmed by source documentation. While we would like to depend on software developed for electronic data capture and clinical data management to automatically monitor all incoming data for quality assurance purposes, whether treatment, toxicity, and outcome measures have been accurately and completely reported for each patient will ultimately be determined by the principal investigator at the time of case evaluations. Generally, case evaluations are not allowed until just prior to the first external publication. We propose that the principal investigator be allowed and encouraged to review the records of the first 10 to 20 patients who reach their primary end points to confirm that all the key data variables and source documentation have been collected satisfactorily.
Although this practice may seem irregular and radical at first, it will facilitate the implementation of necessary modifications to the study protocol and CRFs and improve the accuracy of data collected from the remaining patients. This practice will also confirm that CRFs have been properly designed to answer the scientific questions posed by a study.
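The automated edit checks described above can be sketched in a few lines; this is an illustrative toy, not a real statistical-center system, and the form fields and plausibility bounds are assumptions:

```python
def edit_check(forms):
    """Flag missing or out-of-sequence cycle forms and impossible values;
    returns a list of query messages for the data manager to send out."""
    queries = []
    cycles = sorted(f["cycle"] for f in forms)
    # Detect missing, duplicated, or out-of-sequence cycle numbers
    expected = list(range(1, max(cycles) + 1)) if cycles else []
    if cycles != expected:
        queries.append(f"cycle sequence {cycles} != expected {expected}")
    for f in forms:
        # Impossible values (bounds here are hypothetical examples)
        if not 0 <= f.get("dose_mg", 0) <= 10000:
            queries.append(f"cycle {f['cycle']}: implausible dose {f['dose_mg']}")
        if f.get("ae_grade") not in (None, 0, 1, 2, 3, 4, 5):
            queries.append(f"cycle {f['cycle']}: invalid AE grade {f['ae_grade']}")
    return queries


forms = [{"cycle": 1, "dose_mg": 75, "ae_grade": 2},
         {"cycle": 3, "dose_mg": 75, "ae_grade": 7}]
for q in edit_check(forms):
    print(q)  # flags the missing cycle 2 and the invalid grade 7
```

Real edit programs add cross-form consistency checks (e.g., between treatment and AE forms for the same reporting period), but the pattern of mechanically generated, trackable queries is the same.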
SUMMARY
If NCI-caBIG efforts are successful, all patient data, regardless of source, will eventually reside in a centralized database maintained by the study
16 DATA COLLECTION
coordinating group. The ultimate suite of software tools will screen patients for eligibility; register eligible patients; create patient study calendars to schedule treatment, laboratory assessments, and specimen collections; collect treatment compliance data and routine AEs and SAEs, together with results of laboratory, imaging, and biomarker assessments; monitor protocol deviations; and issue and track queries until missing and inconsistent data issues are resolved. The principal investigator's data collection responsibilities described in this chapter will nevertheless remain unchanged and are summarized below. In collaboration with study team members, the principal investigator's primary data collection duty is to guarantee that the data gathered in CRFs will meet study objectives by:
· Identifying the disease-specific key variables collected in on-study forms (i.e., baseline characteristics of disease extent [imaging, pathology, and biomarker data], health and functional status, and medical history [prior treatment, concurrent medications, and preexisting conditions])
· Selecting the treatment variables to be collected on a cycle-by-cycle basis, at the end of each treatment phase, and at discontinuation of treatment
· Deciding whether treatment data should be limited to study agents only or extended to concurrent medications, and how oral agent compliance will be assessed
· Selecting the toxicities to be regularly monitored and the AE assessment instruments (i.e., CTCAE-based adverse event forms, patient-reported outcome questionnaires, quality-of-life/pain questionnaires)
· Choosing the outcome variables best fitted to evaluate treatment efficacy
· Identifying the critical variables that need to be verified with source documentation
· Supporting the development of training materials (i.e., manual of operations and forms instructions) that promote clinical data management best practices
· Performing thorough case evaluations of treatment adherence, toxicities, and outcome
measures on a timely basis before final results are published
References
1. Girling DJ, Parmar MKB, Stenning SP, Stephens R, Stewart L. Clinical Trials in Cancer: Principles and Practice. Oxford, England: Oxford University Press; 2003.
2. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC; 2003.
3. Holbrook JT, Shade DM, Wise RA. Data collection and quality control. In Schuster DP, Powers WJ, eds. Translational and Experimental Clinical Research. Philadelphia, PA: Lippincott Williams & Wilkins; 2005:136–151.
4. Voorhees J, Scheipeter ME. Case report form development. In Schuster DP, Power WJ, eds. Translational and Experimental Clinical Research. Philadelphia, PA: Lippincott Williams & Wilkins; 2005:122–135.
5. Good PI. A Manager's Guide to the Design and Conduct of Clinical Trials. 2nd ed. New York: Wiley & Sons; 2006.
6. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2007.
7. McFadden E. Management of Data in Clinical Trials. 2nd ed. Hoboken, NJ: Wiley & Sons; 2007.
8. Prokscha S. Practical Guide to Clinical Data Management. 2nd ed. Boca Raton, FL: Taylor & Francis; 2007.
9. Cook TD, DeMets DL, eds. Introduction to Statistical Methods for Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2008.
10. Kelly WK, Scher HI, Mazumdar M, Vlamis V, Schwartz M, Fossa SD. Prostate-specific antigen as a measure of disease outcome in metastatic hormone-refractory prostate cancer. J Clin Oncol. 1993;11:607–615.
11. Halabi S, Small EJ, Kantoff PW, et al. Prognostic model for predicting survival in men with hormone-refractory metastatic prostate cancer. J Clin Oncol. 2003;21:1232–1237.
12. Scappaticci FA, Skillings JR, Holden SN, et al. Arterial thromboembolic events in patients with metastatic carcinoma treated with chemotherapy and bevacizumab. J Natl Cancer Inst. 2007;99:1232–1239.
13. Bubley GJ, Carducci M, Dahut W, et al. Eligibility and response guidelines for phase II clinical trials in androgen-independent prostate cancer: recommendations from the PSA Working Group. J Clin Oncol. 1999;17:3461–3467.
14. Scher HI, Halabi S, Tannock I, et al. Design and end points of clinical trials for patients with progressive prostate cancer and castrate levels of testosterone: recommendations of the Prostate Cancer Clinical Trials Working Group. J Clin Oncol. 2008;26:1148–1159.
15. Guidelines for monitoring of clinical trials for cooperative groups, CCOP research bases, and the Clinical Trials Support Unit. October 2006. (http://ctep.info.nih.gov/branches/ctmb/clinicalTrials/docs/2006_ctmb_guidelines.pdf)
16. Partridge AH, Avorn J, Wang PS, Winer EP. Adherence to therapy with oral antineoplastic agents. J Natl Cancer Inst. 2002;94:652–661.
17. Trotti A, Colevas AD, Setser A, Basch E. Patient-reported outcomes and the evolution of adverse event reporting in oncology. J Clin Oncol. 2007;25:5121–5127.
18. Chen A. Introduction to NCI common toxicity criteria adverse event (CTCAE) v4. Presented at Cancer and Leukemia Group B Clinical Research Associates Committee meeting, June 25, 2009. (http://www.calgb.org/Public/meetings/presentations/2009/summer_group/cra_committee/04_CTCAE-Chen_062009.pdf)
19. Basch E, Artz D, Dulko D, et al. Patient online self-reporting of toxicity symptoms during chemotherapy. J Clin Oncol. 2005;23:3552–3561.
20. Basch E, Iasonos A, McDonough A, et al. Patient versus clinician symptom reporting using the National Cancer Institute Common Terminology Criteria for Adverse Events: results of a questionnaire-based study. Lancet Oncol. 2006;7:903–909.
21. Bruner DW. Should patient-reported outcomes be mandatory for toxicity reporting in cancer clinical trials? J Clin Oncol. 2007;25:5345–5347.
22. Bruner DW, Bryan CJ, Aaronson N, et al. Issues and challenges with integrating patient-reported outcomes in clinical trials supported by the National Cancer Institute-sponsored clinical trials network. J Clin Oncol. 2007;25:5051–5055.
23. Mathoulin-Pelissier S, Gourgou-Bourgade S, Bonnetain F, Kramar A. Survival end point reporting in randomized cancer clinical trials: a review of major journals. J Clin Oncol. 2008;26:3721–3726.
24. Burzykowski T, Buyse M, Piccart-Gebhart MJ, et al. Evaluation of tumor response, disease control, progression-free survival, and time to progression as potential surrogate end points in metastatic breast cancer. J Clin Oncol. 2008;26:1987–1992.
25. Halabi S, Vogelzang NJ, Kornblith AB, et al. Pain predicts overall survival in men with metastatic castration-refractory prostate cancer. J Clin Oncol. 2008;26:2544–2549.
26. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate the response to treatment in solid tumors (RECIST guidelines). J Natl Cancer Inst. 2000;92:205–216.
27. Eisenhauer EA, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer. 2009;45:228–247.
28. Grady D, Hulley SB. Implementing the study and quality control. In Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2007:271–289.
17
Reporting of Adverse Events
Carla Kurkjian Helen X. Chen
An essential component in the development of a cancer drug is the reporting of adverse events (AEs) during all phases of clinical trials. The accuracy of AE reporting can have far-reaching consequences for the safety of patients on study as well as for the subsequent use and development of cancer therapeutic agents. All physicians who sign the Food and Drug Administration (FDA) Form 1572 investigator registration form commit to accurate reporting. Information gained from AE reporting, such as cumulative rates of events and details of specific major events, could significantly impact ongoing and future trials utilizing the study agent, particularly with regard to patient inclusion criteria, safety monitoring plans, prophylaxis, and management of toxicities. The development of a toxicity profile for a given agent is the culmination of an extensive process that includes the patient and clinician interaction, the recording of signs and symptoms, and the appropriate reporting of AEs. In this chapter, we will discuss the basic principles of AE reporting from an investigator's perspective.
BACKGROUND OF ONCOLOGY ADVERSE EVENT REPORTING
The routine reporting of AEs dates back over 20 years, to the advent of prospective oncology trials. From a common terminology to describe toxicities to the publication of clinical trial results, the accurate reporting of encountered AEs is paramount. As a means of standardizing AE reporting among the many trial sites, the National Cancer Institute (NCI) developed the Common Toxicity Criteria (CTC), a lexicon of commonly seen AEs that has been in use since 1984. The CTC was revised in 2002, resulting in the NCI Common Terminology Criteria for Adverse Events (CTCAE), which includes over 1,000 terms encompassing events associated not only with oncology drug administration but also with radiation therapy and surgical procedures. In addition to providing a listing of AE terms, the CTC and CTCAE systems include severity grading for each AE (from 1 = mild, to 5 = resulting in death). Another database utilized internationally is the Medical Dictionary for Regulatory Activities (MedDRA) (1, 5). This system was developed by the International Conference on Harmonisation and contains over 65,000 terms, which include AEs for all areas of medicine. Unlike the NCI CTCAE, MedDRA is not restricted to oncology disciplines and does not include a severity grading system. The MedDRA system is commonly used by industry sponsors for reporting to regulatory agencies;
The content of this chapter does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
TABLE 17.1
Numeric Coding Used to Describe the Attribution of an AE Based on Relatedness to the Treatment.

CODE   ATTRIBUTION   DESCRIPTION
1      Unrelated     The AE is clearly not related to the study treatment.
2      Unlikely      The AE is doubtfully related to the study treatment.
3      Possible      The AE may be related to the study treatment.
4      Probable      The AE is likely related to the study treatment.
5      Definite      The AE is clearly related to the study treatment.
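The attribution codes of Table 17.1, together with a CTCAE severity grade, are what a data-capture system typically stores for each AE term; a minimal illustrative record (the field names are assumptions, not a real CRF schema):

```python
# Attribution codes per Table 17.1; grade descriptions per the CTCAE scale
ATTRIBUTION = {1: "Unrelated", 2: "Unlikely", 3: "Possible",
               4: "Probable", 5: "Definite"}
GRADE = {1: "Mild", 2: "Moderate", 3: "Severe and undesirable",
         4: "Life-threatening or disabling", 5: "Death"}

# A hypothetical AE record as it might appear in a study database
ae = {"term": "Fatigue", "grade": 2, "attribution": 3}
print(f"{ae['term']}: grade {ae['grade']} ({GRADE[ae['grade']]}), "
      f"attribution {ae['attribution']} ({ATTRIBUTION[ae['attribution']]})")
# Fatigue: grade 2 (Moderate), attribution 3 (Possible)
```

Storing the codes rather than free text is what makes automated consistency checks and expedited-reporting triggers possible downstream.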
however, the NCI CTCAE has been utilized as the worldwide standard dictionary for reporting AEs in cancer clinical trials and in scientific journals.
BASIC PRINCIPLES OF AE REPORTING
An AE in oncology trials is defined as any unfavorable or unintended sign (including an abnormal laboratory or radiographic result), symptom, or disease that is temporally associated with the use of a drug, regardless of whether the event can be attributed to the agent. Once an AE is identified through clinical or laboratory evaluation, it must be graded according
to the CTCAE, and an attribution must be assigned. The attribution of the AE is a determination of the relatedness of the AE to the medical intervention(s) investigated in the trial (see Table 17.1). All adverse events are graded on a 5-point scale of general proportionality: 0 = no event or within normal limits; 1 = mild adverse event; 2 = moderate adverse event; 3 = severe and undesirable adverse event; 4 = life-threatening or disabling adverse event; 5 = death. In general, AEs in a trial are collected by two mechanisms: routine reporting through submission of Case Report Forms (CRFs) for all AEs as required by the protocol at fixed intervals, and expedited reporting of serious AEs (SAEs) through dedicated reporting systems. The criteria for AEs requiring routine or expedited
TABLE 17.2
AE Reporting Responsibility of Investigators for Protocols Under IND.

For trials under CTEP-held IND:
· Periodic reporting of all AEs mandated by the protocol:
  · Phase I/II: Report to Cancer Treatment Monitoring System (CTMS) or Cancer Database Updating System (CDUS)
  · Phase III (by cooperative groups): Report to the cooperative group data center
· Expedited reporting for SAEs: Report to CTEP through AdEERS per AdEERS guidelines

For trials under group or industry IND:
· Periodic reporting of all AEs mandated by the protocol: Report to site/group data center
· Expedited reporting for SAEs: Report to data center per protocol-defined criteria for expedited AE reporting

For trials under investigator-held IND (investigator/group also functions as the IND holder):
· Periodic reporting of all AEs mandated by the protocol: Report to FDA through Annual Report or New Drug Application
· Expedited reporting for SAEs: Report to the FDA per FDA's guidelines for unexpected serious events attributable to the investigational therapy
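The reporting responsibilities summarized in Table 17.2 amount to a routing rule keyed on who holds the IND; a simplified, illustrative sketch (the protocol and current sponsor guidelines always govern in practice):

```python
def ae_report_destination(ind_holder, expedited, phase=None):
    """Where an investigator's AE report goes, per Table 17.2 (simplified).
    ind_holder: "CTEP", "group/industry", or "investigator"."""
    if ind_holder == "CTEP":
        if expedited:
            return "CTEP via AdEERS"
        # Routine reporting: phase III cooperative-group trials report to
        # the group data center; phase I/II trials report to CTMS or CDUS
        return "cooperative group data center" if phase == 3 else "CTMS or CDUS"
    if ind_holder == "group/industry":
        return "site/group data center"
    if ind_holder == "investigator":
        # The investigator is also the sponsor, so reports go straight to FDA
        return "FDA (IND safety report)" if expedited else "FDA (annual report/NDA)"
    raise ValueError(f"unknown IND holder: {ind_holder}")


print(ae_report_destination("CTEP", expedited=True))  # CTEP via AdEERS
```

The key asymmetry the table encodes: investigators normally report to the IND sponsor, and only sponsors report to the FDA.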
reporting in a specific protocol may vary with the level of experience with the study regimen, the clinical setting, or the phase of the clinical trial. For clinical trials conducted under an investigational new drug (IND) application, the sponsor of the IND is responsible for reporting to the regulatory authorities (e.g., the FDA in the United States). Therefore, the guidelines instituted by an IND sponsor should also ensure compliance with the requirements of the regulatory authorities. Typically, investigational agents are the subject of an IND; however, occasionally an IND may be filed for a commercial agent if it is being tested for a novel indication. The responsibilities of an investigator or a research group in AE reporting depend on whether an IND application related to the trial has been made and whether those conducting the research are holding the IND (Table 17.2). In principle, the individual or research group is responsible for reporting to the sponsor of the IND, and the sponsor is responsible for reporting to the FDA. If the investigator or research group is also the holder of the IND, then they are directly responsible for reporting to the FDA. A detailed description of the guidelines for routine and expedited AE reporting for NCI-sponsored trials is beyond the scope of this chapter. The following are highlights of key elements of the requirements for investigators and sponsors.
FDA REPORTING GUIDELINES
All clinical trials conducted in the United States must comply with regulatory requirements as set forth by the FDA. Clinical trials that include sites outside the United States must also comply with the rules of the local regulatory authorities. For trials proposed in an IND application, the FDA requires the sponsor to submit IND annual reports, which should include a comprehensive list of all AEs reported in association with the agent over the year. For clinical trials that are used to support approval in a therapeutic indication, a comprehensive toxicity profile should also be provided at the time a New Drug Application (NDA), Biologic License Application (BLA), or supplemental NDA/BLA is submitted. In addition, the FDA requires expedited reporting of unexpected serious AEs that occurred, with reasonable probability, as a result of the investigated drug. SAEs may be defined as the following: those that result in death or are life-threatening, those that cause hospitalization or prolongation of a hospitalization, those that result in significant disability or congenital anomaly, and those that require medical or surgical intervention to prevent one of the above-listed outcomes (2). Unexpected AEs are defined as
those not listed in the investigator brochure or package insert of the agent (if the agent is FDA-approved). If an investigator brochure is not available, the agent information in either the treatment protocol or the IND application may be used to determine expectedness. In short, for AEs that are both serious and unexpected, and that have a possible, probable, or definite relationship to the agent, expedited reporting to the FDA is required. In addition, expedited reporting is required for AEs that result in findings of mutagenicity or teratogenicity. Expedited reporting to the FDA means the notification must occur as soon as possible, and no later than 15 calendar days after the sponsor's knowledge of the event. A form is provided by the FDA (MedWatch Form 3500A) for notification purposes. Information to be included in the MedWatch Form can be found at the following Web site: (3). Alternatively, a narrative text is accepted but must indicate that the document for submission is an "IND Safety Report." Each written IND safety report must also include any previously filed safety reports related to a similar adverse experience with the IND. Any unexpected adverse event that is life-threatening or fatal and is associated with the use of the drug must also be reported to the FDA by telephone or facsimile no later than 7 calendar days after the sponsor first learns of the event. The FDA defines the phrase "associated with the use of the drug" as meaning there is a reasonable possibility that the event may have been due to the investigational agent.
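The 7-day and 15-day timelines above can be expressed as a small deadline calculator; this is an illustrative sketch of the rules as described in this chapter, not a substitute for the applicable regulations:

```python
from datetime import date, timedelta


def fda_expedited_deadline(knowledge_date, serious, unexpected, related,
                           life_threatening_or_fatal):
    """Sketch of the FDA expedited-reporting timelines: 7 calendar days
    (phone/fax) for unexpected fatal or life-threatening events associated
    with the drug; 15 calendar days (written IND safety report) for other
    serious, unexpected, related events; otherwise routine reporting only."""
    if unexpected and related and life_threatening_or_fatal:
        return knowledge_date + timedelta(days=7)
    if serious and unexpected and related:
        return knowledge_date + timedelta(days=15)
    return None  # handled through routine periodic reporting


d = fda_expedited_deadline(date(2009, 6, 1), serious=True, unexpected=True,
                           related=True, life_threatening_or_fatal=False)
print(d)  # 2009-06-16
```

Note that the clock starts at the sponsor's knowledge of the event, not the event date itself, which is why prompt investigator-to-sponsor notification matters.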
NCI/CTEP REPORTING GUIDELINES
Among the many functions of the Cancer Therapy Evaluation Program (CTEP) of the NCI are its sponsorship of trial funding and of IND applications. In this capacity, CTEP must ensure that sponsored trials are conducted in compliance with federal regulations, in addition to monitoring the safety and proper conduct of the trial. Consistent with the general practice of clinical trials, AE reporting for CTEP-sponsored trials is also divided into routine periodic reporting and expedited reporting. Therefore, after an AE has been identified, graded, and assigned an attribution, the investigator must then decide whether the AE qualifies for expedited reporting to CTEP. Although all AEs that require reporting as outlined by a given protocol should be documented on CRFs and submitted periodically to the data center, those that meet CTEP's requirement for expedited reporting should also be submitted to CTEP's Adverse Event Expedited Reporting System (AdEERS).
TABLE 17.3(A)
For Trials under CTEP IND: AdEERS Expedited Reporting Requirements for Adverse Events That Occur within 30 Days(1) of the Last Dose of the Investigational Agent: Phase 1 Trials.*

Attribution unrelated or unlikely:
· Grade 1 (unexpected or expected): Not Required
· Grade 2 (unexpected or expected): Not Required
· Grade 3 unexpected, with hospitalization: 10 Calendar Days
· Grade 3 unexpected, without hospitalization: Not Required
· Grade 3 expected, with hospitalization: 10 Calendar Days
· Grade 3 expected, without hospitalization: Not Required
· Grades 4 & 5(2) (unexpected or expected): 24-Hour; 5 Calendar Days

Attribution possible, probable, or definite:
· Grade 1 (unexpected or expected): Not Required
· Grade 2 unexpected: 10 Calendar Days
· Grade 2 expected: Not Required
· Grade 3 unexpected (with or without hospitalization): 24-Hour; 5 Calendar Days
· Grade 3 expected, with hospitalization: 10 Calendar Days
· Grade 3 expected, without hospitalization: Not Required
· Grades 4 & 5(2) (unexpected or expected): 24-Hour; 5 Calendar Days

(1) Adverse events with attribution of possible, probable, or definite that occur more than 30 days after the last dose of treatment with an agent under a CTEP IND require reporting as follows: AdEERS 24-hour notification followed by a complete report within 5 calendar days for (a) Grade 3 unexpected events with hospitalization or prolongation of hospitalization, (b) Grade 4 unexpected events, and (c) Grade 5 expected and unexpected events.
(2) Although an AdEERS 24-hour notification is not required for death clearly related to progressive disease, a full report is required as outlined in the table.
December 15, 2004
* For updates on the guidelines, refer to the CTEP Web site, www.CTEP.Cancer.gov.
TABLE 17.3(B)
For Trials under CTEP IND: AdEERS Expedited Reporting Requirements for Adverse Events That Occur within 30 Days(1) of the Last Dose of the Investigational Agent: Phase 2 and 3 Trials.*

Attribution unrelated or unlikely:
· Grade 1 (unexpected or expected): Not Required
· Grade 2 (unexpected or expected): Not Required
· Grade 3 unexpected, with hospitalization: 10 Calendar Days
· Grade 3 unexpected, without hospitalization: Not Required
· Grade 3 expected, with hospitalization: 10 Calendar Days
· Grade 3 expected, without hospitalization: Not Required
· Grades 4 & 5(2) (unexpected or expected): 10 Calendar Days

Attribution possible, probable, or definite:
· Grade 1 (unexpected or expected): Not Required
· Grade 2 unexpected: 10 Calendar Days
· Grade 2 expected: Not Required
· Grade 3 unexpected (with or without hospitalization): 10 Calendar Days
· Grade 3 expected, with hospitalization: 10 Calendar Days
· Grade 3 expected, without hospitalization: Not Required
· Grades 4 & 5(2) unexpected: 24-Hour; 5 Calendar Days
· Grades 4 & 5(2) expected: 10 Calendar Days

(1) Adverse events with attribution of possible, probable, or definite that occur more than 30 days after the last dose of treatment with an agent under a CTEP IND require reporting as follows: AdEERS 24-hour notification followed by a complete report within 5 calendar days for Grade 4 and Grade 5 unexpected events; AdEERS 10 calendar day report for Grade 3 unexpected events with hospitalization or prolongation of hospitalization and for Grade 5 expected events.
(2) Although an AdEERS 24-hour notification is not required for death clearly related to progressive disease, a full report is required as outlined in the table.
December 15, 2004
* For updates on the guidelines, refer to the CTEP Web site, www.CTEP.Cancer.gov.
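The grade/attribution/expectedness logic of Table 17.3(B) can be sketched as a lookup. This encoding is illustrative only and reflects the December 2004 version of the table reproduced here; the current CTEP guidelines always supersede any such encoding:

```python
def adeers_requirement_phase2_3(grade, attribution_code, expected, hospitalized):
    """Illustrative lookup of Table 17.3(B): AdEERS expedited-reporting
    requirement for an AE within 30 days of the last dose on a phase 2/3
    trial under a CTEP IND. Returns a deadline string, or None if only
    routine reporting is required. attribution_code uses Table 17.1:
    1-2 = unrelated/unlikely; 3-5 = possible/probable/definite."""
    related = attribution_code >= 3
    if grade == 1:
        return None
    if grade == 2:
        return "10 calendar days" if related and not expected else None
    if grade == 3:
        if not expected:
            # Unexpected grade 3: report if hospitalized (any attribution)
            # or if attributed to the agent (any hospitalization status)
            return "10 calendar days" if (hospitalized or related) else None
        return "10 calendar days" if hospitalized else None
    # Grades 4 & 5 (24-hour notification is not required for death clearly
    # related to progressive disease; see table footnote 2)
    if related and not expected:
        return "24-hour notification; complete report in 5 calendar days"
    return "10 calendar days"
```

Working through a case: an unexpected grade 4 event attributed "possible" triggers the 24-hour notification, while the same event graded 3 without hospitalization only needs a 10-day report.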
Routine AE Reporting through CRFs
Events Collected in AE CRFs
The CRF documents all reportable AEs experienced by a patient, including those AEs that also meet the criteria for expedited reporting.
Data Elements of AE CRFs
All AE CRFs should collect the grade and attribution of the AEs. Depending on the agent and/or the clinical setting under study, additional data elements could include the onset/resolution date of the AE and the consequence of the AE (e.g., dose reduction or discontinuation, or requirement of concomitant medication).
Expedited AE Reporting through AdEERS
Expedited reporting through AdEERS is designed for the investigator to inform NCI CTEP of serious AEs associated with the investigational agents in a timely manner. CTEP uses this mechanism to ensure compliance with the FDA regulation for SAE reporting and to identify unexpected safety signals (in addition to the routine, protocol-specified monitoring plan) that may impact the safety and conduct of the trial. For trials utilizing investigational agents, the reporting guidelines for phase I or phase II/III trials are described in Table 17.3(A) and Table 17.3(B). The timeline for AdEERS submission is defined to ensure timely review by CTEP drug monitors and (when indicated) subsequent reports to the FDA within the FDA-defined SAE reporting timeline. Please note that CTEP's guidelines for expedited reporting (AdEERS) are subject to change to accommodate new regulatory requirements and/or emerging drug development needs. Investigators should refer to the CTEP Web site for the most updated reporting guidelines. Detailed information on CTEP AE reporting and the use of AdEERS can be found at the CTEP Web site (4). All submitted AdEERS reports are reviewed by a designated NCI/CTEP drug monitor. Additional data or source documents may be requested after initial review to clarify the event information.
Based on the drug monitor’s review, NCI/CTEP may update information on the AdEERS report in certain areas, including IND agent attribution, event grade, causal factor, or adverse event term. As noted above, CTEP also uses these reports to make the decision about whether a given SAE meets the criteria for expedited reporting to the FDA. After the NCI/CTEP drug monitor has completed an independent assessment of the event, this evaluation is then communicated to the investigator for his/her reference. The process from submission of an AdEERS report to CTEP assessment and ending with a report to
the FDA and communication with the investigators is depicted in Figure 17.1.
Utilizing AdEERS for Expedited Reporting of SAEs
Determining if an AE Requires an AdEERS Expedited Report
As outlined in Table 17.3(A) and Table 17.3(B), whether an AE requires expedited reporting depends on the expectedness and grade of the AE, its attribution to the investigational agent/regimen, and its temporal relationship to the protocol therapy. A detailed description can be found at: (5). The following is an explanation of certain elements of the guidelines that commonly cause confusion for new investigators.
Expectedness
In CTEP-sponsored trials, expectedness is defined by a list of expected AEs (called the Agent Specific AE List [ASAEL]) within the Comprehensive Adverse Event and Potential Risk List (CAEPR) for the IND agents. The ASAEL comprises a subset of AEs from the comprehensive list that is considered to be expected for the purpose of expedited reporting requirements. As such, "unexpected" AEs are those that are not found in the ASAEL. Of note, not all AEs with known relatedness to the agent are listed in the "expected" ASAEL, because CTEP may deem it necessary to require more stringent reporting for certain known drug toxicities. As an example, reversible posterior leukoencephalopathy syndrome (RPLS), which has been associated with the use of bevacizumab, is included in the CAEPR for the drug but not in the ASAEL, since expedited reporting of this side effect is still considered important by CTEP for its sponsored trials.
Temporal Relationship
In the AdEERS guidelines, the temporal relationship is defined as an event occurring within 30 days of the last study therapy received. However, for certain agents with a prolonged effect on normal organ functions, AEs that develop more than 30 days after the last treatment could also be considered to be related to therapy, and therefore should be reported.
For example, alemtuzumab is known to be associated with lymphopenia and immunosuppression for months and years after cessation of therapy. Therefore, a life-threatening or fatal opportunistic infection occurring 6 months after the last dose of alemtuzumab could potentially be considered related to the agent and would require expedited reporting. In addition, AEs that result in persistent disability as well as those that cause a congenital anomaly or birth defect are to be reported regardless of when they occur, even if beyond 30 days after treatment.
17 REPORTING OF ADVERSE EVENTS
At local sites:
· SAE reporting within 1–10 days of knowledge of the event
· Sites assess AEs based on available information
At the Cooperative Group central office (for cooperative group trials), two parallel processes:
· Maintains the AdEERS database as assessed by investigators (plus an internal process for reconciliation with the CRF database)
· Formal submission of AdEERS reports to CTEP within 10 days (sometimes after initial review by the group's central office)
At CTEP:
· Drug monitors assess AdEERS reports and may request additional information (e.g., discharge summary, radiographic reports)
· AdEERS events as assessed by CTEP form the CTEP-assessed AdEERS database maintained by CTEP
· CTEP provides feedback of CTEP's assessment to the sites
Reporting of the AdEERS events to the FDA:
· All AdEERS events in the CTEP-assessed AdEERS database are reported to the FDA in the IND annual reports.
· For the subset of AdEERS events that meet the FDA criteria for expedited reporting (serious, unexpected, and related AEs), CTEP is responsible for submitting the expedited IND Safety Letter. All IND Safety Letters are also distributed to investigators of CTEP-funded or CTEP-sponsored trials using the same investigational agent.
FIGURE 17.1 Process of AdEERS reports from submission, to CTEP assessment and to reporting to regulatory authorities.
Key Elements of an SAE Report

General

An accurate and adequately informative SAE report is essential for independent assessment by the sponsor, who does not have first-hand knowledge of the patient and yet must make timely judgments concerning the conduct of the trial and administration of the agent. As with the CRFs in routine reporting, the key components of the AdEERS report are the AE terms, grades, and attributions. Mandatory data fields in the AdEERS report also include patient characteristics (disease stage, prior therapies, concomitant medications, etc.), protocol information, and treatment administered, as well as details of the AE such as onset/resolution dates and attribution. In addition, the investigators are asked to provide a narrative description of the event as well as pertinent supplemental documents such as a discharge summary, laboratory results, and radiographic reports; the purpose of such information is to enable the sponsor to ensure accurate coding and grading of the AEs and to independently assess the attributions. Also essential is
updating the AE reports with additional information as it becomes available.

Appropriate Selection of the AE Terms and Grades

Selection of the terms and grades of an AE is based on the CTCAE dictionary, which is available in paper form and through Web access. CTEP also provides online tools to help investigators search for the appropriate terms using key words and an index (6). In addition to locating the right terms in the CTCAE dictionary, appropriate selection of the AE terms also requires an appropriate understanding of the clinical picture. If known, the AE terms best representing the underlying cause of the symptoms should be included in the AE list of the report. For example, if a patient was admitted to the hospital for severe shortness of breath and clinical evaluation revealed findings consistent with a diagnosis of congestive heart failure, simply reporting Grade 4 dyspnea would be inadequate. Unfortunately, it is not uncommon to receive a
ONCOLOGY CLINICAL TRIALS
report including only nonspecific symptoms even when a definitive diagnosis was available. Sometimes a clear etiology may not be available at the time of initial submission of the SAE report. In that case, every effort should be made by the investigators to obtain additional information, and the SAE report should be appropriately amended based on any new information. On the other hand, if a presumed diagnosis or etiology does not adequately explain the clinical picture, the AE terms should retain the major clinical presentations. Sometimes the reporting physician may not be directly involved in the work-up and management of the SAE being reported and must rely on outside medical records to complete the report. The investigators should use their best clinical judgment to decide which AE terms would best capture the events experienced by the patient. One example is a patient who developed bilateral pulmonary interstitial infiltrates with a low-grade fever and was diagnosed with infection even though microbiological studies were not obtained or were negative. In this case, despite the working diagnosis of infection (pneumonia), the clinician cannot definitively ascertain whether the pulmonary findings are truly due to infection or to other causes such as the protocol therapy. In order to capture pneumonitis for safety analysis, it may be more prudent to report “Pneumonitis/pulmonary infiltrate” (rather than “Infection—pneumonia”) with infection and treatment listed as possible causes.

Assessment of the Attributions of an AE

By definition, AEs reported in oncology trials have a close temporal relationship to the study therapy. However, assessing the attribution of each AE to treatment versus tumor versus comorbid conditions can often be challenging. For AEs that are characteristic of the agent (e.g., neutropenia, hand-foot syndrome, infusion reactions) or that wax and wane with each treatment cycle, attribution to the drug is usually obvious.
For AEs that overlap with common noncancer diagnoses (e.g., myocardial infarction) or tumor-related complications (e.g., perforation or bleeding at the tumor site), the precise contribution of the study drug often cannot be ascertained in an individual patient, and the investigator must use his or her judgment based on the patient's underlying risk factors for the specific events, other agents in the protocol regimen, concomitant medications, and the temporal relationship to the study regimen. Given the inherent subjectivity and limitations in the assessment of individual cases, association of such AEs with an experimental agent commonly requires
knowledge of the cumulative rate of the specific event within and across trials in comparison to what is observed in historical or concurrent controls. Thus, the most important tasks for the reporting physician are accurate coding and grading of the specific events and provision of relevant supporting documents that may assist in the independent assessment of events within and across individual cases.

Narratives and Supplemental Documents

The narrative description of the AE is an electronically accessible and integral part of the AdEERS report and should represent a brief summary of the event. A high-quality narrative summary can considerably facilitate the reviewer's assessment of the seriousness and attribution of the SAE and reduce subsequent queries for additional information. Depending on the nature of the event, the narrative may include the context of the SAE (e.g., the patient's baseline condition and relevant comorbidities), diagnostic studies, medical interventions, and the patient's current status. A narrative that simply repeats the AE terms (e.g., “The patient was admitted for confusion”) is not acceptable for adequate assessment of the event. If additional information is not available at the time of report submission, every effort should be made to update the report in a timely manner. The extent of supplemental documentation with an SAE report depends on the complexity and nature of the event. It is generally recommended that the site submit the admission and discharge notes if the patient was admitted for the SAE. For SAEs possibly related to tumor progression (e.g., severe fatigue or liver function abnormalities), documentation of the extent of tumor involvement is critical to the accurate assessment of whether the event was a consequence of the therapy or of the cancer.
INTEGRATING THE AE DATABASE FROM ROUTINE AND EXPEDITED REPORTING

Given the two mechanisms of AE reporting, at least two AE datasets exist for a given trial (Table 17.4). Different groups and investigators may have differing rules for how the AE databases should be reconciled and integrated. The primary goal of this process is to provide a comprehensive listing of the AEs for analysis of the trial and reporting of the results. At a minimum, the investigator should ensure that all SAEs reported through the expedited reporting mechanisms are included in the
TABLE 17.4
Mechanisms of AE Reporting for NCI-Sponsored Clinical Trials with Investigational Agents Under CTEP IND.

Reporting through AE CRFs:
· Inclusive of all AEs required by the protocol for reporting, including those also reported to AdEERS

Reporting through AdEERS:
· Expedited reporting for unexpected or serious AEs that meet the agent/protocol-specific criteria for AdEERS reporting
· CTEP maintains the database for investigator-assessed AdEERS reports as submitted to CTEP and AdEERS reports as assessed by the CTEP drug monitors

Contributions of the two databases to the final AE dataset of the trial:
· The AE CRFs serve as the primary database for groups' reporting of the trial outcome.
· Groups/investigators should have ongoing standard procedures to reconcile the AE CRFs and the submitted AdEERS reports to ensure that SAEs reported through AdEERS are reflected in the overall AE database.
· CTEP assessments that are provided to the investigators can serve as a supplemental/reference AE dataset for the investigators to double-check the completeness/accuracy of their own database and for a more comprehensive understanding of the AE profile.
master AE database. In addition, CTEP's assessment of the AdEERS events, which is communicated back to the investigator, may be used as a reference to reconcile the submitted reports with the master database and to ensure accuracy.
CONCLUSION

In summary, accurate documentation of AEs and adequate, timely reporting are required of investigators involved in clinical trials by federal and/or international regulations. Clinicians and investigators must be familiar with their responsibilities for AE reporting to the IND sponsors and/or the regulatory authorities. To ensure accurate documentation of the AEs, clinicians should be closely involved in the process of translating clinical symptoms/signs or laboratory abnormalities into the appropriate AE terms, grades,
and attributions. Finally, to compile a comprehensive AE profile for safety monitoring, analysis, and publication, all collected AE data should be thoughtfully considered and integrated, including information from routine and expedited AE databases.
References
1. Information on MedDRA: http://www.meddramsso.com
2. FDA reporting guidelines: Title 21, Code of Federal Regulations, Part 312.
3. FDA MedWatch reporting: http://www.fda.gov/MedWatch/report/instruc.htm
4. Details on AdEERS: http://ctep.cancer.gov/reporting/ctc.html and http://ctep.cancer.gov/reporting/adeers.html
5. Detailed description of CTEP guidelines for adverse event reporting: http://ctep.cancer.gov/reporting/newadverse_2006.pdf
6. CTCAE v3.0 Document Search, CTCAE Dictionary and Index, and CTCAE Online Instructions and Guidelines: http://ctep.cancer.gov/reporting/ctc_v30.html
18
Toxicity Monitoring: Why, What, When?
A. Dimitrios Colevas
The purpose of gathering information about outcomes associated with medical interventions is to develop an understanding of overall patient outcome. This means one must weigh the benefits of an intervention, such as extension of life, improvement of quality of life, and reduction in symptoms, against the toxicity or side effects of the particular intervention. It is by balancing these positive and negative aspects of an intervention that one can derive a therapeutic index (T.I.) for a medical intervention. Why do members of the health-care profession want to know about toxicity? The reasons depend largely upon our roles (see Table 18.1). A clinician wants to know what to expect from a drug, so he or she can predict side effects, be on the lookout for them, and educate the patient concerning the likelihood and severity of those effects. For example, the physician prescribing an ACE inhibitor will be aware of the possibility of alteration in renal function or angioedema, and therefore will consider drug modification should these side effects occur. Knowledge of a population-based side effect profile may also contribute to a physician's choice among options to treat a certain condition. Bupropion's association with seizures, for example, while rare, might influence the choice of an antidepressant for a patient with a seizure history. For the clinical trialist testing a drug in humans for the first time, knowledge of the overall side effect
profile is of interest, even if the effects are clinically inconsequential. Low-grade laboratory abnormalities may be useful in planning dose escalation studies and may highlight the possibility of more severe toxicities in other patients treated at the same or higher doses, and therefore will be useful in directing proactive surveillance for that particular side effect. As an example of how a hospital might use AE data for quality and safety assurance, if a particular device such as a surgical stapler were associated with an unacceptable risk of infection, the process by which the device is sterilized and used could be explored to find the root cause of the risk, which may not be inherent in the device. While many drugs are used on the basis of benefit established through years of use, in order for a pharmaceutical company to license and market a drug, that drug must undergo a series of preclinical and clinical evaluations which are then submitted to a governmental regulatory agency for consideration for approval. In the United States, this process is conducted through the Food and Drug Administration (FDA) via a formal drug approval application process (1). It is the regulations governing this process that largely determine the toxicity information of interest to pharmaceutical companies and governmental regulatory agencies. This drug approval process is divided into two components. The first is the Investigational New Drug Application (IND). It is through the IND that a sponsor can
TABLE 18.1
Why We Are Interested in Toxicity Evaluation.

Clinician:
· Individual therapeutic index evaluation
· Drug dose modification
· Choice of drug
· Highlights what to monitor actively in a patient on treatment

Clinical trialist/drug developer:
· To determine AE profile of new agent
· Safe dosing for subsequent clinical trials
· Determination of absolute T.I.
· Determination of relative T.I. vs. another treatment

Health-care institutions and organizations:
· Quality and safety assurance

Pharmaceutical industry:
· Similar issues to clinical trialist
· Necessity for licensing
· Postmarketing surveillance

Governmental regulators:
· Is the drug “safe and effective”?

General public:
· Assurance that drug is safe and effective
· Education

request authorization from the FDA to conduct clinical trials in patients using an investigational (not commercially approved) agent. The IND covers authorization for human testing ranging from small first-in-human studies through large definitive phase III clinical trials. The purpose of an IND is to provide regulatory oversight for investigational testing in humans; it has no direct or implied governance concerning drug licensing and approval. An IND can be held by an individual investigator or an organization such as an academic cancer center, the National Cancer Institute (NCI), or a pharmaceutical company. The second part of the drug approval process in the United States involves a New Drug Application (NDA) to the FDA. The NDA is a formal request for approval to market a drug commercially in the United States. The NDA is typically filed by a pharmaceutical company after promising results from trials conducted under an IND lead the company to believe that the FDA will grant approval after reviewing the NDA and deciding that the drug is in fact safe and effective. Approval is limited to marketing an agent for the specific indication and patient population requested. For example, cetuximab is presently approved by the FDA for use in patients with head and neck cancer or colorectal cancer. Details concerning the agreement between the FDA and a drug manufacturer concerning approved use of that agent, and the background for that agreement, are summarized in the drug's package insert (2). Because the package insert is a regulatory document, it does not address all the known potentially beneficial uses of a drug, nor all of the known
side effects. However, once a drug is approved for marketing for one indication in the United States, FDA regulations permit physicians to prescribe approved drugs for uses other than those cited in the package insert. This off-label use is determined by physician judgment and in the past has been largely unfettered. Recently, with extremely expensive drugs, off-label use has effectively come under the control of insurance companies, whose authorization for coverage is the determining factor for patient access to these drugs. Understanding the IND and NDA processes facilitates understanding the regulatory requirements of AE collection for clinical trials conducted under an IND. The rules governing this process are often misrepresented. Every clinical trialist in the United States should become familiar with the relevant federal code governing INDs, which can be found at: http://ecfr.gpoaccess.gov/cgi/t/text/text-idx?c=ecfr&sid=989651683bfd9ebd47fbb7e1ccc167eb&rgn=div5&view=text&node=21:5.0.1.1.3&idno=21#21:5.0.1.1.3.2.1.1. CFR Title 21, Part 312, Sections 32 and 33, which deal with requirements for IND safety reports and annual reports, are particularly relevant.
DEFINITIONS

A common definition of toxicity is “the degree to which something can be considered poisonous or harmful” (3). Within this definition is the implication of causality linking a particular intervention (drug,
surgery, radiation, etc.) with an unfavorable outcome. Generations of clinical investigators have found that toxicity, while used in the vernacular by clinicians, is not an ideal descriptor when studying new treatments. Toxicity carries the presumption of causality and therefore blame, which can lead to underreporting. Also, many events associated with medical interventions which are of interest to clinicians and clinical investigators may not be toxic per se. For example, a particular lab value outside of a reported normal range may be of interest to a clinical investigator, drug company, or regulator, but may be completely harmless to the patient. As a result of this understanding, international efforts to precisely define terms relating to the positive and negative outcomes of medical interventions have led to the use of AE rather than toxicity or side effect to describe negative outcomes associated with a therapeutic intervention. The NCI’s Cancer Therapy Evaluation Program (CTEP) has recently acknowledged this issue by changing the title of their AE dictionary from the Common Toxicity Criteria (CTC) to the Common Terminology Criteria for Adverse Events (CTCAE) (4).
This chapter will focus on clinical trials relating specifically to new drug development in oncology.

A Background on Terminology Development

In 1990, representatives of pharmaceutical manufacturers and governmental regulatory agencies from Europe, Japan, and the United States met in Brussels to plan an international conference on the harmonization of regulatory requirements for the approval of new drugs. The International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) was a direct outgrowth of a prior European Community effort in the preceding decade to respond to drug companies' requests for a more consistent regulatory process, so as to avoid unnecessary duplication of efforts to satisfy divergent regulatory processes. One of the initial efforts of the ICH was to formalize definitions relevant to pharmaceutical development. These definitions are now widely accepted and used by virtually all national regulatory agencies, including the FDA (5). The ICH definitions for adverse event (AE), adverse drug reaction (ADR), and other relevant terms are listed in Table 18.2.
TABLE 18.2
ICH Definitions for Adverse Events (8).

Adverse Event (or Adverse Experience): Any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have to have a causal relationship with this treatment.

Adverse Drug Reaction: In pre-approval clinical experience: all noxious and unintended responses to a medicinal product related to any dose should be considered ADRs. In marketed medicinal products: a response to a drug which is noxious and unintended and which occurs at doses normally used in man for prophylaxis, diagnosis, or therapy of disease or for modification of physiological function.

Unexpected Adverse Drug Reaction: An adverse reaction, the nature or severity of which is not consistent with the applicable product information.

Serious Adverse Event (SAE) or Serious Adverse Drug Reaction: An AE that at any dose:
· results in death,
· is life-threatening,
· requires inpatient hospitalization or prolongation of existing hospitalization,
· results in persistent or significant disability/incapacity, or
· is a congenital anomaly/birth defect.
The ICH carefully crafted the AE definition to be inclusive of occurrences without respect to causality or temporality. They also intentionally chose the term untoward, leaving open the possibility of situation-specific interpretation of an event. On the one hand, this vagueness can be very useful when considering unusual circumstances, such as a suicide attempt after a patient has heard about a possible medical intervention but prior to administration of the medicine. On the other hand, the vagueness of the AE definition leaves one wanting another term for the more commonly held definition of toxicity. For this, the ICH chose the term ADR. In the case of an ADR, a causal and temporal relationship is not merely implied—it is explicit. Furthermore, it is not merely an event of questionable clinical relevance. Rather, it is noxious (toxic) and unintended (a side effect). In general, ADRs can be thought of as the subset of AEs related to a drug which have been analyzed by clinical investigators, drug companies, and governmental regulators and determined to be: (1) caused by the drug, (2) noxious, and (3) unintended. Early in the clinical development of investigational agents there can be a great deal of subjectivity concerning which AEs qualify to be designated as ADRs. Often the designation of an AE as an ADR comes only via statistical analysis of a large number of patients treated with an agent. Occasionally, such an ADR is discovered only after a drug has been approved and used by many thousands of patients. A well-known recent example is the increase in cardiovascular events associated with the use of some nonsteroidal anti-inflammatory drugs (NSAIDs) (6).
Conversely, when a drug or drug class is thought to be associated with an increased risk of a particular ADR, it is often necessary to collect disease- and patient-specific characteristics across many clinical trials and in the postmarketing setting in order to determine whether the relative risk of a particular ADR is disease or even disease-stage specific, or exists at all. The emerging data concerning the risk of pulmonary hemorrhage events associated with bevacizumab illustrate this point well (7). In both of the preceding examples it may be impossible to know for any specific patient whether an AE represents an ADR. For example, if an elderly patient with a history of coronary artery disease experiences a myocardial infarction while being treated with an NSAID, or a patient with a lung squamous cell carcinoma experiences a pulmonary hemorrhage while being treated with bevacizumab, it is not possible in either of these individual events to determine whether it was drug related. For previously approved drugs, an ADR applies only when standard accepted doses are administered, specifically excluding overdose-associated AEs
and AEs experienced in a clinical setting outside of the situation in which the drug is normally used. An ADR in the setting of an investigational agent applies to all noxious and unintended responses without regard to dose, because in that setting there is no standard accepted dose. It is important to be able to differentiate between AEs and ADRs in this setting because it is the ADRs which ultimately are used to define acceptable doses for an investigational agent.

AE Terminology Dictionaries

AE reporting lexicons for use in clinical trials have evolved along two major pathways, each with its own strengths and weaknesses. One approach has been the development of the Medical Dictionary for Regulatory Activities (MedDRA). This is a medically valid terminology developed initially by the ICH primarily for regulatory purposes. European and Japanese regulatory agencies mandate that MedDRA be used for safety reporting of AEs. The FDA does not require MedDRA for safety reporting, but it has become the standard for AE reporting by the pharmaceutical industry in the United States. MedDRA is owned by a federation of pharmaceutical companies and is maintained and distributed by the Northrop Grumman Corporation on a contractual basis for the pharmaceutical industry. MedDRA is a huge dictionary of medical terminology containing a subset of AE terms grouped by organ system. Within MedDRA there is also a hierarchy of terms, including high level group terms, high level terms, preferred terms, and lowest level terms. An example of this hierarchy, in the order cited, would be: bacterial infectious disorders, staphylococcal infections, staphylococcal meningitis, and staphylococcus epidermidis meningitis. One advantage of MedDRA is the extremely fine granularity of terminology and the large number of terms to choose from.
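The four-level hierarchy in the staphylococcal meningitis example above can be sketched as a simple nested record. This is a toy structure for illustration only: the real MedDRA dictionary is a licensed, numerically coded data set, not plain strings.

```python
# Toy sketch of the MedDRA term hierarchy cited in the text, from the
# broadest level down to the most specific. The real dictionary is a
# licensed, coded data set; this plain-string mapping only illustrates
# the four levels named in the chapter.
meddra_hierarchy_example = {
    "high level group term": "bacterial infectious disorders",
    "high level term": "staphylococcal infections",
    "preferred term": "staphylococcal meningitis",
    "lowest level term": "staphylococcus epidermidis meningitis",
}

# Walking from the broadest to the most specific level:
for level, term in meddra_hierarchy_example.items():
    print(f"{level}: {term}")
```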
Another advantage is the rigorous curation of MedDRA, so that there is no redundancy or fusion of two concepts into one term (e.g., febrile neutropenia = fever + neutropenia, two distinct concepts). Because MedDRA is updated twice annually, subscriber change requests to add new AEs can be incorporated very early in the life cycle of a new drug with a unique AE profile. Disadvantages of MedDRA include that it is not an open-source dictionary: only subscribers to MedDRA have access to it. This means that most academic clinical trialists cannot practically conduct clinical research using MedDRA as the AE dictionary for investigator-initiated trials, because most cancer centers do
not license it. Furthermore, because of the complexity of MedDRA, many clinical trial data managers are not sufficiently educated in its use to allow for primary AE coding in MedDRA terminology by the clinical trial personnel closest to the patient. Finally, and most importantly, there is no system within MedDRA for severity assessment of AEs, and therefore MedDRA itself cannot be used as a tool to capture one of the most important attributes of an AE—its severity. Another approach has been the NCI's development of a relatively limited, oncology-specific AE dictionary, the CTCAE (formerly the CTC) (9). The present version of the CTCAE represents a synthesis of multiple prior AE reporting terminology systems for oncology clinical trials, including several systems used primarily by radiation oncologists and medical oncologists. It also includes terminology essential to surgical and pediatric oncology clinical researchers (10, 11, 12, 13). The CTCAE is oncology specific, developed to a great extent by and for medical oncologist clinical researchers. It is the universally accepted AE terminology used for reporting clinical trial outcomes in the peer-reviewed literature and at medical meetings worldwide. The FDA accepts regulatory submissions using the CTCAE system. The present version (CTCAE v. 3) is a lexicon of approximately 370 terms organized into 28 categories which are largely defined by organ systems. Each of
these terms is subdivided into five grades of severity. In most cases, the methodology for grade assignment is specific to the particular AE (see Table 18.3). The severity of a grade does not necessarily imply seriousness. For example, in the case of neutrophils, the grade determination is based solely on the neutrophil count, not on any clinically relevant sequelae of that count. The advantages of the CTCAE system are largely benefits to the clinical trialist, practicing physician, and data manager. The tool is widely distributed, easily carried in printed format, and uses the language of oncology. It is also universally available as a Web-based open-source document (14). The major disadvantage of the CTCAE is that it is not accepted outside of the United States as an AE terminology for regulatory purposes and does not offer the subtle shades of AE distinction available in MedDRA. Furthermore, the updating cycle of the CTCAE is on the order of years and is not standardized. Other disadvantages include the lack of rigorous curation of AE grading, so that severity and terminology are not used consistently across grades and categories of AEs within the CTCAE. Most clinical trialists need not choose between MedDRA and the CTCAE when conducting a clinical trial. Because the former is available only to subscribers, it is largely invisible to the clinical trialist as
TABLE 18.3
CTCAE Grading Scale Examples.

Grade 0: Generic: no AE or within normal limits.
Grade 1: Generic: mild. Nausea: loss of appetite without alteration in eating habits.
Grade 2: Generic: moderate. Nausea: oral intake decreased without significant weight loss, dehydration, or malnutrition; IV fluids indicated <24 hrs. Granulocytes: <1500–1000/mm³ (<1.5–1.0 × 10⁹/L).
Grade 3: Generic: severe and undesirable. Nausea: inadequate oral caloric or fluid intake; IV fluids, tube feedings, or TPN indicated ≥24 hrs. Granulocytes: <1000–500/mm³ (<1.0–0.5 × 10⁹/L).
Grade 4: Generic: life-threatening or disabling. Nausea: life-threatening consequences. Granulocytes: <500/mm³ (<0.5 × 10⁹/L).
Grade 5: Generic: death related to AE.
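Using the granulocyte thresholds from Table 18.3, the count-only grade assignment described in the text can be sketched as a simple lookup. This is a simplified illustration: the full CTCAE defines Grade 1 relative to the institutional lower limit of normal, so the `lower_limit_normal` parameter and its default value here are assumptions, not part of the table.

```python
# Simplified sketch of count-only granulocyte grading using the thresholds
# shown in Table 18.3 (counts in cells/mm^3). The lower limit of normal
# used for Grade 1 is institution-specific; 2000/mm^3 is an assumed value.

def granulocyte_grade(count_per_mm3, lower_limit_normal=2000):
    if count_per_mm3 < 500:
        return 4   # life-threatening range
    if count_per_mm3 < 1000:
        return 3   # severe
    if count_per_mm3 < 1500:
        return 2   # moderate
    if count_per_mm3 < lower_limit_normal:
        return 1   # mild: below normal but above the Table 18.3 cutoffs
    return 0       # within normal limits

print(granulocyte_grade(450))   # 4
print(granulocyte_grade(1200))  # 2
print(granulocyte_grade(2500))  # 0
```

Note how the grade reflects only the count and not any clinical sequelae, which is precisely the severity-versus-seriousness distinction made in the text.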
an AE terminology lexicon. Instead, for virtually all clinical trials, investigators capture AEs either using verbatim descriptions or, particularly in investigator-initiated and government-sponsored trials, using the CTCAE. With either approach, the verbatim AE description or CTCAE term is translated into MedDRA terminology by personnel of the sponsoring clinical research organization (CRO) or the drug company's safety monitoring department. In the case of CTCAE-to-MedDRA translation, the conversion is facilitated by a frequently updated, publicly available mapping between the two terminology systems available on the Cancer Therapy Evaluation Program Web site (14). There are collaborative efforts underway between the MedDRA maintenance organization and the NCI to improve one-to-one mapping between the two terminologies and to facilitate the use of a CTCAE-type grading scheme when using MedDRA as the AE dictionary. The translation of verbatim AE descriptions from a clinical site into MedDRA by drug company or CRO personnel distant from the patient for regulatory purposes is more problematic because of questions concerning the fidelity of this translation and the possible introduction of bias. It is worth noting that these translations generally are not approved by either the verbatim AE submitter or the responsible clinical investigator. On the other hand, it is also well recognized that many of the data managers at clinical trial sites are suboptimally trained to document and code AEs using any terminology system. This inadequate training probably contributes to the lack of agreement between what regulatory sponsors have captured in their databases and what is reported in the medical literature concerning AEs from the same clinical trial (15).
AE Collection Methods for Clinical Trials

The most common method of collecting AEs is for a member of the clinical trial team (usually a physician or nurse, or their designees) to interview the patient and then capture AEs based upon that interview, the physical exam, and additional laboratory and radiological tests. Sometimes the AE collection process occurs in sequence with standard medical care; at other times it is done in parallel. Typically, during the normal clinic visit, AE data will be elicited from the patient and then incorporated into the medical record. Subsequently, the AE data will be extracted from the medical record by clinical trial personnel at the clinical site and entered onto a patient case report form (paper or electronic), which is then further coded into a format acceptable to regulatory authorities (see the discussion above on verbatim versus CTCAE translation into MedDRA). At every step of this
process, there is the opportunity for corruption of data through interpretation further and further away from the truth of the patient experience. Unfortunately, this system, because it is passive and not directive, can lead to a variable collection rate of AEs. It is also subject to a lack of standardization when converting from the medical record vernacular into AE terminology such as the CTCAE, both in terms of AE description and severity. For example, a care provider might dictate into a clinical note that a patient had severe nausea. One data manager might interpret that as grade 3 nausea because of the descriptor "severe," yet another might grade it as grade 1 because there was no documentation of an alteration in oral intake, a critical part of nausea grading in the CTCAE. Furthermore, depending upon the circumstances, it is possible that the term nausea itself might be incorrect, in that the patient may have been describing queasiness and wooziness that were the result of dehydration from diarrhea, and therefore not nausea at all. An alternative strategy is to capture subjective AE data directly from the patient using guided survey methodology and then translate those data into acceptable AE terminology before transmission to the clinical record, institutional research database, and regulatory sponsor database. While this process could be optimized by direct electronic data capture from the patient, that is neither necessary nor practical, for few institutions are set up to gather data in this way. Another practical approach now in use is the Radiation Therapy Oncology Group (RTOG) AE grading tool, which can be filled out to a large extent by the patient and reviewed by the clinical researcher at follow-up visits. This allows for capture of many subjective AEs, along with the grading of the AE, directly from the patient. An example of one such RTOG AE grading tool can be seen at http://www.rtog.org/members/protocols/0522/AEGradingTool.html (16).
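The nausea example shows why grading logic should be explicit. A sketch of CTCAE v3.0-style nausea grading (criteria paraphrased from the published scale; the function name and boolean inputs are illustrative) makes the dependence on oral-intake documentation concrete:

```python
def grade_nausea(loss_of_appetite: bool,
                 decreased_oral_intake: bool,
                 inadequate_caloric_or_fluid_intake: bool) -> int:
    """Grade nausea in the spirit of CTCAE v3.0 (criteria paraphrased).

    The grade turns on alteration of oral intake, so a chart note that
    says only "severe nausea" cannot be graded reproducibly.
    """
    if inadequate_caloric_or_fluid_intake:
        return 3  # inadequate oral caloric or fluid intake
    if decreased_oral_intake:
        return 2  # decreased intake without significant weight loss
    if loss_of_appetite:
        return 1  # loss of appetite without altered eating habits
    return 0      # no event
```

Two data managers reading the same "severe nausea" note could justify grade 1 or grade 3 depending on what intake information they infer; an explicit rule forces the missing datum to be collected rather than guessed.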
Investigators at Sloan Kettering Cancer Center have piloted the use of a Web-based patient AE reporting system for patients being treated with radiation or chemotherapy (17–19). These investigators have found that while patient use of such systems is variable (high for females with gynecological malignancies, low for males with lung cancer), it allowed for unique AE evaluation and management strategies. For example, a Web-based patient AE reporting system which alerted physicians in real time via email about grade 3 and 4 AEs resulted in a substantial number of clinical actions in response to these notifications. A randomized controlled trial comparing clinical outcomes of patients with access to Web-based AE reporting with those being managed routinely is underway.
18 TOXICITY MONITORING: WHY, WHAT, WHEN?
The gold standard for AE collection methodology is not well established, but several principles are worth discussing. The more subjective the AE, the more important it is to get as close as possible to the ideal patient-reported outcome. For a more analytic AE, such as a serum chemistry value, it probably makes very little difference when and by whom that AE is captured. AEs can be divided into four domains: analytic, objective, clinician evaluated, and patient reported (20). As one moves from the former to the latter, there is an increasing risk of loss of fidelity in translation in the steps illustrated in Figure 18.1(A). Therefore, for the less objective AEs, capturing the information in its final form as close as possible to the actual patient experience is desirable; see Figure 18.1(B). Patient-reported outcome (PRO) methodology is an evolving science. It is extremely labor-intensive, and it is not clear in which phases of clinical trials PROs are most useful (21).
AE Collection Methodology for Patients on a Clinical Trial

Phase I Trials

For first-in-human or novel phase I combination studies, there should be a prospective plan to actively interrogate patients and laboratory results for suspected subjective and analytic AEs, based on preclinical data and the mechanism of action of an agent (see Table 18.4). For virtually every investigational agent tested in humans, there will be preclinical safety data that have defined not only a toxic dose in at least one mammalian species, but also the organ system or specific AE that defined that toxicity. These data are summarized in the investigator's brochure for an investigational agent. For example, if an agent caused splenic atrophy in animal models, following a trial subject's white blood cell count and differential would be essential. Similarly, animal models with histological evidence of gastrointestinal mucosal damage should direct the clinical investigator to craft an AE monitoring section which accentuates anorexia, dyspepsia, vomiting, diarrhea, abdominal cramping, and other manifestations of gastrointestinal (GI) toxicity. Another area where this has become relevant is in the case of agents with preclinical evidence of possible cardiac conduction system effects. In this case, a well-defined plan for interval EKG or Holter monitoring, for example, might be important. An agent's mechanism of action can also direct an investigator to focus on certain AEs for active query. Agents known to interfere with the mitotic apparatus are likely to have both hematological and GI toxicity, and inhibitors of the epidermal growth factor receptor
[Figure 18.1(A), flow diagram: Patient experiences AE (absolute truth) → clinic visit → clinician interprets patient report of AE → clinician charts AE → medical record report of AE → data manager (DM) extracts verbatim AE from medical record, codes and grades AE into local database language (CTCAE) → site database AE "truth" → data download to sponsor → sponsor coding of AE (MedDRA) → regulatory database AE "truth." Fidelity: are these the same?]
FIGURE 18.1(A) Standard AE collection methodology.
[Figure 18.1(B), flow diagram: Patient experiences AE (absolute truth) → patient reports AE directly into guided AE collection module (analytic/lab-value AEs captured separately) → AE electronically translated into acceptable terminology and grade → clinical trialist reviews AEs → medical record report of AE = site database AE "truth" = regulatory database AE "truth." Fidelity: are these the same?]
FIGURE 18.1(B) Alternative model for AE data collection based on PROs.
are likely to have dermatological toxicity. Therefore, a clinical trial using agents of these classes should include in the AE section guidelines for active ascertainment of the appropriate organ-specific AEs: hematological, GI, and dermatological. Often there are AEs associated with a class of drug for which the specific mechanism of action is not understood. For example, the hand-foot dermatological syndromes associated with sorafenib and sunitinib suggest that hand-foot syndrome AEs should be actively collected for a new drug thought to inhibit a similar spectrum of kinases. The list of AEs for which there is a prospective collection plan should be set up in such a way that, within a given early-phase clinical trial, the list can be readily amended. For example, in a first-in-human drug study, if a high-grade unanticipated AE is seen in the first patient cohort, that AE should be added to the active query list for all future patients, especially if there is no plausible alternative explanation other than an ADR. In order to make active ascertainment of AEs practical, the investigator should be mindful that a large list
TABLE 18.4
High Probability AEs Necessitating Active Ascertainment in Early Clinical Trials
1. AEs based on preclinical toxicology models
2. AEs based on mechanism of agent action
3. AEs from clinical experience with other drugs in class
4. AEs from preliminary clinical experience with the particular IND agent
is not practical. In most cases, a list of 20 or fewer AEs could be listed with grades on a single page, as a protocol appendix, for use as the active AE ascertainment tool for direct patient-reported outcome collection, for example.

Phase II and Beyond

By the time most drugs have undergone phase I testing, there is a considerable amount of data concerning common high-grade AEs. Rigorous AE collection and assessment in subsequent trials continues to be essential in order to better define the frequency and severity spectrum of these known AEs. Just as important is the ascertainment of less frequent and/or less severe AEs, which are largely undiscovered in phase I testing. It is because of the latter issue that, in addition to having a prospective plan for ascertainment of AEs identified in early clinical trials, there must be a plan in place to provide a wide net for collection of rare and low-grade events. It is not practical to develop a specific AE query-driven prospective plan, for there are too many different possible AEs. In this case, a more useful approach is to ensure that there is rapid, broad dissemination of information concerning unanticipated high-grade AEs to all clinical investigators involved with a particular agent, as well as early attribution assessment. A good recent example of this issue was the identification of interstitial lung disease (ILD) in patients treated with erlotinib. In this case, the rapid dissemination of early SAE reports of ILD led to prospective active ascertainment of ILD AEs in subsequent trials, including randomized controlled trials in both non-small-cell lung cancer (NSCLC) and pancreatic cancer. The conclusion that the incidence of erlotinib-associated ILD was low, and in NSCLC in particular
confounded by alternative attributions, allowed for a more balanced interpretation of this ADR in subsequent clinical trials and clinical use.

Practical Considerations and Suggestions for the Clinical Investigator

Default recommendations for AE collection and reporting:
· Collect all serious and severe AEs without regard to attribution
· Collect only intervention-associated low-grade AEs
· Specify in advance the subset of trial-specific AEs of high priority
· Develop a tool to actively investigate high-priority AEs, limited to < 20 specific AEs
· The high-priority AE tool should use patient-reported outcomes for nonanalytic AEs
· Collect AEs at baseline, at every treatment cycle, and after treatment termination
· Ensure the data collection plan distinguishes recurrent high-grade AEs from persistent AEs
· Ensure all AE data collected are reported consistently to the peer-reviewed literature and regulatory authorities
Many novice clinical trialists are intimidated by the process of clinical trial protocol development. Concerns over which AE data are important to collect can be crippling for the novice, especially because reports of clinical trials in the medical literature tend to focus on efficacy evaluation methods rather than the AE evaluation methods used. In most cases it is not necessary to start from scratch. One excellent source is CTEP's Web site (22), where one can find both guidelines and templates for early- and advanced-stage clinical trial protocols. Additionally, most pharmaceutical companies are willing to share their agent-specific protocols or protocol templates with investigators using their agents, even if those investigators are not conducting the studies under that company's sponsorship.
It is important to remember that companies are interested in keeping abreast of all developments concerning their investigational and commercially approved agents, so assisting the independent investigator in the development of protocols, including the AE collection and reporting sections, is in their interest as well. The details of AE collection in clinical trials are largely driven by the AE output sought, which can in general be lumped into two categories: required SAE output for regulatory purposes, and routine/cumulative AE output for toxicity evaluations.
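The high-priority tool from the default recommendations above might be represented as a small, validated list. Everything below (terms, domains, field names) is a hypothetical sketch of such a protocol appendix, not a published instrument:

```python
# Hypothetical active-ascertainment list for an investigational agent with
# expected GI and dermatological toxicity. Real lists are built from the
# sources in Table 18.4 (preclinical toxicology, mechanism of action,
# class effects, early clinical experience).
HIGH_PRIORITY_AES = [
    {"term": "Nausea", "domain": "patient reported"},
    {"term": "Diarrhea", "domain": "patient reported"},
    {"term": "Rash/desquamation", "domain": "clinician evaluated"},
    {"term": "Neutrophil count", "domain": "analytic"},
]

MAX_ACTIVE_AES = 20  # keep the tool to a single protocol-appendix page

def validate_tool(ae_list):
    """Check an active-ascertainment list against the recommendations above.

    Enforces the < 20-item limit and returns the subset captured as
    patient-reported outcomes (the preferred route for nonanalytic AEs).
    """
    if len(ae_list) >= MAX_ACTIVE_AES:
        raise ValueError("active AE list should be limited to < 20 items")
    return [ae["term"] for ae in ae_list if ae["domain"] == "patient reported"]
```

Keeping the list short and machine-checkable makes it easy to amend mid-trial when an unanticipated high-grade AE appears in an early cohort, as recommended in the phase I discussion.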
There is very little latitude concerning the schedule of SAE reporting required by regulatory agencies, but depending upon the prior human experience and other prior knowledge of possible AEs with a drug (see Table 18.4), the investigator can substantially limit which AEs must be reported in an expedited manner as an SAE. In general, regulatory agencies require expedited SAE reports only for events which are both serious and unexpected. Therefore, an investigator, when crafting a protocol, can create a list of expected AEs which do not have to be reported in an expedited manner even if they are considered serious. For example, excluding from expedited SAE reporting episodes of febrile neutropenia for a known bone marrow suppressive agent, or hospitalization for uncomplicated dehydration for a diarrhea-inducing agent, can dramatically reduce the administrative reporting burden on the investigator if these AEs are specifically cited in the protocol AE reporting section as being excepted from SAE reporting. This work reduction is compounded in multi-institutional studies, where SAE reports generated at one site not only require reporting to the FDA but also require investigator and institutional review board (IRB) review at all participating sites. For routine cumulative AE collection and evaluation, there are almost no hard rules concerning the collection and reporting of AEs. In general, all serious and severe AEs should be collected, without regard to attribution. Most investigators using the CTCAE as the AE lexicon interpret this as all grade 3 or higher AEs. For lower-grade AEs, the common practice is to collect only agent-attributed AEs, or ADRs. An obvious problem with this plan is that it assumes that investigators can readily render attributions for a particular AE, which is often not the case.
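The expedited-reporting rule described above (expedited reports only for events that are both serious and unexpected) reduces to a simple check against the protocol's expected-AE list. The list entries are the chapter's own examples; the function itself is an illustrative sketch, not regulatory logic:

```python
# Hypothetical protocol-specified "expected serious AE" list, per the
# examples in the text (marrow-suppressive, diarrhea-inducing agent).
EXPECTED_SAES = {
    "febrile neutropenia",
    "hospitalization for uncomplicated dehydration",
}

def needs_expedited_report(ae_term: str, serious: bool) -> bool:
    """Serious AND unexpected -> expedited SAE report; otherwise routine."""
    return serious and ae_term.lower() not in EXPECTED_SAES
```

In a multi-institutional trial, every event this check screens out also spares investigator and IRB review at all participating sites, which is where the work reduction compounds.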
However, for most life-threatening oncological situations, where there is a plethora of low-grade AEs, collection of all AEs is often impractical and felt to be unnecessary. For more chronic, non-life-threatening illnesses, collection of all low-grade AEs would be more important. The clinical trial design can also determine which AEs are necessary to collect. For a phase I trial using the classical 3 + 3 escalation design or the continual reassessment method based on dose-limiting AEs, strictly speaking only severe or serious AEs need be collected in order to conduct the trial and report a recommended phase II or maximally tolerated dose. For the accelerated titration design, however, because grade 2 AEs are an inherent part of the dose-escalation decision-making process, AE collection must include these low-grade AEs (23). Because most anticancer treatments are administered either episodically or cyclically, the most common practice is to collect AE
data once each cycle. In general, documentation of the presence, severity, and attribution of each AE is adequate, but there are a few specific situations in which the duration of a particular AE is important. In early clinical trials where dose escalation is part of the study, knowing whether a particular AE is transient or prolonged may be essential. When time to resolution or diminution of an AE is relevant, such as in the case of postoperative intestinal dysmotility or dysphagia following neck radiation, attention to the duration of, and change in grade of, a particular AE may be important. Typically, this information is gathered by several planned interval AE assessments, for example at 1 week, 1 month, 3 months, and 1 year following neck irradiation. There are two instances in which cyclical AE assessment as discussed above is not applicable or is problematic. The first is chronic administration of a drug, typically an orally administered agent. The rule of thumb I recommend is for the initial assessment intervals to parallel multiples of the predicted or known pharmacological half-life of the agent. For example, in a first-in-human study, one could consider initial AE assessments (including analytic laboratory assessments) on the order of one, three, and ten half-lives of the investigational agent, with subsequent intervals determined by the AE profile seen in the initial cohort of patients. The second case is when an agent is being administered cyclically and AE assessments are made once per cycle. If an agent is administered once every 3 weeks and a grade 3 AE is reported during every cycle, it cannot be determined whether that AE is persistent or recurrent. Many AE reporting systems, including the NCI routine reporting system, cannot distinguish between these two situations.
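The half-life rule of thumb above translates directly into an assessment schedule. A minimal sketch (the function name and default multiples are illustrative):

```python
def ae_assessment_days(half_life_days, multiples=(1, 3, 10)):
    """Assessment time points (in days) at multiples of the agent's half-life.

    Implements the rule of thumb in the text: schedule initial AE
    assessments at roughly one, three, and ten pharmacological half-lives.
    """
    return [round(m * half_life_days, 1) for m in multiples]
```

For a hypothetical agent with a 1.5-day half-life, this yields assessments at days 1.5, 4.5, and 15; later intervals would then be set by the AE profile seen in the initial cohort.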
It is impractical to document onset and resolution dates for most AEs, but the occurrence of identical or similar AE reports in adjacent cycles should prompt investigators to consider more detailed collection plans for that specific AE in subsequently treated patients. Once a trial has been completed, all AE data should be included in manuscripts. In an era when more and more data are available, there have been attempts to summarize or reduce the amount of AE data reported in the medical literature. Examples of inadequate AE data representation include summary tables that include only AEs occurring above a specific percentage of treated patients, or only treatment-related AEs. The availability of online supplemental dataset repositories with many publications makes it possible to include high-grade, investigational-agent-attributed AEs within the body of the manuscript, with explicit reference to complete supplemental data tables elsewhere, as a way to streamline
manuscript length without omitting full representation of the AE data. Another problematic area is reporting only worst-grade AE-per-patient data. Trotti and colleagues have shown how such a data reduction strategy can cover up important cumulative toxicity differences in radiation treatment trials, for example (24). Finally, it is important to ensure that AE datasets held by different entities for the same clinical trial are consistent. At a minimum, the regulatory sponsor's AE dataset should be compared with the investigator's, and discrepancies resolved prior to publication and final regulatory submission, so that any publication of AE data is consistent with the data submitted to regulatory agencies. Also, since serious ADRs are often evaluated, processed, and submitted to regulatory sponsors and agencies by different personnel (e.g., a clinical nurse and/or physician) from those who gather and submit routine AE data (e.g., clinical trial managers, institutional clinical trials office staff), it is a valuable quality-control check to compare the serious ADR dataset with the routine AE dataset as part of the clinical trial data review and reconciliation process.
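The reconciliation check suggested above can be sketched as a set comparison between the serious-ADR dataset and the routine AE dataset. The records, identifiers, and field layout below are hypothetical:

```python
def reconcile(serious_adrs, routine_aes):
    """Return serious ADRs absent from the routine AE dataset.

    Each dataset is a set of (patient_id, ae_term, grade) tuples; every
    expedited serious ADR should also appear in the routine AE data, so
    any asymmetric difference needs resolution before publication.
    """
    return sorted(serious_adrs - routine_aes)

# Hypothetical example records for two patients.
serious = {("001", "febrile neutropenia", 3), ("002", "pneumonitis", 4)}
routine = {("001", "febrile neutropenia", 3), ("001", "nausea", 2)}
```

Running the check both ways also surfaces routine grade 3+ events that were never expedited, which may themselves deserve review during data reconciliation.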
References
1. Drug Approval Application Process. (Accessed October 5, 2009, from http://www.fda.gov/Drugs/DevelopmentApprovalProcess/HowDrugsareDevelopedandApproved/ApprovalApplications/InvestigationalNewDrugINDApplication/default.htm.)
2. Erbitux Highlights of Prescribing Information. (Accessed October 5, 2009, from http://www.accessdata.fda.gov/drugsatfda_docs/label/2009/125084s167lbl.pdf.)
3. Definition of Toxicity. 2006. (Accessed at http://dictionary.reference.com/browse/toxicity.)
4. Common Terminology Criteria for Adverse Events. 2006. (Accessed at http://ctep.cancer.gov/protocolDevelopment/electronic_applications/docs/ctcaev3.pdf.)
5. IND Safety. (Accessed October 5, 2009, from http://ecfr.gpoaccess.gov/cgi/t/text/text-idx?c=ecfr&sid=989651683bfd9ebd47fbb7e1ccc167eb&rgn=div5&view=text&node=21:5.0.1.1.3&idno=21#21:5.0.1.1.3.2.1.7.)
6. Fosbøl EGG, Jacobsen S, Folke F, et al. Risk of myocardial infarction and death associated with the use of nonsteroidal anti-inflammatory drugs (NSAIDs) among healthy individuals: a nationwide cohort study. Clinical Pharmacology and Therapeutics; 2008.
7. Cohen MH, Gootenberg J, Keegan P, Pazdur R. FDA drug approval summary: bevacizumab (Avastin) plus carboplatin and paclitaxel as first-line treatment of advanced/metastatic recurrent nonsquamous non-small-cell lung cancer. Oncologist. 2007;12:713–718.
8. Clinical Safety Data Management: Definitions and Standards for Expedited Reporting; 1994. (Accessed at http://www.ich.org/LOB/media/MEDIA436.pdf.)
9. Trotti A, Colevas AD, Setser A, et al. CTCAE v3.0: development of a comprehensive grading system for the adverse effects of cancer treatment. Semin Radiat Oncol. 2003;13:176–181.
10. Cox JD, Stetz J, Pajak TF. Toxicity criteria of the Radiation Therapy Oncology Group (RTOG) and the European Organization for Research and Treatment of Cancer (EORTC). Int J Radiat Oncol Biol Phys. 1995;31:1341–1346.
11. Miller AB, Hoogstraten B, Staquet M, Winkler A. Reporting results of cancer treatment. Cancer. 1981;47:207–214.
12. Denis F, Garaud P, Bardet E, et al. Late toxicity results of the GORTEC 94-01 randomized trial comparing radiotherapy with concomitant radiochemotherapy for advanced-stage oropharynx carcinoma: comparison of LENT/SOMA, RTOG/EORTC, and NCI-CTC scoring systems. Int J Radiat Oncol Biol Phys. 2003;55:93–98.
13. Rubin P, Constine LS, Fajardo LF, Phillips TL, Wasserman TH. RTOG Late Effects Working Group. Overview. Late Effects of Normal Tissues (LENT) scoring system. Int J Radiat Oncol Biol Phys. 1995;31:1041–1042.
14. Welcome to the Common Terminology Criteria for Adverse Events (CTCAE) v3.0 Online Frequently Asked Questions (FAQ). (Accessed at https://webapps.ctep.nci.nih.gov/webobjs/ctc/webhelpfaq/Welcome_to_CTCAE.htm.)
15. Scharf O, Colevas AD. Adverse event reporting in publications compared with sponsor database for cancer clinical trials. J Clin Oncol. 2006;24:3933–3938.
16. RTOG 0522 H&N Adverse Event Grading Tool. (Accessed at http://www.rtog.org/members/protocols/0522/AEGradingTool.html.)
17. Basch E, Artz D, Dulko D, et al. Patient online self-reporting of toxicity symptoms during chemotherapy. J Clin Oncol. 2005;23:3552–3561.
18. Basch E, Iasonos A, Barz A, et al. Long-term toxicity monitoring via electronic patient-reported outcomes in patients receiving chemotherapy. J Clin Oncol. 2007;25:5374–5380.
19. Basch E, Artz D, Iasonos A, et al. Evaluation of an online platform for cancer patient self-reporting of chemotherapy toxicities. J Am Med Inform Assoc. 2007;14:264–268.
20. Trotti A, Colevas AD, Setser A, Basch E. Patient-reported outcomes and the evolution of adverse event reporting in oncology. J Clin Oncol. 2007;25:5121–5127.
21. Wagner LI, Wenzel L, Shaw E, Cella D. Patient-reported outcomes in phase II cancer clinical trials: lessons learned and future directions. J Clin Oncol. 2007;25:5058–5062.
22. Protocol Development. (Accessed at http://ctep.cancer.gov/protocolDevelopment/default.htm.)
23. Simon R, Freidlin B, Rubinstein L, Arbuck SG, Collins J, Christian MC. Accelerated titration designs for phase I clinical trials in oncology. J Natl Cancer Inst. 1997;89:1138–1147.
24. Trotti A, Pajak TF, Gwede CK, et al. TAME: development of a new method for summarizing adverse events of cancer treatment by the Radiation Therapy Oncology Group. Lancet Oncol. 2007;8:613–624.
19
Interim Analysis of Phase III Trials
Edward L. Korn Boris Freidlin
Phase III randomized clinical trials typically involve hundreds, or even thousands, of patients and can take years to complete; see Chapter 11 for a detailed discussion of phase III trials. The ultimate goal of a randomized trial is to provide evidence on the benefit-to-risk ratio that is sufficiently compelling to change medical practice. Due to the unique nature of clinical research, the design of randomized trials requires careful consideration to maintain the delicate balance between acquiring as much evidence as possible about treatment efficacy and releasing results as soon as possible so that the patients in the trial and future patients can be best treated. In other words, the study should not continue longer than necessary to answer the relevant clinical question. Therefore, if it becomes "obvious" what the conclusion of a trial will be before its regularly scheduled end, then it is important to stop the trial and release the results to the public. Besides benefiting current and future patients by giving them access to the trial results, this enables the patients who would have participated in the trial to participate in other trials. However, an undisciplined examination of accruing outcome data can lead to the release of results that do not have acceptable statistical validity. For example, if one examines accruing outcome data from a trial and stops the trial and declares the new treatment better than the standard treatment as soon as the p-value falls below .05, then the chance of declaring the new treatment better when it is actually not (a type I error)
can be much higher than 5%, for example, 25% with 20 examinations. On the other hand, if one stops a trial based on a subjective appraisal of accruing data that the new therapy is not looking promising, then truly effective therapies could easily be dismissed. This chapter discusses formal interim analyses of outcome data, which preserve the operating characteristics of a trial (type I error, power) while allowing the possibility of stopping early when the trial results become definitive ahead of schedule. In the next sections we consider, in turn: specifying an interim analysis plan; stopping when one treatment arm is clearly superior to another; interim analysis for futility, that is, stopping a trial when it is clear that a new treatment will not be better than a standard treatment; interim analysis for randomized clinical trials addressing a noninferiority question; the role of independent data monitoring committees in the interim monitoring process; and the potential costs of interim monitoring. We end with a summary.
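The inflation of the type I error under undisciplined repeated testing can be verified by simulation. The sketch below accumulates normally distributed data under the null in equal blocks and tests at the nominal two-sided 5% level after each block; with 20 looks the false-positive rate comes out at roughly a quarter, consistent with the 25% figure quoted above:

```python
import random

def prob_false_positive(n_looks: int, n_trials: int = 20000,
                        z_crit: float = 1.96, seed: int = 1) -> float:
    """Monte Carlo estimate of the chance that a null trial crosses
    |Z| > z_crit at ANY of n_looks equally spaced interim looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        s = 0.0
        for k in range(1, n_looks + 1):
            s += rng.gauss(0.0, 1.0)        # one "block" of new data
            if abs(s / k ** 0.5) > z_crit:  # Z-statistic at look k
                hits += 1
                break
    return hits / n_trials
```

With a single look the estimate sits near the nominal 0.05; with 20 looks it climbs to roughly 0.25 (Monte Carlo error aside), which is exactly the problem formal interim analysis plans are designed to prevent.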
SPECIFYING AN INTERIM ANALYSIS PLAN

The study protocol should include a formal interim analysis plan that prospectively designates the times when the interim analyses will be performed, as well as the decision rules that will be used at each time to decide whether the outcome data are extreme enough to stop.
163
164
ONCOLOGY CLINICAL TRIALS
The analysis times are typically considered in terms of the information fraction, which is the proportion of the information available in the data at the interim analysis time relative to the information available at the regularly scheduled end of the trial (100%, by definition). For example, an interim analysis of the response rates of the first 100 patients of a 400-patient trial with a response-rate end point would be considered at 25% information. With time-to-event data (survival data), the information is quantified by the number of events (e.g., deaths), so that an analysis after the first 100 deaths in a trial designed for a definitive analysis at 200 deaths would be considered at 50% information. In practice, it is frequently more convenient to specify the interim analysis time in terms of calendar time (e.g., every 6 months at the biannual meetings of the Data Monitoring Committee). Many common interim monitoring designs accommodate calendar timing using the "spending function approach" (1). For a given interim analysis time, the prespecified decision rule would typically be specified in terms of a p-value (e.g., stop if p < 0.0005), a Z-value representing a normalized test statistic (e.g., stop if Z > 3.29), or a confidence interval (e.g., stop if the 99% confidence interval for the hazard ratio is completely above 0.75).
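The "spending function approach" mentioned above can be illustrated with a Lan-DeMets O'Brien-Fleming-type spending function, which maps any information fraction t to the cumulative one-sided type I error that may be spent by that point. This is a sketch of the spending function only; turning spent alpha into actual stopping boundaries additionally requires the joint distribution of the sequential test statistics.

```python
from statistics import NormalDist

def obrien_fleming_spend(t: float, alpha: float = 0.025) -> float:
    """Lan-DeMets O'Brien-Fleming-type alpha-spending function (one-sided).

    Returns the cumulative type I error "spent" by information fraction
    t (0 < t <= 1). Because t can be whatever the information fraction
    turns out to be at a calendar-scheduled monitoring-committee meeting,
    this accommodates calendar-driven analysis times.
    """
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    return 2.0 * (1.0 - nd.cdf(z / t ** 0.5))
```

At t = 1 the full 0.025 is spent; at t = 0.25 almost nothing is, which is why O'Brien-Fleming-type first-look boundaries are so stringent (compare the p < 0.000007 entry in Table 19.1).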
STOPPING FOR SUPERIORITY (EFFICACY)

Superiority (or efficacy) monitoring refers to monitoring for convincing evidence that one of the study arms is superior to another with respect to the relevant clinical outcome. Most randomized trials in oncology are designed to establish that an experimental therapy (arm E) is better than the control therapy (arm C) by comparing (a) new therapy (E) versus standard therapy (C), (b) new therapy (E) versus observation/placebo (C), or (c) new plus standard therapy (E) versus standard therapy (C). With rare exception, these comparisons are one-sided in nature because there is interest only in whether E is better than C, and no interest in
proving C is better than E. In terms of interim monitoring, one will want strong evidence that E is better than C to stop for superiority, but would not require strong evidence that C is better than E to stop for futility, as described in the next section. A formal interim monitoring plan for superiority is designed to ensure that the type I error of the trial is preserved at the nominal level (e.g., a one-sided type I error equal to 0.025) and that the power of the trial is the desired level (e.g., 90%) with the given sample size. Even with these constraints, a large number of analysis plans are possible (2, 3). Although there are different methodologies for superiority monitoring, the methods are all completely characterized by their monitoring boundaries. Two popular monitoring plans are shown in Table 19.1 for interim analyses at 25%, 50%, and 75% of the information. The Haybittle-Peto monitoring boundary (4) uses the same very small p-value (0.0005) at each interim look, whereas the O'Brien-Fleming boundary (5) has increasing p-value cutoffs. Note that the p-value cutoffs for the final analysis are less than the nominal 0.025 type I error of the trial. That is, if the trial does not stop by crossing the p-value threshold at one of the interim analyses, a p < 0.0246 (Haybittle-Peto) or p < 0.022 (O'Brien-Fleming) is required to declare statistical significance (at the p < 0.025 level) at the end of the trial. A trial without interim monitoring would only require p < 0.025 at the end of the trial. These interim monitoring plans necessitate a negligible increase in the sample size to obtain the same power as a trial without interim monitoring: 1% and 2% increases, respectively, for the Haybittle-Peto and O'Brien-Fleming designs described in Table 19.1.

Other Considerations

Although, as mentioned above, there are many possible interim monitoring plans, there are some general considerations involved in choosing a plan.
First, the p-values required for stopping should not get smaller
TABLE 19.1
One-Sided p-Values for Two Different Monitoring Plans for Testing the Superiority of a New Treatment over a Standard Treatment (One-Sided Type I Error = 0.025).

Monitoring Plan   | 25%           | 50%        | 75%        | 100% (final)
Haybittle-Peto    | p < 0.0005    | p < 0.0005 | p < 0.0005 | p < 0.0246
O'Brien-Fleming   | p < 0.000007  | p < 0.0015 | p < 0.0092 | p < 0.022

(Columns give the information fraction at each analysis.)
with increasing information, because it is desirable to require more extreme evidence to stop a trial earlier. Second, the p-value at the final analysis should not be too much smaller than the type I error of the trial. The intuitive motivation for this requirement is that it minimizes the possibility that a study that proceeded to the full sample size would come to a different conclusion depending on whether it employed interim monitoring or not. In addition, keeping the final-analysis p-value close to the type I error minimizes the sample size required to preserve the power of the trial. Third, to ensure that the study provides credible information on all aspects of the benefit-to-risk ratio, one should not start the interim analyses too early, for example, when there is less than 25–35% information. Finally, once interim monitoring begins, there is very little cost in having frequent monitoring from then on (6). For example, the NCI-sponsored Cancer and Leukemia Group B cooperative group formally monitors trials every 6 months after monitoring begins. In the unusual circumstance that the clinical question being asked is symmetric (e.g., comparing two new drugs), two-sided symmetric superiority interim monitoring can be used. For example, one might stop at an interim analysis if the two-sided p-value for testing the null hypothesis of no difference is < 0.001. This can be compared with an asymmetric boundary involving futility monitoring that can be used for one-sided questions, as described in the next section.
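That a boundary set like the Haybittle-Peto plan in Table 19.1 preserves the overall type I error can be checked by simulating the sequential Z-statistics under the null as a standardized Brownian motion observed at the information fractions. This Monte Carlo sketch uses the one-sided cutoffs from Table 19.1:

```python
import random
from statistics import NormalDist

def haybittle_peto_type1(n_sims: int = 40000, seed: int = 2) -> float:
    """Monte Carlo check that the Haybittle-Peto plan in Table 19.1
    (one-sided p < 0.0005 at 25/50/75% information, p < 0.0246 at the
    end) has overall one-sided type I error close to 0.025 under the null."""
    nd = NormalDist()
    looks = [0.25, 0.50, 0.75, 1.00]
    cuts = [nd.inv_cdf(1 - p) for p in (0.0005, 0.0005, 0.0005, 0.0246)]
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        b, t_prev = 0.0, 0.0
        for t, z_cut in zip(looks, cuts):
            b += rng.gauss(0.0, (t - t_prev) ** 0.5)  # Brownian increment
            t_prev = t
            if b / t ** 0.5 > z_cut:                  # Z-statistic at this look
                rejections += 1
                break
    return rejections / n_sims
```

The estimated overall rejection probability comes out close to the nominal one-sided 0.025, illustrating why only a negligible sample-size increase is needed to retain power.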
Three Examples

The Radiation Therapy Oncology Group conducted RTOG-9001 to see whether concurrent chemotherapy (versus none) would improve the overall survival of high-risk advanced cervical cancer patients receiving radiation therapy (7). The trial was designed to accrue 400 patients over 4 years, with an additional 4 years of follow-up for the definitive analysis, when there were expected to be 199 deaths (7). At a scheduled interim analysis 8 months after accrual was complete, the outcome data crossed an interim-monitoring boundary for superiority with p = 0.0027 (and 115 deaths) (K. Winter, personal communication, 2008). The data monitoring committee recommended releasing the data, and they were published (7) with an additional 4 months of follow-up; the survival curves are displayed in Figure 19.1(A) (p = 0.004). With an additional 4 years of follow-up, the results were confirmed (8) (Fig. 19.1(B), p < 0.0001, 161 deaths). The interim analyses allowed the public to be informed of these results approximately 4 years earlier than if there had been no interim analyses. For the second example, consider the joint analysis of the National Surgical Adjuvant Breast and Bowel Project trial NSABP B-31, which compared chemotherapy with or without trastuzumab for women with surgically removed HER2-positive breast cancer, with the same two treatment arms of the three-armed trial N9831 conducted by the North Central Cancer Treatment Group (9). The primary outcome was disease-free survival, with a total accrual of 4,900 and the definitive
FIGURE 19.1 Overall survival of advanced cervical cancer patients treated with radiotherapy and chemotherapy (top curves) or radiotherapy alone (bottom curves) on RTOG-9001. (A) Curves 4 months after data released. (B) Curves with an additional 4 years of follow-up. (Adapted from Figure 1 of Morris et al., 1999, and Figure 1 of Eifel et al., 2004.)
analysis scheduled when 710 events had occurred (9). A scheduled interim analysis stopped the trial for superiority with p < 0.0001 (394 events in 3,676 patients); the disease-free survival curves are displayed in Figure 19.2(A) (9). An updated analysis (10) 2 years later confirmed the results (Fig. 19.2(B), p < 0.0001, 619 events). The interim analyses allowed the public to be informed of these results earlier, and allowed fewer patients to be treated on the trials than if there had been no interim analyses.
FIGURE 19.3
Disease-free survival of hepatocellular cancer patients treated with adjuvant I131 (dashed line) or not (dotted line). (Adapted from Figure 2 of Lau et al., 1999.)
The third example presents a problematic application of interim monitoring for superiority. It was a trial of a single postsurgical dose of intra-arterial lipiodol-131 for resectable hepatocellular cancer (11). The primary end point was disease-free survival, and the trial was designed with a total sample size of 120 (11). The trial had a single prespecified interim analysis after 30 patients had been evaluated, using a so-called Pocock boundary. The Pocock boundary, which is not advocated by Pocock himself (12), uses the same p-value cutoff for interim monitoring as for the final analysis. In the present example, with one interim analysis, the p-value cutoff is 0.029 to ensure a 0.05 type I error. Pocock boundaries are not recommended because the p-value cutoffs are not small enough for early looks, and the final-analysis p-value cutoff is too far from the nominal level. In the trial in question, the p-value after 30 patients had been evaluated was p = 0.01, and the trial was stopped and the results reported (13). A later report and an update (11, 14) included in the analysis another 13 patients who were randomized before the trial was stopped; Figure 19.3 displays the survival curves (n = 43, p = 0.037). With such a small number of patients and a modestly significant p-value for early stopping, one could argue that it would have been better not to stop this trial when it was stopped (12).
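The claim that a per-look cutoff of 0.029 yields an overall 0.05 type I error with one interim look can be checked with a quick Monte Carlo. This sketch is our own illustration, not from the chapter: it models the trial's test statistic as Brownian motion under the null, with one two-sided look at 50% information and the same nominal cutoff at both looks (the Pocock property).

```python
import random
from statistics import NormalDist

N = NormalDist()

def overall_type1(nominal_p=0.029, n_sims=200000, seed=11):
    """Monte Carlo check of the Pocock boundary: one interim look at
    50% information plus a final look, both using the SAME two-sided
    nominal cutoff; returns the overall type I error under the null."""
    rng = random.Random(seed)
    z_cut = N.inv_cdf(1 - nominal_p / 2)
    hits = 0
    for _ in range(n_sims):
        b_half = rng.gauss(0, 0.5 ** 0.5)        # B-value at 50% information
        z1 = b_half / 0.5 ** 0.5                 # interim z-statistic
        z2 = b_half + rng.gauss(0, 0.5 ** 0.5)   # final z (B-value at t = 1)
        if abs(z1) > z_cut or abs(z2) > z_cut:
            hits += 1
    return hits / n_sims
```

Running `overall_type1()` gives a value close to 0.05, consistent with the 0.029 cutoff quoted in the text; with the unadjusted 0.05 cutoff at both looks, the overall error would instead be inflated well above 0.05.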
FIGURE 19.2 Disease-free survival of HER2-positive breast cancer patients treated with trastuzumab (top curves) or not (bottom curves) on NSABP B-31/N9831. (A) Curves when the trial was stopped. (B) Curves with an additional 2 years of follow-up. (Adapted from Figure 2 of Romond et al., 2005, and slide 12 of Perez et al., 2007.)
INTERIM ANALYSIS FOR LACK OF BENEFIT (FUTILITY)

For one-sided questions (e.g., new plus standard therapy [arm E] vs. standard therapy alone [arm C]), we described in the previous section how one could use
interim monitoring to stop a trial early based on evidence that E was doing very much better than C. But suppose that, part way through a trial, C were doing very much better than E. Clearly one would want to consider stopping the trial early in this situation. However, there is generally no need to prove definitively that E is more harmful than C for there to be no remaining interest in E. The interim monitoring guidelines should reflect this asymmetry in the underlying therapeutic question; that is, less stringent evidence is often sufficient for stopping early and concluding that E is no better than C than would be required to stop early for superiority of E. Monitoring of this latter type is often referred to as futility monitoring. A simple type of futility monitoring is the following: after 50% information
has been accrued in the trial, if the results are going in the wrong direction (i.e., C is doing better than E), then stop the trial for futility (15, 16). Note that one is not requiring C to be doing significantly better than E here, just better by any amount. A slightly more complex but popular methodology (17) tests at various information fractions whether the data are consistent with the alternative hypothesis at some stringent type I error level (e.g., 0.005); if not, the trial is stopped for futility. Another methodology (18) calculates at various information fractions the conditional power (the probability that the trial will be positive if it continues to its regularly scheduled end); if the conditional power is too low, the trial is stopped for futility. As with superiority monitoring, although there are different methodologies for futility monitoring, the methods are all completely characterized by their futility-monitoring boundaries.

TABLE 19.2
Suggested Aggressiveness of Stopping Guidelines for Lack of Benefit Based on the Type of Control Arm and Toxicity Profile of the New Therapy (More Aggressive Means Easier to Stop for Lack of Benefit).

1. Control arm: placebo or observation (a). New therapy: low toxicity.
   Early lack-of-benefit boundary (<50% of information): moderately aggressive early stopping; for example, stop for lack of benefit if the alternative hypothesis can be rejected at the 0.001 level (b).
   Late lack-of-benefit boundary (≥50% of information): moderately aggressive late stopping; for example, stop for lack of benefit if the alternative hypothesis can be rejected at the 0.01 level.

2. Control arm: placebo or observation (a). New therapy: high toxicity.
   Early boundary: aggressive early stopping; for example, stop for lack of benefit if the alternative hypothesis can be rejected at the 0.005 level.
   Late boundary: aggressive late stopping; for example, stop for lack of benefit if the alternative hypothesis can be rejected at the 0.02 level.

3. Control arm: active therapy with proven benefit. New therapy: less toxic than control.
   Early boundary: moderately aggressive early stopping; for example, stop for lack of benefit if the alternative hypothesis can be rejected at the 0.001 level.
   Late boundary: conservative late stopping; for example, continue unless the hazard ratio of the new over the standard treatment is greater than 1 (i.e., the experimental arm appears worse).

(a) A similar approach is recommended for the add-on design, i.e., control arm = standard therapy and experimental arm = standard + new therapy. (b) All levels in the table are one-sided.

A major consideration in futility monitoring is how aggressive it should be in stopping trials that appear to be futile. We have discussed this in detail elsewhere (19) and summarize our recommendations here (Table 19.2). In studies with no active therapy in the control arm (placebo or observation), the aggressiveness of the futility monitoring should depend on the degree of morbidity of the experimental arm: the more morbidity, the more aggressive the monitoring (rows 1 and 2 of Table 19.2). The stopping boundaries correspond approximately to observing a negative trend during the first half of the study (with <50% of information) and observing no benefit or even a small positive trend in the second half. In studies that compare a new, less toxic therapy to an established active control, futility monitoring should take into consideration that a marginal improvement over the active control may still be a valuable contribution to the treatment of the disease. In this setting, a moderately aggressive stopping boundary should still be used in the first half of the trial to guard against unnecessary exposure of trial participants to suboptimal medical care (row 3 of Table 19.2). However, in the second half of the trial, stopping for futility if a small improvement is observed may be counterproductive (especially in diseases with few effective therapies, such as pancreatic cancer). Instead, a conservative boundary that allows the study to continue unless the new therapy appears worse is recommended. In fact, for new agents with favorable toxicity profiles that are expected to provide only a marginal improvement over the standard therapy, so-called hybrid designs, which allow testing for noninferiority as well as superiority of the new therapy, may be more appropriate (20).
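The two futility methodologies described above, rejecting the design alternative at a stringent level (as in the Table 19.2 examples) and computing conditional power, can be sketched numerically. The following is our own illustration, not the chapter's code: it uses a one-sided test of the design alternative on the log hazard ratio scale, and the standard Brownian-motion (B-value) model for conditional power; the function names and example numbers are ours.

```python
import math
from statistics import NormalDist

N = NormalDist()

def rejects_alternative(hr_est, se_log_hr, alt_hr=0.77, level=0.005):
    """Table 19.2-style rule: stop for lack of benefit if the design
    alternative (hazard ratio alt_hr < 1 favoring the new arm) can be
    rejected one-sided at `level`, i.e., the observed log hazard ratio
    is significantly worse (larger) than the alternative."""
    z = (math.log(hr_est) - math.log(alt_hr)) / se_log_hr
    return z > N.inv_cdf(1 - level)

def conditional_power(z_t, t, drift, alpha=0.025):
    """Probability that the final z-statistic exceeds its critical value,
    given the interim statistic z_t at information fraction t, under the
    B-value model; `drift` is the expected final z under the alternative."""
    b_t = z_t * math.sqrt(t)            # B-value at time t
    mean_b1 = b_t + drift * (1 - t)     # expected final B-value
    z_crit = N.inv_cdf(1 - alpha)
    return 1 - N.cdf((z_crit - mean_b1) / math.sqrt(1 - t))
```

For example, `rejects_alternative(1.10, 0.12)` is True while `rejects_alternative(0.85, 0.12)` is False, and a trial sitting exactly at the null (z_t = 0) halfway through a design with roughly 80% power (drift about 2.8) has a conditional power of roughly 0.21, which many monitoring plans would regard as grounds for stopping.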
Other Considerations

Common implementations of two popular approaches to constructing a futility boundary (21, 22) have the property that at later interim analyses the boundary will suggest stopping for futility of E even though E is doing moderately better than C. We recommend that, in most cases, futility boundaries be chosen to prevent this from happening (23). The inclusion of futility monitoring slightly lowers the type I error of the trial, because there is less chance of declaring E better than C than when there is no futility monitoring. In theory, this could be used to slightly lower the required sample size of the trial. However, we do not recommend doing this, because a trial may not be stopped after crossing a futility boundary for a variety of reasons (24), thus inflating the overall type I error above the design level.

Some trials designed to assess therapies intended to improve QOL-related outcomes include mortality only as a secondary end point (or not at all). Even if one does not expect these therapies to make the underlying disease worse, such trials should incorporate formal stopping guidelines for detriment in survival (or any other efficacy outcome that captures relevant clinical events, e.g., disease-free survival) (25). The particulars of these guidelines will be determined by the clinical setting, but they should be present. For example, in a trial of epoetin alfa in non-small-cell lung cancer (26), the primary end point was quality of life, but the trial was stopped early when a non-prespecified interim analysis showed the epoetin arm had worse survival than the placebo arm (hazard ratio = 1.84, p = 0.04).

Three Examples

The Cancer and Leukemia Group B conducted CALGB-89803 to see whether the addition of irinotecan to fluorouracil plus leucovorin would improve the overall survival of stage III colon cancer patients (27). The final analysis was to be based on accruing a sample size of 1,260 with additional follow-up to observe an expected 356 deaths, giving 82% power to detect a hazard ratio of 0.77 (one-sided type I error = 0.05) (27). A prespecified futility boundary called for stopping the trial if the lower limit of the two-sided 99% confidence interval for the hazard ratio was greater than 0.77 at one of the formal interim analysis times. The futility boundary was crossed 26 months after accrual was completed (with 67% of the total 356 expected deaths observed) and the results were released (28). Further follow-up (27) confirmed the lack of benefit of the irinotecan (Fig. 19.4, one-sided p = 0.74, 352 deaths).

FIGURE 19.4 Overall survival of colon cancer patients treated with irinotecan (dotted line) or not (dashed line) in CALGB-89803. (Adapted from Figure 1 of Saltz et al., 2007.)
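Event targets in designs like CALGB-89803's are commonly derived from Schoenfeld's approximation for the log-rank test. The sketch below is our own illustration; the trial's exact design assumptions are not stated in the text, and indeed this textbook formula gives roughly 384 events for the stated parameters rather than the trial's 356, so it should be read as a ballpark check only.

```python
import math
from statistics import NormalDist

N = NormalDist()

def schoenfeld_events(alt_hr, power, alpha_one_sided, alloc=0.5):
    """Schoenfeld's approximate number of events for a log-rank test to
    detect hazard ratio alt_hr with the given power (one-sided alpha);
    alloc is the fraction randomized to the experimental arm."""
    z_a = N.inv_cdf(1 - alpha_one_sided)
    z_b = N.inv_cdf(power)
    return (z_a + z_b) ** 2 / (alloc * (1 - alloc) * math.log(alt_hr) ** 2)

d = schoenfeld_events(0.77, 0.82, 0.05)   # roughly 384 events for these inputs
```

Note that the required number of events, not the number of patients, drives the power of a survival comparison; the 1,260-patient accrual target simply ensures the event target is reached within a reasonable follow-up period.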
The futility monitoring allowed these results to be made public years earlier than if there had been no monitoring. For the second example, consider the Gynecologic Oncology Group trial GOG-0165, which was designed to assess whether protracted venous infusion of fluorouracil was superior to the standard treatment of weekly cisplatin in advanced cervical cancer patients who were receiving radiation (29). The primary end point was progression-free survival. The final analysis was to be based on a sample size of 416 patients with 26 months of follow-up to observe 150 events, with the trial designed to have 80% power to detect a hazard ratio of 0.67 (one-sided type I error = 5%) (29). A prespecified futility monitoring rule was to stop the trial for futility after 50 events if progression-free survival was worse (by any amount) on the experimental arm as compared to the standard treatment (29). In fact, at the time of the interim analysis (58 events in 316 eligible patients), the experimental arm was doing worse than the control arm (hazard ratio = 1.35) and accrual was stopped (29). With further follow-up (143 events), the experimental arm continued to look worse than the control treatment (Fig. 19.5, hazard ratio = 1.29, 95% confidence interval = 0.93 to 1.74). The futility monitoring allowed 25% fewer patients to be treated on this trial than if there had been no futility analysis. We note that the futility monitoring used in this trial was more aggressive than is typical; the usual futility rule with one interim analysis is, after 50% information (not 33% as in this trial), to stop the trial if the experimental treatment is looking worse (by any amount) than the control treatment.
FIGURE 19.6 Overall survival of advanced pancreatic cancer patients treated with gemcitabine (dotted line) or BAY 12–9566 (dashed line). (Adapted from Figure 1 of Moore et al., 2003.)
Our third example presents a problematic application of interim monitoring for futility. It was a National Cancer Institute of Canada Clinical Trials Group trial of the matrix metalloproteinase inhibitor BAY 12–9566 versus gemcitabine for advanced pancreatic cancer (30). As gemcitabine was a standard active treatment in this setting, one should use moderately aggressive futility monitoring early in the trial (Table 19.2). Instead, the comparative analysis used a symmetric boundary in which the trial would stop for futility only if, at 50% information, the experimental arm was statistically significantly worse than the gemcitabine arm with a two-sided p-value < 0.0056 (30). As it happened, the trial did stop at this interim analysis, with p < 0.001 (Fig. 19.6) (30). Earlier interim analyses with an appropriate asymmetric boundary might have stopped this trial sooner.
FIGURE 19.5 Progression-free survival of cervical cancer patients treated with cisplatin (dashed line) or fluorouracil (dotted line) in GOG-0165. (Adapted from Figure 1 of Lanciano et al., 2005.)

MULTI-ARM AND FACTORIAL DESIGNS
To allow a single study to address several clinical questions at one time, some randomized trials employ factorial or multi-arm designs; see Chapter 12 for a detailed discussion of these types of designs. Interim monitoring plans for these designs should take into account the effect of the early stopping of one of the questions on the ability of the study to address the other study questions. One commonly used multi-arm design is the factorial design. Consider a 2 × 2 design that evaluates two new drugs A and B. Patients are randomized among the four arms (1) observation, (2) A alone, (3) B alone, or (4) A + B. The two main comparisons (A vs. no A, and B vs. no B) can have independent monitoring plans. If, for example, the drug B question is resolved early, then the remaining patients are randomized between A + B and B (if B was shown to be efficacious) or between A and observation (if B was shown to lack benefit).

With multi-arm trials that are not factorial designs, interim monitoring guidelines should depend on the aims of the trial and can become more complex. Some multi-arm designs are focused on comparing a number of experimental arms to a single control arm. Interim analyses for futility can proceed in each experimental arm independently, because dropping one or more of the experimental arms does not stop the rest of the trial from continuing. However, if one of the experimental arms is deemed superior to the control arm, then the control arm would need to be discontinued, potentially stopping the whole trial (unless comparing the experimental arms is considered justified and feasible). Because of this, we recommend more conservative superiority interim monitoring for a multi-arm trial than for a two-armed trial (31). When several therapies are evaluated in a setting with no established standard treatment, a multi-arm trial design needs to compare each arm with each other arm in a pair-wise manner. For example, the Southwest Oncology Group trial SWOG-8203 was designed to compare doxorubicin, mitoxantrone, and bisantrene arms in metastatic breast cancer (32). In a multi-arm design like this, the monitoring plan can be based on dropping any treatment arm that is shown to be sufficiently inferior to any other treatment arm at an interim analysis time.

Two Examples

The European Organization for Research and Treatment of Cancer conducted EORTC-08923 to address two questions in a 2 × 2 factorial design with small-cell lung cancer patients (33). One question was whether an intensified dose of cyclophosphamide-doxorubicin-etoposide chemotherapy would increase survival compared to the standard dose; the second question was whether prophylactic antibiotics would decrease the incidence of febrile leukopenia, a common toxicity of the chemotherapy (33). The design called for a total of 240 patients (60 in each arm). After 163 patients were randomized, an interim analysis showed that the incidence of febrile leukopenia was reduced from 43% (34/79) to 24% (20/82) by the use of the antibiotics (p = 0.007) (33). At that point the two without-antibiotics treatment arms were dropped, and the next 81 patients were all given the antibiotics and randomized between the standard and intensified doses of the chemotherapy. A subsequent report (34) showed no survival benefit for the intensified chemotherapy (median survival = 52 weeks) over the standard chemotherapy (median survival = 54 weeks). We note that at the time of the interim analysis, there was a suggestion of an interactive effect of the treatments on the incidence of febrile leukopenia: for those receiving the standard chemotherapy, the incidences were 24% with the antibiotics and 29% without, whereas for those receiving the intensified chemotherapy, the incidences were 24% with the antibiotics and 56% without. This suggests that another reasonable strategy would have been to drop only the without-antibiotics intensified chemotherapy arm and use a 2:1:1 randomization for the remaining three arms (i.e., randomizing twice as many patients to the antibiotics intensified chemotherapy arm relative to either standard chemotherapy arm). This approach might have improved the study's ability to answer the relevant chemotherapy and antibiotic prophylaxis questions. The second example is E-3200, a three-armed trial conducted by the Eastern Cooperative Oncology Group to assess survival differences between chemotherapy (FOLFOX4) plus bevacizumab (arm A) versus the control arm of chemotherapy alone (arm B), and between bevacizumab alone (arm C) versus arm B, in previously treated metastatic colon cancer (35). At an interim analysis, arm C suggested inferior survival to the control arm (35). From that time, patients continued to be randomized to arms A and B, with no more patients randomized to arm C. At a later interim analysis, after accrual was complete, the comparison of arm A with arm B crossed a superiority boundary and the results were released (36). Figure 19.7 displays the survival curves with further follow-up.
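The rationale for more conservative superiority boundaries in multi-arm trials can be illustrated with a toy Monte Carlo: with k experimental arms each compared to a control at an unadjusted one-sided 0.025 level, the chance of at least one false-positive comparison grows well above 0.025. This sketch is our own illustration and deliberately treats the comparisons as independent, ignoring the correlation induced by the shared control arm; the chapter recommends conservatism without prescribing a particular adjustment.

```python
import random
from statistics import NormalDist

def familywise_error(k, alpha=0.025, n_sims=100000, seed=7):
    """Monte Carlo estimate of the chance that at least one of k
    independent experimental-vs-control comparisons crosses an
    unadjusted one-sided level-alpha boundary under the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    hits = sum(
        any(rng.gauss(0, 1) > z_crit for _ in range(k))
        for _ in range(n_sims)
    )
    return hits / n_sims
```

With k = 3 the estimate is near 1 - 0.975**3, about 0.073, roughly triple the nominal 0.025, which is why per-comparison boundaries are typically tightened in multi-arm settings.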
FIGURE 19.7 Overall survival of metastatic colorectal cancer patients treated with (A) FOLFOX4 plus bevacizumab (dotted line), (B) FOLFOX4 (control treatment, dashed line), or (C) bevacizumab alone (solid line) on E-3200. (Adapted from Figure 1 of Giantonio et al., 2007.)
NONINFERIORITY TRIALS

In a noninferiority trial, one is trying to demonstrate that a standard treatment can be replaced by a lesser treatment. For example, can a new, less toxic regimen (E) be substituted for a standard regimen (S) without resulting in inferior efficacy? At the end of the trial, one either concludes that E is noninferior to S or does not conclude this. In a properly designed trial with sufficient sample size, not concluding that E is noninferior to S is equivalent to demonstrating that E is inferior to S. Chapter 13 discusses noninferiority trials in detail. As with a one-sided superiority trial, there are two ways one might consider stopping a trial early at an interim monitoring time. One might stop a trial early (1) when it became clear that E was inferior to S, or (2) when it became clear that E was noninferior to S. The first type of stopping is more critical, in that patients are not getting an effective standard therapy (i.e., S). On the other hand, even if it becomes clear that E is noninferior to S by the margin of noninferiority prespecified in the protocol, say, 20%, there may still be interest in narrowing the confidence interval for the relative efficacy of E versus S. This suggests aggressive interim monitoring for stopping for E being inferior to S, and less aggressive interim monitoring (or possibly no interim monitoring) for stopping for E being noninferior to S.

Other Considerations

In considering stopping early when it is clear that E is inferior to S, there is the issue of whether one needs to be confident that E is inferior to S by any amount, or whether one additionally needs to be confident that E is worse than S by more than Δ, where Δ is the prespecified margin of noninferiority. For example, suppose the prespecified margin of noninferiority is 20%, and one observes at a monitoring time that E is 30% inferior to S with standard error ±10%.
Then one could be very confident that E is worse than S, but not very confident that E is more than 20% worse than S. Although both types of monitoring for noninferiority have been discussed (37), we recommend stopping when one is confident that E is worse than S by any amount (38). For a well-designed trial, stopping using this method will also ensure that the observed estimate of the treatment effect is well above Δ.

Two Examples

The German Breast Group conducted the GEPARDUO trial to assess whether one preoperative chemotherapy regimen (ADOC) was noninferior to another, more
toxic regimen (AC-DOC) in terms of pathologic complete response rate (pCR) for patients with operable breast cancer (39). At the second interim analysis, at a time when 913 of a planned 1,000 patients had been randomized, the trial was stopped when a large, significant difference was noted between the pCR rates of the two treatment arms (40). With further follow-up, the pCR rates were 7.0% in the ADOC arm and 14.3% in the AC-DOC arm (p < 0.001), demonstrating the inferiority of ADOC as compared to AC-DOC. The second example is the Southwest Oncology Group trial SWOG-8412, which was designed to show that carboplatin-cyclophosphamide was noninferior to cisplatin-cyclophosphamide in advanced ovarian cancer (41). The prespecified margin of noninferiority was a 30% decrement in the hazard of overall survival, and the monitoring boundary included declaring carboplatin-cyclophosphamide noninferior if the null hypothesis that this arm was 30% inferior could be rejected at the 0.005, 0.006, and 0.0075 levels at the three interim monitoring analyses, with approximately 25%, 50%, and 75% information. At the first interim analysis, this null hypothesis could be rejected at the p = 0.001 level, and the trial was stopped with 342 patients enrolled (out of a planned 480). After a thorough discussion, the DMC decided that "results were sufficiently convincing to stop accrual and to report the trial" (42). The survival curves at the time of the interim analysis are displayed in Figure 19.8(A), and with further follow-up in Figure 19.8(B) (hazard ratio = 0.99 [<1 designating carboplatin doing better], 95% confidence interval = 0.77 to 1.29). Because of the relatively large prespecified noninferiority margin, the interim monitoring bound for declaring noninferiority used in this trial was more aggressive than we would typically recommend for an overall survival end point.
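The worked example above (E observed 30% inferior to S, standard error 10%, margin Δ = 20%) can be checked directly. This is a minimal sketch with the chapter's illustrative numbers, assuming a normally distributed estimate on the stated scale:

```python
from statistics import NormalDist

N = NormalDist()

# Interim estimate from the worked example: E is 30% inferior to S,
# standard error 10%, prespecified noninferiority margin 20%.
est, se, delta = 0.30, 0.10, 0.20

z_any    = est / se               # tests "E no worse than S at all"
z_margin = (est - delta) / se     # tests "E no worse than S by Delta"
p_any    = 1 - N.cdf(z_any)       # one-sided p, about 0.0013: very confident
p_margin = 1 - N.cdf(z_margin)    # one-sided p, about 0.16: not confident
```

The two p-values quantify the asymmetry the text describes: one can be quite confident that E is worse than S (z = 3), but not that E is more than 20% worse (z = 1), which is why the recommended stopping rule conditions on the former.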
ROLE OF INDEPENDENT DATA MONITORING COMMITTEES

Interim monitoring of a phase III trial is best performed by a data monitoring committee that is independent of the study team (43–45). The interim analyses should be seen only by the data monitoring committee, as knowledge of interim results by the study investigators (who may over-interpret the results) may lead them to stop accruing patients (46). We believe that patients who are considering enrolling in a trial should be told of these arrangements, that is, that neither they nor their treating physicians will see accruing outcome data, but that the data are being monitored by a data monitoring committee that will stop the trial if the results become definitive. When a trial is in its follow-up period (accrual is over) and in very special circumstances when release of the ongoing trial results could not affect the final trial results, an argument can be made for sometimes releasing the ongoing trial results to the public (47). However, this is a controversial proposal (48–50).

A related concern about independence and interim monitoring concerns the statistician who performs the interim analyses that are presented to the data monitoring committee (51). For the NCI Cooperative Groups, this is the study statistician. Some (52, 53) have argued that this should be an independent statistician, to avoid conflicts of interest and possible leakage of confidential outcome information to the study investigators. For NCI Cooperative Group trials, we believe the benefits of having the study statistician (who has thorough knowledge of the study) perform the interim analyses outweigh the potential harms (54, 55). However, when major trial amendments (change in primary end point, change in sample size) are being suggested and considered by the study team, we recommend that the study statistician consider not participating in these discussions if he or she has already seen confidential outcome data from the trial; instead, a statistician who has not seen the outcome data should be consulted. The data monitoring committee also monitors the trial for toxicity, accrual, and currency of the data collection, although the primary responsibility for monitoring these items rests with the study leadership. Monitoring for these items is discussed in Chapters 17 and 18.

FIGURE 19.8 Overall survival of ovarian cancer patients treated with carboplatin-cyclophosphamide (dotted lines) or cisplatin-cyclophosphamide (dashed lines) on SWOG-8412. (A) Curves at the time of the interim analysis. (B) Curves with additional follow-up. (Adapted from Figures 1 and 3 of Alberts et al., 1992.)

POTENTIAL COSTS OF INTERIM MONITORING

There are a number of potential costs associated with interim monitoring. There are the obvious costs of the slightly increased maximum sample size mentioned previously. There are also the costs associated with performing extra analyses (e.g., gathering more up-to-date data), but these should be relatively minor given present-day computerization of data collection. We now discuss some other costs that are potentially more troubling.

Inability to Examine Longer-Term Survival Trends

When a trial stops early, there may be very little or no information about the longer-term effects of the treatments. For example, suppose a trial is designed with 4 years of accrual and 3 years of additional follow-up. If this trial stops early at an interim analysis after 3 years of accrual, then there will be no information on the survival curves after 3 years, and the information on the curves from years 2 to 3 may be scanty (as the median follow-up will be around 1.5 years). For example, suppose the trial was stopped because the new treatment (N) looked remarkably better than the standard treatment (S); in particular, the survival curve for N was well above the survival curve for S for years 1 to 3. If the trial had not been stopped, it is conceivable that from year 4 onward the survival curve for N would drop much faster than the curve for S, with the survival curves ending up in the same place at some later year. This is an example of what is known as crossing hazards. In most situations, one can continue to follow the patients after the trial is stopped to draw conclusions about longer-term survival. However, if many patients on S cross over to N, then the longer-term results will be confounded. Although crossing hazards are always a possibility, we believe that when superiority is clear it is appropriate to stop a trial and release the results, with
continued follow-up for longer-term survival. This presumes that the primary end point of the trial being used for the monitoring is clinically important; see below. With futility monitoring the considerations are the same, except in situations where the treatment effect is expected to be delayed. For example, suppose a very aggressive treatment has a short-term mortality associated with it, but with the hope of more cures. The futility interim monitoring plan for such a trial should be cautious, because a standard futility plan might lead to inappropriate early stopping for futility.

Two Examples

The National Surgical Adjuvant Breast and Bowel Project conducted the second part of NSABP B-14 to assess whether disease-free survival would be improved by the addition of 5 more years of tamoxifen (versus placebo) in patients who had already received 5 years of tamoxifen (56). The patients were randomized (for this second part of the trial) between April 1987 and April 1994, and the results were released showing that the 5 additional years of tamoxifen appeared detrimental, based on the third interim analysis in June 1995 with 88 events (76% of the total information) (56, 57). Figure 19.9(A) displays how the disease-free survival curves appeared at the time of the third interim analysis. Peto argued that this trial stopped prematurely and that with further follow-up the benefits of the tamoxifen might become apparent: "Every year almost one million women develop breast cancer, and premature certainties as to whether adjuvant tamoxifen therapy could be stopped after 5 years could lead to many unnecessary deaths" (58). As it turns out, the detrimental effects of the additional years of tamoxifen were confirmed with additional follow-up (Fig. 19.9(B), 243 events) (59).

FIGURE 19.9 Disease-free survival of breast cancer patients treated with tamoxifen (dotted lines) or placebo (dashed lines) on NSABP B-14. (A) Curves at the time of the third interim analysis. (B) Curves with additional follow-up. (Adapted from Figure 2 of Dignam et al., 1998, and Figure 1 of Fisher et al., 2001.)

The second example is POG-9006, conducted by the Pediatric Oncology Group to compare the continuous complete remission (CCR) rates of two different chemotherapy regimens ("A" and "B") in children with B-precursor acute lymphoblastic leukemia (60). With accrual 97% complete and 38% information, the trial was stopped, with regimen B having a 2-year CCR rate of 82% and regimen A a 2-year CCR rate of 70.8% (p = 0.0016) (60). With further follow-up (77% information), the statistical significance was lost (p = 0.22) and the survival curves appear to be coming together (60); see Figure 19.10. This appears to be partially due to crossing hazards.
FIGURE 19.10 Continuous complete remission curves from POG-9006; treatment A = dashed line, treatment B = dotted line. (Adapted from Figure 2 of Lauer et al., 2001.)
Inability to Estimate the Treatment Effect Precisely

A trial that stops early will have less information about the treatment effect than one that does not, leading to a wider confidence interval for the treatment effect. If the sample size is too small or the confidence interval too wide, the results may not be considered convincing by the community. In addition, the treatment effect observed in a trial that stops early for positive results will on average be higher than the true treatment effect (61, 62). Some have suggested that this is a reason not to have interim monitoring for positive results (63–65). However, a positive trial that concludes at its regularly scheduled end will also yield an estimated treatment effect that is higher than the true value (66). It is true that this type of bias is larger for a trial that stops early than for one that does not. However, on theoretical grounds, this bias is not much larger unless the trial is stopped very early (67), and empirical evidence suggests that the bias is not much of a problem (68). In addition, even though the estimated treatment effect may be biased, we can still be confident that it is nonzero (when stopping for superiority); that is, the type I error of the trial is preserved with proper interim monitoring. If one has prior information about the magnitude of the treatment effect, one can use Bayesian methods to obtain attenuated estimates of the treatment effect that may be useful regardless of whether the trial stopped early (69). In any event, we believe that releasing information to the public early about an effective treatment (or an ineffective one) is more important than being able to estimate the exact magnitude of the benefit.

Example

The Southwest Oncology Group conducted SWOG-9701 to assess whether a 12-month continuation of paclitaxel was superior to a 3-month continuation of paclitaxel for advanced ovarian cancer patients who achieved a complete response to the initial chemotherapy (70). The primary end point was progression-free survival. At an interim analysis at which approximately 15% of the total number of expected events had been observed and accrual was about two-thirds complete, the trial was stopped and the results released stating that the 12-month continuation was superior (70). The hazard ratio at the time of the interim analysis was 0.43 (values <1 favoring the 12-month arm), with 95% confidence interval 0.25 to 0.75 (70); see Figure 19.11(A). With further follow-up representing 62% of the total number of events, the hazard ratio was estimated to be 0.70, with 95% confidence interval 0.54 to 0.91 (71) (Fig. 19.11[B]).

FIGURE 19.11 Progression-free survival of ovarian cancer patients treated with 12 courses of paclitaxel (top curves) or 3 courses of paclitaxel (bottom curves) from SWOG-9701. (A) Curves at the time of interim analysis. (B) Curves with additional follow-up. (Adapted from Figure 1 of Markman et al., 2003, and slide 7 of Markman, 2006.)
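The upward bias in estimates from trials stopped early can be seen in a small simulation of a hypothetical two-stage trial with a Haybittle-Peto-style interim superiority boundary (z > 3). All parameters (true effect, stage sizes, boundary) are illustrative assumptions, not values from any trial discussed in this chapter:

```python
import random
from statistics import mean

random.seed(12345)

TRUE_EFFECT = 0.3   # hypothetical true difference in means (sd = 1 per arm)
N_PER_STAGE = 100   # patients per arm accrued before and after the interim look
SE_STAGE = (2.0 / N_PER_STAGE) ** 0.5   # se of the stage-wise difference in means
BOUNDARY_Z = 3.0    # Haybittle-Peto-style interim superiority boundary

early_estimates, all_estimates = [], []
for _ in range(5000):
    # Difference in means observed in each half of the trial
    diff1 = random.gauss(TRUE_EFFECT, SE_STAGE)
    diff2 = random.gauss(TRUE_EFFECT, SE_STAGE)
    if diff1 / SE_STAGE > BOUNDARY_Z:      # interim look: stop for superiority
        estimate = diff1                   # report the interim estimate
        early_estimates.append(estimate)
    else:                                  # continue to the planned end
        estimate = (diff1 + diff2) / 2.0
    all_estimates.append(estimate)

print(f"true effect:                          {TRUE_EFFECT}")
print(f"mean estimate, trials stopped early:  {mean(early_estimates):.3f}")
print(f"mean estimate, all trials:            {mean(all_estimates):.3f}")
```

Trials that cross the boundary report a markedly inflated effect on average, while the average over all trials is only slightly above the truth, mirroring the theoretical and empirical points made in the text.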
Inability to Study Well Some Secondary End Points

The designated primary end point of a trial is the one used to make the definitive statement concerning treatment effectiveness. There can be controversy about the appropriate primary end point, with different end points yielding different required sample sizes for the trial. For example, a trial that demonstrates a positive treatment effect on progression-free survival at its conclusion may not have a sufficient number of deaths at that time to conclusively evaluate overall survival benefits. The possibility of early stopping exacerbates this potential problem in that there may be very little information available about the nonprimary end points if the trial is stopped early. If one believes that an improvement in a non-overall-survival end point results in direct patient benefit, then there should be little argument against stopping a trial early based on extremely positive results for that end point. On the other hand, sometimes a non-overall-survival end point is used not because it directly represents
patient benefit but as a surrogate for overall survival. As a surrogate for overall survival, it may have more statistical power because events accumulate faster and because it may be less susceptible to potential confounding by treatment crossovers to the experimental arm after non-overall-survival events in the control arm. In this case it is not as clear that one would need to stop a trial early for very positive non-overall-survival treatment effects. When a non-overall-survival primary end point does not directly represent clinical benefit, a reasonable strategy is to use overall survival for the efficacy interim analysis even though the primary end point is different. To do this, one would have to be comfortable continuing a trial that showed strongly positive results in the non-overall-survival primary end point provided that the overall survival differences were not very large. Other possibilities include requiring very extreme positive results for
early stopping/release or commencing the efficacy monitoring at a late enough time point that accrual will be complete or almost complete. This may allow evaluation of the overall survival effect with further follow-up.

Two Examples

The National Cancer Institute of Canada Clinical Trials Group conducted MA-17 to assess whether 5 years of letrozole improved the outcome of postmenopausal women with breast cancer who had already received 5 years of adjuvant tamoxifen treatment (72). The primary end point was disease-free survival. The trial was stopped and the results released early when letrozole appeared superior at the first interim analysis, with 40% information, accrual complete, and a median follow-up of 2.4 years (hazard ratio = 0.57, p = 0.0008) (72). Patients on the placebo arm were offered letrozole at the time the trial results were released, so it will never be possible to estimate the treatment effect on overall survival given the immaturity of the overall survival data at the time of release (73 deaths) (72). This was one of the reasons the early stopping of this trial was criticized (73). The MA-17 study investigators responded that they considered disease-free survival an important clinical end point for their adjuvant breast cancer setting (74). The second example is the TARGET trial, which was conducted to assess whether sorafenib improved overall survival in patients with advanced clear-cell renal cancer (75). After accrual was complete, the study results were released based on an improvement in progression-free survival (hazard ratio = 0.44, p < 0.001), and 48% of the patients on the placebo arm crossed over to sorafenib treatment (75). Figure 19.12(A) shows how the survival curves appeared approximately 16 months after the study results were released (hazard ratio = 0.88, p = 0.146) (76).
Because of the large proportion of cross-overs, the investigators also performed an analysis in which patients on the placebo arm had their data censored at the time of the release of the results (Figure 19.12(B), hazard ratio = 0.78, p = 0.0287) (75). This type of analysis can somewhat mitigate the effects of the crossovers on the overall survival analysis, although the number of deaths will still be limited.
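The censoring analysis just described can be sketched as a simple data transformation. This is a hypothetical illustration only: the record fields and the per-patient release_time are assumptions for the sketch, not the actual TARGET data structure.

```python
from dataclasses import dataclass

@dataclass
class PatientRecord:
    arm: str             # "sorafenib" or "placebo" (illustrative labels)
    time: float          # months from randomization to death or last follow-up
    event: bool          # True = death observed, False = censored
    release_time: float  # months from this patient's randomization to the
                         # public release of the trial results

def censor_placebo_at_release(records):
    """Return a copy of the survival data in which placebo patients are
    censored at the release of the results (before any crossover could
    have occurred), as in the secondary analysis described in the text."""
    out = []
    for r in records:
        if r.arm == "placebo" and r.time > r.release_time:
            out.append(PatientRecord(r.arm, r.release_time, False, r.release_time))
        else:
            out.append(PatientRecord(r.arm, r.time, r.event, r.release_time))
    return out

# Illustrative data: the second placebo patient died after the release,
# so that death is replaced by censoring at the release time.
data = [
    PatientRecord("sorafenib", 30.0, True, 18.0),
    PatientRecord("placebo", 10.0, True, 18.0),
    PatientRecord("placebo", 25.0, True, 18.0),
]
for r in censor_placebo_at_release(data):
    print(r.arm, r.time, r.event)
```

As the text notes, this removes post-release (potentially crossover-contaminated) follow-up from the placebo arm, at the cost of discarding some deaths.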
FIGURE 19.12 Overall survival of renal cell cancer patients treated with sorafenib (top curves) or placebo (bottom curves) from the TARGET trial. (A) Intent-to-treat analysis. (B) Data from patients on the placebo arm censored at the time of the release of the trial results (16 months earlier). (Adapted from slides 9 and 10 of Bukowski et al., 2007.)

SUMMARY
It is important that the detailed interim monitoring plan be specified in the protocol before the trial begins. Besides avoiding the perception that results are being manipulated, this allows the study investigators to specify their intentions concerning the monitoring. As
the protocol is not a confidential document, all can assess whether they believe the superiority and futility monitoring are appropriate for the trial in question. In particular, this should forestall questions about the appropriateness of stopping when a trial crosses an interim monitoring boundary. The data monitoring committee can overrule an interim monitoring plan when other information arises during the course of the trial, for example, information from other trials that have just completed, but this is seldom done. There are some potential costs to interim monitoring, but with a carefully designed plan based on an appropriate primary analysis outcome, the benefits to the public in getting access sooner to trial results easily outweigh these costs.
References

1. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
2. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.
3. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Rev. 2nd ed. Chichester, England: Wiley & Sons; 1997.
4. Haybittle JL. Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol. 1971;44:793–797.
5. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556.
6. Freidlin B, Korn EL, George SL. Data monitoring committees and interim monitoring guidelines. Cont Clin Trials. 1999;20:395–407.
7. Morris M, Eifel PJ, Lu J, et al. Pelvic radiation with concurrent chemotherapy compared with pelvic and para-aortic radiation for high-risk cervical cancer. N Engl J Med. 1999;340:1137–1143.
8. Eifel PJ, Winter K, Morris M, et al. Pelvic irradiation with concurrent chemotherapy versus pelvic and para-aortic irradiation for high-risk cervical cancer: an update of Radiation Therapy Oncology Group trial (RTOG) 90-01. J Clin Oncol. 2004;22:872–880.
9. Romond EH, Perez EA, Bryant J, et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med. 2005;353:1673–1684.
10. Perez E, Romond E, Suman V, et al. Updated results of the combined analysis of NCCTG N9831 and NSABP B-31 adjuvant chemotherapy with/without trastuzumab in patients with HER2-positive breast cancer. J Clin Oncol. 2007;25(suppl 18; abstr 512):S6.
11. Lau WY, Leung TWT, Ho SKW, et al. Adjuvant intra-arterial iodine-131-labelled lipiodol for resectable hepatocellular carcinoma: a prospective randomised trial. Lancet. 1999;353:797–801.
12. Pocock S, White I. Trials stopped early: too good to be true? Lancet. 1999;353:943–944.
13. Leung TWT, Lau WY, Ho S, et al. Reduction of local recurrence after adjuvant intra-arterial lipiodol-iodine-131 for hepatocellular carcinoma—a planned interim analysis of a prospective randomized study. Proceedings of ASCO. 1997;16:279a (abstract 988).
14. Lau WY, Lai ECH, Leung TWT, et al. Adjuvant intra-arterial iodine-131-labeled lipiodol for resectable hepatocellular carcinoma—a prospective randomized trial—update on 5-year and 10-year survival. Ann Surg. 2008;247:43–48.
15. Wieand S, Schroeder G, O'Fallon JR. Stopping when the experimental regimen does not appear to help. Stat Med. 1994;13:1453–1458.
16. Ellenberg SS, Eisenberger MA. An efficient design for phase III studies of combination chemotherapies. Cancer Treat Rep. 1985;69:1147–1152.
17. Fleming TR, Harrington DP, O'Brien PC. Designs for group sequential tests. Control Clin Trials. 1984;5:348–361.
18. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical trials. Commu Statist C. 1982;1:207–219.
19. Freidlin B, Korn EL. Monitoring for lack of benefit: a critical component of a randomized clinical trial. J Clin Oncol. 2009;27:629–633.
20. Freidlin B, Korn EL, George SL, et al. Randomized clinical trial design for assessing noninferiority when superiority is expected. J Clin Oncol. 2007;25:5019–5023.
21. Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. J Statist Planning and Inference. 1994;42:19–35.
22. Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation regions. Biometrics. 1983;39:227–236.
23. Freidlin B, Korn EL. A comment on futility monitoring. Cont Clin Trials. 2002;23:355–366.
24. European Medicines Agency. Reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan. http://www.emea.europa.eu/pdfs/human/ewp/245902enadopted.pdf [October 2007].
25. Pocock SJ. Current controversies in data monitoring for clinical trials. Clin Trials. 2006;3:513–521.
26. Wright JR, Ung YC, Julian JA, et al. Randomized, double-blind, placebo-controlled trial of erythropoietin in non–small-cell lung cancer with disease-related anemia. J Clin Oncol. 2007;25:1027–1032.
27. Saltz LB, Niedzwiecki D, Hollis D, et al. Irinotecan fluorouracil plus leucovorin is not superior to fluorouracil plus leucovorin alone as adjuvant treatment for stage III colon cancer: results of CALGB 89803. J Clin Oncol. 2007;25:3456–3461.
28. Saltz LB, Niedzwiecki D, Hollis D, et al. Irinotecan plus fluorouracil/leucovorin (IFL) versus fluorouracil/leucovorin alone (FL) in stage III colon cancer (intergroup trial CALGB C89803). J Clin Oncol. 2004;22(suppl 14):A-3500, S245.
29. Lanciano R, Calkins A, Bundy BN, et al. Randomized comparison of weekly cisplatin or protracted venous infusion of fluorouracil in combination with pelvic radiation in advanced cervix cancer: a Gynecologic Oncology Group study. J Clin Oncol. 2005;23:8289–8295.
30. Moore MJ, Hamm J, Dancey J, et al. Comparison of gemcitabine versus the matrix metalloproteinase inhibitor BAY 12-9566 in patients with advanced or metastatic adenocarcinoma of the pancreas: a phase III trial of the National Cancer Institute of Canada Clinical Trials Group. J Clin Oncol. 2003;21:3296–3302.
31. Freidlin B, Korn EL, Gray R, et al. Multi-arm clinical trials of new agents: some design considerations. Clin Cancer Res. 2008;14:4368–4371.
32. Cowan JD, Neidhart J, McClure S, et al. Randomized trial of doxorubicin, bisantrene, and mitoxantrone in advanced breast cancer: a Southwest Oncology Group study. J Natl Cancer Inst. 1991;83:1077–1084.
33. Tjan-Heijnen VCG, Postmus PE, Ardizzoni A, et al. Reduction of chemotherapy-induced febrile leucopenia by prophylactic use of ciprofloxacin and roxithromycin in small-cell lung cancer patients: an EORTC double-blind placebo-controlled phase III study. Ann Oncol. 2001;12:1359–1368.
34. Ardizzoni A, Tjan-Heijnen VC, Postmus PE, et al. Standard versus intensified chemotherapy with granulocyte colony-stimulating factor support in small-cell lung cancer: a prospective European Organization for Research and Treatment of Cancer–Lung Cancer Group phase III trial 08923. J Clin Oncol. 2002;20:3947–3955.
35. Giantonio BJ, Catalano PJ, Meropol NJ, et al. Bevacizumab in combination with oxaliplatin, fluorouracil, and leucovorin (FOLFOX4) for previously treated metastatic colorectal cancer: results from the Eastern Cooperative Oncology Group study E3200. J Clin Oncol. 2007;25:1539–1544.
36. Giantonio BJ, Catalano PJ, Meropol NJ, et al. High-dose bevacizumab improves survival when combined with FOLFOX4 in previously treated advanced colorectal cancer: results from the Eastern Cooperative Oncology Group (ECOG) study E3200. J Clin Oncol. 2005;23(suppl 16):A-2, S1.
37. Jennison C, Turnbull BW. Sequential equivalence testing and repeated confidence intervals, with applications to normal and binary responses. Biometrics. 1993;49:31–43.
38. Durrleman S, Simon R. Planning and monitoring of equivalence studies. Biometrics. 1990;46:329–336.
39. von Minckwitz G, Raab G, Caputo A, et al. Doxorubicin with cyclophosphamide followed by docetaxel every 21 days compared with doxorubicin every 14 days as preoperative treatment in operable breast cancer: the GEPARDUO study of the German Breast Group. J Clin Oncol. 2005;23:2676–2685.
40. Jackisch C, von Minckwitz G, Eidtmann H, et al. Dose-dense biweekly doxorubicin/docetaxel versus sequential neoadjuvant chemotherapy with doxorubicin/cyclophosphamide/docetaxel in operable breast cancer: second interim analysis. Clin Breast Cancer. 2002;3:276–280.
41. Alberts DS, Green S, Hannigan EV, et al. Improved therapeutic index of carboplatin plus cyclophosphamide versus cisplatin plus cyclophosphamide: final report by the Southwest Oncology Group of a phase III randomized trial in stages III and IV ovarian cancer. J Clin Oncol. 1992;10:706–717.
42. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC; 2003, p. 118.
43. Smith MA, Ungerleider RS, Korn EL, et al. Role of independent data-monitoring committees in randomized clinical trials sponsored by the National Cancer Institute. J Clin Oncol. 1997;15:2736–2743.
44. Ellenberg SS, Fleming TR, DeMets DL. Data Monitoring Committees in Clinical Trials. Chichester, England: Wiley & Sons; 2002.
45. Slutsky AS, Lavery JV. Data safety and monitoring boards. N Engl J Med. 2004;350:1143–1147.
46. Green SJ, Fleming TR, O'Fallon JR. Policies for study monitoring and interim reporting of results. J Clin Oncol. 1987;5:1477–1484.
47. Korn EL, Hunsberger S, Freidlin B, Smith MA, Abrams JS. Preliminary data release for randomized clinical trials of noninferiority: a new proposal. J Clin Oncol. 2005;23:5831–5836.
48. Fleming TR, Sharples K, McCall J, et al. Maintaining confidentiality of interim data to enhance trial integrity and credibility. Clin Trials. 2008;5:157–167.
49. Korn EL, Hunsberger S, Freidlin B, Smith MA, Abrams JS. Comments on "Maintaining confidentiality of interim data to enhance trial integrity and credibility" by TR Fleming et al. Clin Trials. 2008;5:364–365.
50. Fleming TR, Sharples K, McCall J, et al. Response to the letter from Korn et al. Clin Trials. 2008;5:365–366.
51. Ellenberg SS, George SL. Should statisticians reporting to data monitoring committees be independent of the trial sponsor and leadership? Stat Med. 2004;23:1503–1505.
52. DeMets DL, Fleming TR. The independent statistician for data monitoring committees. Stat Med. 2004;23:1513–1517.
53. Pocock SJ. A major trial needs three statisticians: why, how and who? Stat Med. 2004;23:1535–1539.
54. Bryant J. What is the appropriate role of the trial statistician in preparing and presenting interim findings to an independent data monitoring committee in the U.S. Cancer Cooperative Group setting? Stat Med. 2004;23:1507–1511.
55. Lachin JM. Conflicts of interest in data monitoring of industry versus publicly financed clinical trials. Stat Med. 2004;23:1519–1521.
56. Fisher B, Dignam J, Bryant J, et al. Five versus more than five years of tamoxifen therapy for breast cancer patients with negative lymph nodes and estrogen receptor-positive tumors. J Natl Cancer Inst. 1996;88:1529–1542.
57. Dignam JJ, Bryant J, Wieand HS, et al. Early stopping of a clinical trial when there is evidence of no treatment benefit: protocol B-14 of the National Surgical Adjuvant Breast and Bowel Project. Cont Clin Trials. 1998;19:575–588.
58. Peto R. Five years of tamoxifen—or more? J Natl Cancer Inst. 1996;88:1791–1793.
59. Fisher B, Dignam J, Bryant J, et al. Five versus more than five years of tamoxifen for lymph node-negative breast cancer: updated findings from the National Surgical Adjuvant Breast and Bowel Project B-14 randomized trial. J Natl Cancer Inst. 2001;93:684–690.
60. Lauer SJ, Shuster JJ, Mahoney DH, et al. A comparison of early intensive methotrexate/mercaptopurine with early intensive alternating combination chemotherapy for high-risk B-precursor acute lymphoblastic leukemia: a Pediatric Oncology Group phase III randomized trial. Leukemia. 2001;15:1038–1045.
61. Pocock SJ, Hughes MD. Practical problems in interim analyses, with particular regard to estimation. Cont Clin Trials. 1989;10:209S–221S.
62. Fan X, DeMets DL, Lan KKG. Conditional bias of point estimates following a group sequential test. J Biopharm Stat. 2004;14:505–530.
63. Mueller PS, Montori VM, Bassler D, et al. Ethical issues in stopping randomized trials early because of apparent benefit. Ann Intern Med. 2007;146:878–881.
64. Bassler D, Montori VM, Briel M, et al. Early stopping of randomized clinical trials for overt efficacy is problematic. J Clin Epidemiol. 2008;61:241–246.
65. Trotta F, Apolone G, Garattini S, et al. Stopping a trial early in oncology: for patients or for industry? Ann Oncol. 2008;19:1347–1353.
66. Goodman SN. Stopping at nothing? Some dilemmas of data monitoring in clinical trials. Ann Intern Med. 2007;146:882–887.
67. Freidlin B, Korn EL. Stopping clinical trials early for benefit: impact on estimation. Clin Trials. 2009;6:119–125.
68. Korn EL, Freidlin B, Mooney M. Stopping or reporting early for positive results in randomized clinical trials: the NCI Cooperative Group experience 1990–2005. J Clin Oncol. 2009;10:1712–1721.
69. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials. J Roy Stat Soc (A). 1994;157:357–387.
70. Markman M, Liu PY, Wilczynski S, et al. Phase III randomized trial of 12 versus 3 months of maintenance paclitaxel in patients with advanced ovarian cancer after complete response to platinum and paclitaxel-based chemotherapy: a Southwest Oncology Group and Gynecologic Oncology Group trial. J Clin Oncol. 2003;21:2460–2465.
71. Markman M, Liu P, Wilczynski S, et al. Survival (S) of ovarian cancer (OC) patients (pts) treated on SWOG9701/GOG178: 12 versus (v) 3 cycles (C) of monthly single-agent paclitaxel (PAC) following attainment of a clinically-defined complete response (CR) to platinum (PLAT)/PAC. J Clin Oncol. 2006;24:257s (suppl; abstr A-5005 and oral presentation) (www.asco.org).
72. Goss PE, Ingle JN, Martino S, et al. A randomized trial of letrozole in postmenopausal women after five years of tamoxifen therapy for early-stage breast cancer. N Engl J Med. 2003;349:1793–1802.
73. Cannistra SA. The ethics of early stopping rules: who is protecting whom? J Clin Oncol. 2005;22:1542–1545.
74. Pater J, Goss P, Ingle J, et al. The ethics of early stopping rules. J Clin Oncol. 2005;23:2862–2863.
75. Escudier B, Eisen T, Stadler WM, et al. Sorafenib in advanced clear-cell renal-cell carcinoma. N Engl J Med. 2007;356:125–134.
76. Bukowski RM, Eisen T, Szczylik C, et al. Final results of the randomized phase III trial of sorafenib in advanced renal cell carcinoma: survival and biomarker analysis. J Clin Oncol. 2007;25:18s (suppl; abstr 5023 and oral presentation).
20
Interpretation of Results: Data Analysis and Reporting of Results
Donna Niedzwiecki and Donna Hollis
The clinical trials process culminates with the reporting of study results. These reports may be made in many ways, including scientific presentations and published journal articles. The reporting of clinical trials results has evolved over time, particularly with respect to the use and presentation of statistical methods. Standardization of reporting has been promoted, including uniformity of study end points, elements of data capture, content, format, and the use of graphics, to allow clearer comparisons and interpretation of results across reported trials in the same disease site. Momentum toward the goal of standardization in the design, conduct, and reporting of randomized clinical trials was provided by the pharmaceutical industry. In the 1980s, the European Community pioneered the harmonization of regulatory requirements across Europe with the goal of minimizing the length of time necessary to bring safe and effective treatments to market. Japan and the United States then became engaged in the process, expanding the international effort. The International Conference on Harmonisation (ICH), composed of representatives from regulatory agencies and industry from the participating countries, was formed in April 1990. Expert working groups have been formed to make recommendations on specific topics related to the goals of the organization. In November 1995, the ICH Steering Committee approved and recommended guidelines on the structure and content of clinical study reports for adoption by the participating regulatory
agencies (1, 2, 3). Another group, consisting of clinical trials experts, medical journal editors, and epidemiologists, first published the Consolidated Standards of Reporting Clinical Trials (CONSORT) Statement in the Journal of the American Medical Association in 1996. The CONSORT Statement addresses the issue of inadequate reporting of clinical trials results and is considered an evolving document (4, 5). Methods of data analysis and results reporting from clinical trials depend closely on the trial design and statistical analyses and, as much as possible, should be conducted according to the statistical considerations presented in the protocol document. This dependence reemphasizes the need for careful trial planning at the outset. However, not all events can be anticipated, and unexpected events that occur during the conduct of the trial must also be addressed. For example, the targeted accrual of the trial may not be met; unexpected toxicity may be observed; or statistical assumptions made at the design stage may not hold. Such occurrences potentially impact the analytic methods used and the reporting and interpretation of results. Results of data analyses are also heavily dependent on the quality of the data captured. Thus, an organized mechanism for data collection and review is essential. In this chapter, aspects of the final data analysis and how to report results are discussed according to their relevance to specific sections of the final research publication. Illustrative examples are provided.
MANUSCRIPT STRUCTURE AND CONTENT

Manuscripts reporting the results of a clinical trial contain the following general elements: title, authorship, abstract, introduction (background and significance), description of clinical and statistical methods, presentation of results, discussion, and references. Each of these elements is described below in the context of providing a clear and valid representation of the design and conduct of the trial and its outcomes.

Title and Authorship

The title should identify the main elements of the study and include relevant key words with respect to search queries. Authors should generally be listed in order of the magnitude of their contributions to the study and to the development of the manuscript, with the primary (first) author being responsible for the manuscript composition. A senior (last) author is often named; senior authorship acknowledges the role of the individual as a mentor. The roles of other authors can be determined according to their degrees and affiliations. Many journals include a specific listing of author contributions for each publication.

Abstract

The abstract is a brief summary of the background, primary goals of the trial, the methods employed in trial conduct, key results, and conclusions. The full manuscript content should be distilled to only the most important aspects of the study. The contents of the abstract should be consistent with the details provided in the manuscript. Based on the abstract, a reader should be able to understand the significance of the trial in the broader area of research for the disease under study, how the trial was conducted, the primary end points studied, and what was learned.

Introduction

The introduction provides the background and significance of the trial and contains information regarding the patient population under study, prior treatment approaches, and justification for the treatment studied in the trial.
Results from previous trials in the same or similar patient populations are referenced. Authors should provide a clear rationale for the conduct of the trial consistent with the prior data.

Methods

Detailed information regarding the study population, patient eligibility, treatment administration, clinical and laboratory methods of patient evaluation, and a
general description of the study conduct are provided in the methods section. Often, the statistical methods are included in the general methods section, though they may also appear separately as Statistical Considerations, Statistical Plan, or Statistical Analysis. The information provided in this section, again, should reflect all the specifications in the protocol and any other methods used.

Study Population

The established cancer staging criteria used in the protocol, for example, the staging system proposed by the American Joint Committee on Cancer (AJCC), should be included to define the study population (6). Staging is specific to the particular disease and classifies the severity of the disease at presentation. Uniform classification of tumors allows, at minimum, descriptive comparisons of efficacy and toxicity across studies and the potential for meta-analyses. Other, nonstandard staging systems are used in describing leukemia (7). Disease state may be further defined using known prognostic factors that depend on the cancer type, for example, Gleason score and prostate-specific antigen (PSA) level in prostate cancer and Breslow depth in melanoma. Validated prognostic indices incorporating multiple prognostic factors into one assessment are also often used to classify or stratify patients by risk. Examples include the Halabi nomogram used to classify patients with prostate cancer and the Follicular Lymphoma International Prognostic Index (FLIPI) for patients with follicular lymphoma (8, 9). With the move to targeted therapy, the identification of molecular markers, and the pharmacogenomic characterization of tumors, patients may be increasingly identified as candidates for tumor-specific treatment strategies. These should be presented in the article with references provided. Additional eligibility (inclusion and exclusion) criteria should be listed.
In addition to disease type and stage, examples of eligibility criteria include patient age, performance status (e.g., Karnofsky and Eastern Cooperative Oncology Group [ECOG]), number of prior treatment regimens, laboratory values, resection margins, and patient status on a molecular marker (10, 11). Understanding the patient population is critical to the interpretation of results. In 2002, results of a study on the use of hormone replacement therapy (HRT) among postmenopausal women indicated that the use of HRT was associated with a higher risk of certain cancers in this patient group (12). This finding led to an observed reduction in the use of HRT over the following years. In 2007 and 2008, reanalyses of the data stratifying women by time since onset of menopause
showed no increased cancer risk when time past onset of menopause was considered (13, 14).

Treatment and Administration

For treatment trials, the treatment under investigation is described per protocol, including the scheduling and dosing of each agent and the methods of administration. The length of each treatment cycle and the duration of treatment should be specified. For example, treatment may be administered for a fixed number of cycles or until a specified event such as progression of disease. Sufficient detail should be provided so that the treatment strategy can be replicated. Differences in dosing and administration of the same agent can greatly impact results in terms of efficacy or toxicity, which may require early reporting. As an example, consider the use of irinotecan in the treatment of colon cancer. Irinotecan is a topoisomerase I inhibitor with demonstrated efficacy in the treatment of metastatic colon cancer in both the first-line and second-line settings. Based on the prior evidence of efficacy, the Cancer and Leukemia Group B (CALGB) activated a randomized phase III trial of irinotecan, fluorouracil, plus leucovorin (IFL) versus fluorouracil plus leucovorin (FL) among patients with resectable stage III disease (C89803). In the IFL regimen, fluorouracil is given by bolus administration (15). A North Central Cancer Treatment Group (NCCTG) trial that was running simultaneously randomized patients with metastatic colorectal cancer to treatment with combinations of fluorouracil, irinotecan, and oxaliplatin and included an irinotecan, fluorouracil, plus leucovorin treatment arm (N9741) (16).
During the conduct of these trials, a higher than expected early (within 60 days of starting therapy) death rate was observed in the irinotecan-containing treatment arms, relative to the fluorouracil plus leucovorin control on C89803 and to the oxaliplatin, fluorouracil, plus leucovorin and oxaliplatin plus irinotecan treatment arms on N9741, resulting in temporary suspension of accrual on both trials. The 60-day all-cause death rates were 4.8% versus 1.8% on N9741 (Fisher exact p = 0.06) and 2.2% versus 0.8% on C89803 (Fisher exact p = 0.06). The lethal toxicity was reported in a letter to the editor of the New England Journal of Medicine in July 2001 (17). An independent group of investigators that was subsequently assembled to review the adverse events (AEs) identified two syndromes associated with the bolus administration of irinotecan (18). (This toxicity is in contrast to the toxicity profile observed for the administration of irinotecan in combination with intravenously administered fluorouracil plus leucovorin, the FOLFIRI regimen, though the two regimens have not been directly compared in a single trial (19).)
Treatment modifications were made as necessary to both trials. The clinical report for C89803 describes the toxicity experience (15). Though mentioned in the clinical report for N9741, the toxicity experience was more fully reported in a separate article (20). Disclosure of the complete trial conduct is essential in reporting to allow the proper interpretation of results. For nontreatment trials, details regarding the intervention(s) should be presented.

Study Conduct

Study conduct is described according to the study design given in the protocol. Studies can be classified as pilot or exploratory, safety trials, efficacy trials, or comparative trials. Other design features should be given, such as whether a crossover was employed, whether blinding was used, whether subjects were matched on prognostic factors, and whether the design was single-stage or multistage. Details regarding the standard study designs used in clinical trials in cancer (e.g., phase I, phase II, phase III) are presented in earlier chapters. The individuals or organizations responsible for trial design and analysis, data capture, and monitoring should be identified. Ethical aspects, including Institutional Review Board (IRB) approval for the protocol, patient informed consent, and monitoring by a formal data monitoring committee (if applicable), should be described.

Statistical Methods

The trial design and statistical considerations of the trial are provided in this section. Statistical methods should include the details of the trial design from the protocol: definitions of the primary and secondary end points, identification of patient subsets to be considered, stratification factors, estimation goals (precision), trial hypotheses, operating characteristics for tests of hypotheses (significance level, power), success criteria, targeted sample size, planned length of accrual and follow-up periods, planned interim analyses, and design modifications (if applicable).
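As a rough illustration of how design parameters such as significance level, power, and the targeted hazard ratio translate into a required number of events, Schoenfeld's approximation for the log-rank test can be sketched in a few lines of Python. This is a hedged sketch, not necessarily the method used for any trial cited in this chapter, and the function name is illustrative:

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.05, power=0.90, alloc=0.5):
    """Approximate number of events needed to detect hazard ratio `hr`
    with a two-sided log-rank test at significance level `alpha`.
    `alloc` is the proportion of patients randomized to one arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2
                     / (alloc * (1 - alloc) * math.log(hr) ** 2))

# Parameters like those described for CALGB 9481 later in this chapter
# (HR = 0.667, 90% power, two-sided alpha = 0.05) give roughly 257 events,
# in the same range as the 262 expected deaths cited for that design.
events = schoenfeld_events(0.667)
```

Actual designs typically inflate this events target further to set a patient sample size, using assumed accrual, follow-up, and event rates.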
Detailed lists of recommended items for inclusion in reports of clinical trial results are provided by the Asilomar working group and the revised CONSORT statement (21, 22). The goal of the trial should be stated; for example, specify whether the trial is hypothesis-testing, hypothesis-generating, or a screening trial. Data analysis methods are also presented. These should include all statistical tests used, even those that are not specified in the protocol. Examples are the chi-square, Wilcoxon, or t-test to compare patient characteristics between treatment arms or subgroups based on sociodemographic and disease characteristics.
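For instance, a chi-square comparison of a binary patient characteristic between two arms can be sketched as follows. This is a minimal stdlib-only illustration (real analyses would typically use a statistics package), and the function name is ours:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (df=1, no continuity correction) for a
    2x2 table: rows are arms, columns are levels of the characteristic.
    Returns (statistic, p-value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function is
    # P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

A perfectly balanced table yields a statistic of 0 and p = 1; larger imbalances drive the p-value down.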
ONCOLOGY CLINICAL TRIALS
Statistical estimates should be defined, in particular time-to-event end points, for which definitions lack consistency across the literature. Time-to-event end points in clinical trials include overall survival (OS), progression-free survival (PFS), time-to-progression (TTP), time-to-treatment failure (TTF), disease-free survival (DFS), and others. The starting time from which the time-to-event end point is measured (e.g., study entry or time of randomization) and the outcomes included as events (e.g., death or progression) should be provided. Patients who do not experience an event are considered “censored,” that is, observed only to a known time point prior to an event. For example, PFS is often measured from time of study entry until documented progression of disease or death from any cause but is sometimes defined as the time from study entry until documented progression or death with documented progression. Under the first definition, censored patients consist of the group alive without progression at the last observed follow-up time. Under the second definition, censored patients consist of the group alive without progression or dead without documented progression at the last observed follow-up time. To reduce confusion, one naming convention might be to include the term survival in the name when all-cause mortality is an event (e.g., DFS), and to omit the term survival in the name when death is either not considered as an event or is considered as an event only if certain conditions exist, such as death with documented progression (e.g., TTP). The inclusion of death from any cause as an event is more conservative and is preferred in cases where it is difficult to determine if progression has occurred. In general, clearly defined estimates given in an article are potentially useful to other investigators for future study development in the same disease area. Follow-up time can also be estimated in several ways and is important in the interpretation of long-term results.
Estimates based on data at time points well beyond the median follow-up may be imprecise. Schemper et al. describe and compare several methods of quantifying follow-up. They conclude that the Kaplan-Meier estimate of potential follow-up, obtained by reversing the role of the censoring variable in a Kaplan-Meier analysis, is the preferred method (23). Other authors recommend use of median follow-up among subjects who are alive at the time of analysis (24). Whatever method is used to estimate follow-up, it should be described in the statistical methods and the estimate reported under Results. The median follow-up time is also often used as a measure of data completeness though its interpretation is limited in this context. A more useful method for quantifying data completeness, proposed by Clark et al., is discussed under Results below (25).
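The reverse Kaplan-Meier estimate of potential follow-up is obtained by swapping the roles of the event and censoring indicators. A minimal sketch (function names are illustrative, not from any particular package):

```python
def km_median(times, events):
    """First time at which the Kaplan-Meier survival estimate drops to
    0.5 or below (None if the median is not reached).
    `events`: 1 for an observed event, 0 for a censored observation."""
    # Sort by time; at tied times, process events before censorings.
    data = sorted(zip(times, events), key=lambda te: (te[0], -te[1]))
    surv, at_risk = 1.0, len(data)
    for t, e in data:
        if e:
            surv *= (at_risk - 1) / at_risk  # product-limit step at each event
            if surv <= 0.5:
                return t
        at_risk -= 1
    return None

def potential_followup(times, deaths):
    """Schemper's reverse Kaplan-Meier estimate of median potential
    follow-up: censoring becomes the 'event' and deaths are censored."""
    return km_median(times, [1 - d for d in deaths])
```

Because deaths are treated as censored in the reversed analysis, the estimate reflects the follow-up patients would have had if none had died.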
Often analyses are conducted and reported in more than one patient subgroup. The group or subgroup of patients in which the primary end point is studied should be clearly defined. For example, specify if an analysis is intent-to-treat (ITT). In an ITT analysis, all patients are included in the statistical analysis according to randomization assignment regardless of the treatment received (26). Primary and planned analyses should be distinguished from secondary or unplanned analyses.

Results

The Results section relates the actual experience of the clinical trial and the results of the statistical analyses. Results associated with the primary hypothesis are the main focus; results associated with secondary end points or unplanned analyses must be clarified as such. The appropriate test statistics must be used. Only facts relating to the conduct of the trial and the analytic results should be presented in this section. Qualifying and descriptive statements can be made in the Discussion. Describe whether the trial successfully completed its research goals. For example, a trial may conclude early due to lack of efficacy according to protocol or unexpectedly due to unanticipated toxicity or relevant information obtained from external trials. If difficulties were encountered in the conduct of the trial, these events and their impact on the trial should be provided in an administrative summary.

Patient Characteristics

State the actual number of patients enrolled on each stratum or randomized to each treatment arm at the outset, specifying the time period between activation (or first patient enrolled) and end of accrual (last patient enrolled). Patient flow can be illustrated in a diagram of study subjects and is required by some journals (27). The patient flow diagram provides the numbers of enrolled patients, eligible patients, patients who withdrew or who received protocol therapy, and patients included in the primary statistical analyses.
Other information such as the number of patients completing treatment by treatment arm may also be summarized in this chart. A flow diagram, as recommended by CONSORT, of study subjects from C89803 is provided in Figure 20.1. For trials with a time-to-event end point and longer median event time, such as OS in an adjuvant trial, a determination of data completeness is useful in interpreting results. More complete data are generally associated with greater reliability. Clark et al. proposed a simple measure of completeness of follow-up (C) based on the observed and potential person-time of
Enrolled (n=1264) and randomized:

                                     5FU/LV       CPT-11/5FU/LV
Randomized                           n=629        n=635
Ineligible                           12           6
  Metastatic disease                 6            1
  Rectal cancer                      2            4
  Positive margins                   2            1
  Labs outside limits                1            0
  Too long postoperative             1            0
Not treated                          4            3
Completed treatment                  497          450
Median time on treatment             204 days     189 days
Ended treatment early due to:
  Progressive disease                25           15
  Adverse event                      28           47
  Death                              6            18
  Patient withdrawal                 41           84
  Other disease                      3            1
  Other treatment                    6            11
  No reason given                    19           6
Lost to follow-up                    14           16
Median follow-up (alive)             4.74 years   4.76 years
Completeness of follow-up            83%          83%
Analyzed                             n=629        n=635
  Dead with disease                  145          153
  Dead without disease               26           28
  Alive with disease                 56           69
  Alive without disease              402          385

FIGURE 20.1 CONSORT flow diagram illustration of patient experience by treatment arm for patients randomized on CALGB 89803.
TABLE 20.1
Measures of Follow-Up and Completeness of Overall Survival for Patients Studied on CALGB 9581.

Sample size: 1,738
Number of events: 359
Accrual period (dates of first and last patient enrolled): August 18, 1997–May 31, 2002
Study termination date: March 31, 2009
Median follow-up (range) in years+:
  Maximum follow-up: 8.7 (6.8–11.6)
  Potential follow-up (T2): 8.3 (0.2–11.6)
  Observed follow-up (T1): 6.9 (0.0–11.5)
  Reverse Kaplan-Meier (95% CI): 7.5 (7.4, 7.6)
  Median among patients still alive: 7.3
Complete individuals (survival current to within 13 months): 1,142 (65.7%)
Clark’s completeness measure (C): 79.9%
r (death rate: number dead/number patients/mean T1): 0.03
Wu’s completeness measure (C*): 81.1%

+ Maximum follow-up is defined as the time from study entry to the end of the study whether or not an event occurred; potential follow-up is defined as the time from study entry to an event time or to study termination; observed follow-up is defined as the time from study entry to the date of last contact whether or not an event occurred.
follow-up (25). To adjust for the effect of unobserved patient deaths when computing the potential person-time of follow-up, this concept was modified by Wu et al. (C*) (28). These quantities were computed for the survival data from CALGB 9581, a phase III study of treatment with monoclonal antibody 17–1A versus observation among patients with stage II colon cancer (29). Table 20.1 provides the values of the measures,
C and C*, and other descriptive measures of follow-up. Enrollment on C9581 occurred between August 1997 and May 2002. Using the maximum potential follow-up from the date of first enrollment until March 2009, the unadjusted potential follow-up was approximately 11.5 years. A graphical illustration of completeness, observed survival time by time from study entry, is provided in Figure 20.2.
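Clark's completeness measure can be computed directly from per-patient follow-up as the ratio of observed to potential person-time. A hedged sketch (variable names are illustrative):

```python
def clark_completeness(followup, dead, entry_to_termination):
    """Clark's completeness of follow-up, C (ref 25).

    followup: observed time from entry to death or last contact
    dead: 1 if the patient died, else 0
    entry_to_termination: time from entry to the study termination date

    A dead patient's potential follow-up ends at death; a living
    patient's extends to study termination."""
    observed = sum(followup)
    potential = sum(f if d else t
                    for f, d, t in zip(followup, dead, entry_to_termination))
    return observed / potential
```

A cohort followed to death or to study termination with no losses yields C = 1; losses to follow-up pull C below 1.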
FIGURE 20.2 Observed survival time by time of recruitment from the start of the study for 1,738 patients enrolled on CALGB 9581. Completeness is illustrated by the cluster about the line representing the maximum potential follow-up time. The Clark completeness measure for these data is 79.9%. See Table 20.1.
Describe the patient population with respect to demographic attributes, disease and tumor characteristics, data captured at study entry, and treatment, if there is more than one treatment stratum or arm. Descriptive statistics for these variables are best presented in table format.

Adverse Events

The toxicity experience should be reported by treatment arm or stratum. AEs should be reported by treatment actually received. Usually toxicity of grade 3 or greater is reported, but lower-grade toxicity may be reported as relevant.

Efficacy

Describe the number of interim analyses conducted and the impact of these analyses on the trial, if applicable. For example, a trial may be stopped at an interim analysis due to efficacy or futility, which may affect the interpretation of the results. The final analysis of a trial is impacted when interim analyses are conducted and, in particular, when a trial is stopped early due to crossing an interim boundary. The interim looks must be considered in the reporting of the final results (30, 31). Montori et al. provide a review of randomized trials stopped early for benefit and conclude that information about the decision to stop a trial early is not adequately presented by authors and that such results should be viewed with caution (32). Be explicit about the test of hypothesis associated with each reported p-value. One way to accomplish this is to specify the test statistic with the p-value in parentheses. Footnotes and text may also be used to convey this information. For time-to-event end points, estimated proportions of patients free of the event are often given at a landmark time such as 1, 3, or 5 years. Clarify whether estimates are binomial proportions or Kaplan-Meier estimates, and also clarify whether the test of hypothesis compares the proportions of patients event-free at the specified number of years (binomial test of proportion) or is associated with a survival analysis such as the log-rank test or Cox regression.
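The distinction between a binomial landmark proportion and a Kaplan-Meier estimate at the same landmark can be sketched as follows. This is an illustrative sketch (function names are ours); the binomial version counts only patients who are evaluable at the landmark:

```python
def km_at(times, events, t_star):
    """Kaplan-Meier estimate of the event-free probability at t_star.
    `events`: 1 for an observed event, 0 for a censored observation."""
    data = sorted(zip(times, events), key=lambda te: (te[0], -te[1]))
    surv, at_risk = 1.0, len(data)
    for t, e in data:
        if t > t_star:
            break
        if e:
            surv *= (at_risk - 1) / at_risk  # product-limit step
        at_risk -= 1
    return surv

def binomial_event_free(times, events, t_star):
    """Binomial landmark proportion: patients censored before t_star are
    not evaluable and are excluded from numerator and denominator."""
    evaluable = [(t, e) for t, e in zip(times, events)
                 if (e == 1 and t <= t_star) or t > t_star]
    return sum(1 for t, e in evaluable if t > t_star) / len(evaluable)
```

The two estimates generally differ in censored data: the Kaplan-Meier estimate uses partial information from patients censored before the landmark, while the binomial proportion discards them.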
When estimating the binomial proportion of patients surviving to a landmark time point, only patients who have achieved the end point prior to the landmark time or have follow-up greater than the landmark time should be included in the analysis. In trial design, stratified randomization is often used to control the distribution of factors, between the treatment arms being compared, that are known prognostic variables. If patients are stratified at randomization, a stratified analysis of the end point should be
presented (33). That is, the stratification factors must be considered in the comparison. The p-value associated with an unstratified test should be less extreme. Multivariable analyses are often presented. For example, if a treatment difference is observed, it might be of interest to show that the treatment difference is not explained by other prognostic factors that were not considered in the primary analysis as stratification factors. In conducting such analyses, the size of the dataset and the relationships among the explanatory factors must be considered. When multiple hypothesis testing is conducted on the same data set, the probability of finding a significant result is increased. If the overall type I error rate is not controlled, caution should be used in interpreting the significance of results.

Example

CALGB 9481 was a phase III randomized trial of hepatic arterial infusion (HAI) versus systemic therapy for patients with hepatic metastases from colorectal cancer (34). The trial enrolled 135 patients between January 1996 and December 2000. Patients were randomized with equal probability to treatment with HAI or systemic therapy and were stratified at randomization by percent liver involvement, presence of synchronous disease, and prior chemotherapy. The trial was designed to detect a hazard ratio of 0.667 for OS in favor of HAI with 90% power (two-sided α = 0.05). The original targeted sample size was 340 patients to be enrolled over 4.5 years with a 1.5-year follow-up period and 262 deaths expected at the time of the final analysis. The results from this trial illustrate some of the issues that may be encountered in the analysis and reporting of data from clinical trials.

Poor Accrual

Patient accrual to this trial was slower than expected, 29 actual versus 75 projected patients per year. The slower rate of accrual was attributed to physician prejudice with respect to randomizing patients to treatment with HAI.
In August 1999, the trial was amended to decrease the targeted sample size from 340 to 147 patients (262 to 112 expected deaths) and increase the follow-up time from 1.5 to 3 years. (The National Cancer Institute [NCI] now requires that the accrual rate be monitored for all NCI-funded trials and, if feasible, that the statistical considerations be amended to address slow accrual.) No formal interim analyses of the primary end point were conducted prior to the amendment. This amendment was reviewed and approved by the CALGB Data and Safety Monitoring Board (DSMB). However, due to
continued lagging accrual, the trial was closed by the DSMB in December 2000 with 135 patients enrolled. Data monitoring by the CALGB DSMB continued until November 2001 according to the prespecified monitoring plan for OS. The decision to close the trial due to slow accrual did not adversely impact study results and, due to size limitations, information regarding the slow accrual and the consequent amendment to the protocol was not presented in the paper. However, reporting the difficulty in patient enrollment to this trial may have been helpful to investigators planning a similar study.

Stratified Analysis

Stratified randomization can be useful in small trials when prognostic factors are known and in large trials with planned interim monitoring (35). However, misinterpreted and poorly chosen stratification factors can adversely impact the ability to capture these data and their usefulness in the final analysis. In C9481, 20.5% of patients classified at randomization as <30% liver involvement and 4.4% of patients classified as ≥30% liver involvement, respectively, were misclassified as determined by comparison with the actual percent liver involvement reported independently. Similarly, 21% of patients classified at randomization as having no synchronous disease were misclassified. Finally, 97% of patients enrolled (n = 123) had received no prior chemotherapy, and this stratification variable could, therefore, not be considered in a stratified analysis. Because of these discrepancies, results of the unstratified log-rank test for OS were reported in the paper as the primary analysis (log-rank p = 0.0034).
Under the proportional hazards model, an unstratified test comparing OS by treatment arm for these data results in a p-value of 0.0038. Using the actual values for percent liver involvement and synchronous disease in place of the stratification factors, there was a trend toward improved OS with limited percent liver involvement, defined as percent liver involvement <30% or ≥30% and <70% (Figure 20.3; p = 0.095). The presence of synchronous disease was not associated with OS. The p-value associated with treatment arm decreases when both variables, synchronous disease and percent liver involvement, are considered as covariates (p = 0.0021), primarily due to the association between percent liver involvement and OS. This outcome also illustrates how misclassification can bias results toward the null, since the adjusted p-value for treatment arm based on the two stratification factors as reported is p = 0.006 (36).

Multivariable Analyses

Multivariable analyses are often conducted to determine if the significance of treatment arm can be explained by other factors potentially associated with outcome. In C9481, the following factors were considered in multivariable analyses: age at study entry, performance status (0; 1, 2), location of primary (colon; rectum), number of liver lesions (<5; ≥5), weight loss within 3 months of trial entry (no; yes), nonprotocol therapy postprogression (no; yes), LDH (continuous measurement), baseline CEA (continuous measurement), WBC (<10,000; ≥10,000 K/μL), alkaline phosphatase (<300; ≥300 U/L), and albumin (<4; ≥4 g/dL). Two issues must be considered
FIGURE 20.3 Kaplan-Meier plot of OS by actual percent liver involvement (<30%; ≥30% and ≤70%) for 135 patients randomized on CALGB 9481 (p = 0.095).
in the interpretation of these results: first, the sample size (overfitting), and second, potential collinearity among the covariates. To avoid overfitting, Harrell et al. recommend that the number of degrees of freedom in the model, p, should not exceed m/10, where m denotes the number of uncensored observations (37). Including the stratification factors, this rule of thumb is not met in C9481 (p = 13 > m/10 = 125/10 = 12.5). The authors further recommend that if p exceeds m/10, data reduction methods should be employed. However, added complexity exists in this dataset, as strong associations were observed among many of the covariates under study. These associations, particularly those between percent liver involvement and the laboratory values, can be anticipated based on the science. Thus, only the two-factor models including treatment arm and each of the covariates were considered. In the interpretation of results, the goal and methods of conducting the multivariable analysis must be considered. Most reported results will require validation.

Discussion and Study References

The discussion should put the study results in context, comparing and contrasting with data available from other trials. Qualifying statements can be made with respect to study results indicating the strengths and weaknesses of the research. Suggestions for extending the research can also be made. References should be provided for all nonoriginal work.
References

1. Lewis JA. Statistical principles for clinical trials (ICH E9): an introductory note on an international guideline. Stat Med. 1999;18:1903–1904.
2. ICH Topic E3. Structure and content of clinical study reports. MCA EuroDirect Publication No. 137/95. EuroDirect Publications Office, London; 1995.
3. ICH Expert Working Group. Statistical principles for clinical trials. Stat Med. 1999;18:1905–1942.
4. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet. 2001;357(9263):1191–1194.
5. Altman DG, Schulz KF, Moher D, et al. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med. 2001;134(8):663–694.
6. American Joint Committee on Cancer. AJCC Staging Manual. 6th ed. New York: Springer Science+Business Media; 2002.
7. O’Brien S, Keating MJ. Chronic lymphoid leukemias. In: DeVita VT, Hellman S, Rosenberg SA, eds. Cancer: Principles and Practice of Oncology. 7th ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2005:2121–2143.
8. Halabi S, Small EJ, Kantoff PW, et al. Prognostic model for predicting survival in men with hormone-refractory metastatic prostate cancer. J Clin Oncol. 2003;21:1232–1237.
9. Leonard JP. The “FLIPI” is no “FLOPI.” Blood. 2004;104:1233–1234.
10. Schag CC, Heinrich RL, Ganz PA. Karnofsky performance status revisited: reliability, validity, and guidelines. J Clin Oncol. 1984;2:187–193.
11. Oken MM, Creech RH, Tormey DC, et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol. 1982;5:649–655.
12. Rossouw JE, Anderson GL, Prentice RL, et al, Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin for healthy postmenopausal women. JAMA. 2002;288:321–333.
13. Rossouw JE, Prentice RL, Manson JE, et al. Postmenopausal hormone therapy and risk of cardiovascular disease by age and years since menopause. JAMA. 2007;297(13):1465–1477.
14. Grodstein F, Manson JE, Stampfer MJ, Rexrode K. Postmenopausal hormone therapy and stroke: role of time since menopause and age at initiation of hormone therapy. Arch Intern Med. 2008;168:861–866.
15. Saltz LB, Niedzwiecki D, Hollis D, et al. Irinotecan fluorouracil plus leucovorin is not superior to fluorouracil plus leucovorin alone as adjuvant treatment for stage III colon cancer: results of CALGB 89803. J Clin Oncol. 2007;25:3456–3461.
16. Goldberg RM, Sargent DJ, Morton RF, et al. A randomized controlled trial of fluorouracil plus leucovorin, irinotecan, and oxaliplatin combinations in patients with previously untreated metastatic colorectal cancer. J Clin Oncol. 2004;22:23–30.
17. Sargent D, Niedzwiecki D, O’Connell MJ, Schilsky R. Recommendation for caution with irinotecan fluorouracil and leucovorin for colorectal cancer. N Engl J Med. 2001;345(2):144–145.
18. Rothenberg ML, Meropol NJ, Poplin EA, Van Cutsem E, Wadler S. Mortality associated with irinotecan plus bolus fluorouracil/leucovorin: summary findings of an independent panel. J Clin Oncol. 2001;19:3801–3807.
19. de Gramont A, Bosset JF, Milan C, et al.
Randomized trial comparing monthly low-dose leucovorin and fluorouracil bolus with bimonthly high-dose leucovorin and fluorouracil bolus plus continuous infusion for advanced colorectal cancer. J Clin Oncol. 1997;15:808–815.
20. Goldberg RM, Sargent DJ, Morton RF, et al. Randomized controlled trial of reduced-dose bolus fluorouracil plus leucovorin and irinotecan or infused fluorouracil plus leucovorin and oxaliplatin in patients with previously untreated metastatic colorectal cancer: a North American intergroup trial. J Clin Oncol. 2006;24(21):3347–3353.
21. The Asilomar Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature. Checklist of information for inclusion in reports of clinical trials. Ann Intern Med. 1996;124(8):741–743.
22. Moher D, Schulz KF, Altman D, for the CONSORT Group. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA. 2001;285(15):1987–1991.
23. Schemper M, Smith T. A note on quantifying follow-up in studies of failure time. Control Clin Trials. 1996;17:343–346.
24. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. Boca Raton, FL: CRC Press; 2002.
25. Clark TG, Altman DG, De Stavola BL. Quantification of the completeness of follow-up. Lancet. 2002;359:1309–1310.
26. Lachin JM. Statistical considerations in the intent-to-treat principle. Control Clin Trials. 2000;21(3):167–189.
27. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep. 1985;69(1):1–3.
28. Wu Y, Takkenberg JM, Grunkemeier GL. Measuring follow-up completeness. Ann Thorac Surg. 2008;85:1155–1157.
29. Colacchio TA, Niedzwiecki D, Compton C, et al. Phase III trial of adjuvant immunotherapy with MOAb 17–1A following resection for stage II adenocarcinoma of the colon (CALGB 9581). J Clin Oncol, 2004 ASCO Annual Meeting Proceedings (post meeting ed.). 2004;22(14S):3522.
30. Todd S, Whitehead A, Stallard N, Whitehead J. Interim analyses and sequential designs in phase III studies. Br J Clin Pharmacol. 2001;51:394–399.
31. Whitehead J. The Design and Analysis of Sequential Clinical Trials. New York: Halsted Press/John Wiley & Sons; 1983:121–123.
32. Montori VM, Devereaux PJ, Adhikari NKJ, et al. Randomized trials stopped early for benefit: a systematic review. JAMA. 2005;294:2203–2209.
33. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer. 1976;34:585–612.
34. Kemeny NE, Niedzwiecki D, Hollis D, et al. Hepatic arterial infusion versus systemic therapy for hepatic metastases from colorectal cancer: a randomized trial of efficacy, quality of life, and molecular markers (CALGB 9481). J Clin Oncol. 2006;24:1395–1403.
35. Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI. Stratified randomization for clinical trials. J Clin Epidemiol. 1999;52:19–26.
36. Greenland S. The effect of misclassification in the presence of covariates. Am J Epidemiol. 1980;112:564–569.
37. Harrell FE, Lee KL, Mark DB. Tutorial in biostatistics, multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–387.
21
Statistical Considerations for Assessing Prognostic Factors in Oncology
Susan Halabi
Prognosis plays a fundamental role in medical decision making and patient management. The assessment of prognostic factors, which relate baseline clinical and experimental variables to outcomes, independent of treatment, is one of the major objectives in clinical research. In contrast, predictive factors describe the interaction between a factor and the treatment in predicting outcome, and it is implicit that the treatment is effective. In oncology the variability in outcome may be related to prognostic factors rather than to differences in treatments. Historically, the impetus for the identification of prognostic factors has been the need to accurately estimate the effect of treatment adjusting for these variables. This chapter begins with a brief review of prognostic factors, and subsequently offers a general discussion of the importance of prognostic factors. Within this context, it summarizes the types of prognostic factors and describes the significance of study design. Next it presents various modeling methods for identifying these factors. Finally, the relative values of different validation approaches are presented.
IMPORTANCE OF PROGNOSTIC FACTOR STUDIES

There are several reasons why prognostic factors are important (1, 2). First, by determining which variables are prognostic of outcomes, insights are gained on the
biology and the natural history of cancer. Second, patients and their families are informed about the risk of recurrence or death. Third, appropriate treatment strategies may be optimized based on the prognostic factors of an individual patient (1, 2). Finally, prognostic factors play an important role in the design, conduct, and analysis of future clinical trials. Regression models that relate baseline clinical and experimental factors (such as treatment) to endpoints, such as overall survival (OS), are called prognostic models. Although individual prognostic factors are useful in predicting outcomes, investigators may be interested in constructing classification schemes. Combining multiple prognostic variables to form a prognostic index or score is a powerful strategy that allows for the identification of groups of patients with differing risks of progression or death. For this reason, prognostic models have been widely used and will continue to be utilized in the design, conduct, and analysis of controlled trials in cancer. Risk models are employed for screening patients for eligibility on recent trials (3). An example is CALGB 90203, a neoadjuvant phase III trial in which 750 men at high risk are randomized to either prostatectomy or docetaxel plus hormones followed by prostatectomy. High-risk men are considered eligible to participate on this trial if their predicted probability of being disease-free 5 years after surgery is <60% on the basis of the Kattan nomogram (3).
ONCOLOGY CLINICAL TRIALS
In addition, researchers may employ prognostic models in stratified randomized trials. As patient outcomes often depend on prognostic factors, randomization helps balance such factors across treatment assignments. Some imbalances may nevertheless occur by chance. One strategy to limit this effect is to use blocked randomization within predefined combinations of the prognostic factors (strata). In a recently designed trial, randomization was prospectively stratified by the predicted survival probability based on a prognostic model: in CALGB 90401, a phase III trial that assigned 1,050 men with castrate-resistant prostate cancer (CRPC) to receive either docetaxel or docetaxel plus bevacizumab, randomization was stratified by the predicted survival probability at 24 months: <10%, 10–29.9%, or ≥30% (4). Adjusting for prognostic factors in order to avoid bias in estimating the treatment effect is important even if the baseline factors are balanced between treatment groups. The need to adjust for prognostic factors is more critical, though, when the randomization is unbalanced. In a randomized phase III trial, 127 asymptomatic men with CRPC were randomized in a 2:1 ratio to either sipuleucel-T or placebo. The primary and secondary endpoints were time to progression and OS, respectively. In the multivariable analysis of OS, Small et al. identified five clinical variables (lactate dehydrogenase [LDH], prostate-specific antigen [PSA], number of bone metastases, body weight, and localization of disease) that were highly prognostic of OS in this study cohort. To correct for any potential imbalances in prognostic factors, the treatment effect of sipuleucel-T was adjusted using these variables.
The observed hazard ratio (HR) was estimated to be 2.12 (95% CI = 1.31–3.44) (5). The above examples demonstrate the importance of prognostic factors and the applicability of such models in the design, conduct, and analysis of clinical trials in oncology.
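The blocked randomization within strata described above can be sketched in a few lines of Python. The block size, arm labels, and stratum labels below are illustrative assumptions, not details taken from CALGB 90401.

```python
import random

def blocked_randomization(n_blocks, block_size=4, arms=("A", "B"), rng=None):
    """Generate a blocked randomization list for one stratum.

    Each block contains an equal number of assignments to each arm, so
    treatment groups stay balanced within every stratum at all times.
    """
    rng = rng or random.Random()
    per_arm = block_size // len(arms)
    assignments = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm   # equal allocation within the block
        rng.shuffle(block)             # random order within the block
        assignments.extend(block)
    return assignments

# One randomization list per stratum, e.g. predicted 24-month survival
# strata of <10%, 10-29.9%, >=30% (labels illustrative):
strata = ["<10%", "10-29.9%", ">=30%"]
lists = {s: blocked_randomization(n_blocks=5, rng=random.Random(i))
         for i, s in enumerate(strata)}
```

After any complete block, the two arms are exactly balanced within each stratum, which is the property the text appeals to.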
TYPES OF PROGNOSTIC FACTORS According to Gospodarowicz et al., prognostic factors are classified as tumor-related, host-related, or environmental factors (6). Tumor-related factors are variables that are related to the presence of the tumor and reflect tumor pathology, anatomic disease extent, or tumor biology. Examples of tumor-related factors are histologic grade, TNM stage, tumor markers, and molecular markers (e.g., overexpression of p53). Host-related factors do not relate to the malignancy itself but may influence outcomes; examples include demographics, comorbidity, performance status, and compliance. Environmental factors are external factors that may be related to an individual patient, such as physician expertise, access to healthcare, education, and availability of cancer control programs. All these factors are important, as they have an impact on clinical outcomes. Figure 21.1 presents the relationship between host, tumor, and environmental factors and clinical outcomes. The reader is referred to Gospodarowicz et al. for a more detailed discussion of the different types of prognostic factors. The focus in this chapter is on
[Figure 21.1 depicts host factors (race, age, performance status, body mass index, comorbidity, etc.), tumor factors (PSA, LDH, Gleason score, alkaline phosphatase, etc.), and environmental factors (education, health insurance, etc.) all feeding into clinical outcomes (overall survival, progression-free survival, objective response rate, toxicity).]
FIGURE 21.1 Diagram showing the relationship between host, tumor, and environmental prognostic factors and clinical outcomes.
21 STATISTICAL CONSIDERATIONS FOR ASSESSING PROGNOSTIC FACTORS IN ONCOLOGY
factors that are relevant at the time of diagnosis or initial treatment.
STUDY DESIGN The literature is rich in articles related to prognostic factors but, despite their abundance, results may conflict on the importance of certain markers in predicting outcomes. Information on prognostic factors in the accessible literature needs to be accurate, reliable, and consistent so that the two critical questions of “whom to treat” and “how to treat each individual” can be addressed (7). General principles and methods for the assessment and evaluation of prognostic factors are not as well developed as clinical trial methodology. Recently, guidelines (REMARK) have been developed to improve the quality and reporting of prognostic factor studies in cancer (8). Most prognostic factor studies are based on retrospective analyses of data sets that have a small sample size or sparse data and, as a result, poor data quality (1). As with any scientific study, investigators planning a prognostic factor analysis should start with a primary hypothesis, the endpoint should be specified a priori, and the sample size or power computations should be justified. As suggested by Altman, the sample size should be large in order to account for the multitude of potential biases that may arise in conducting such studies (1). These issues are related to multiple comparisons of variables, selection of variables, comparisons of different models, and missing data in the variables or outcomes (1).
Several papers in the literature have considered the sample size required for prognostic studies (9–11). To examine the role of a prognostic factor, the sample size needs to be justified. For example, investigators in a CALGB study were interested in determining whether the presence of autoimmunity at the beginning of therapy is associated with improved OS in metastatic renal cell carcinoma (MRCC) patients treated with interferon alpha with or without bevacizumab, controlling for treatment arm (12). To design this study, the prevalence of positive autoimmunity needs to be known. In this study, a subject was considered positive for autoimmunity if any of the antibody assays met the definition of positive either at baseline or during time on interferon alpha treatment. Five hundred eighty-nine of the 732 patients enrolled on CALGB 90206 had plasma samples collected at baseline. Using the sample size formula from Schmoor et al. (9), the minimum HR that can be detected was computed and is presented in Table 21.1. The following assumptions were made: OS follows an exponential distribution, 80% power, a two-sided type I error rate of 0.05, positive autoimmunity decreases the hazard of an event, prevalence of positive autoimmunity of 0.05–0.20, correlation of 0–0.6, and event rates of 75–90%. As for model building, a general ad hoc rule of thumb is to use 10 subjects per variable for a binary endpoint (such as objective response) and 15 events (deaths) per variable for time-to-event endpoints (10). For predictive factor studies, sample size computation should be formally justified based on a test of interaction between the prognostic factor and the treatment (11).
TABLE 21.1
Minimum Detectable HR under a Range of Assumptions for the Prevalence of Positive Autoimmunity and Correlation, with Two-Sided Type I Error Rate of 0.05 and 80% Power. Columns give the proportion of events observed among the 589 patients with plasma available.

PREVALENCE OF          CORRELATION WITH     75%       80%       90%
POSITIVE AUTOIMMUNITY  OTHER FACTORS        EVENTS    EVENTS    EVENTS
5%                     0                    0.542     0.553     0.572
                       0.3                  0.526     0.537     0.557
                       0.6                  0.465     0.477     0.497
10%                    0                    0.641     0.650     0.667
                       0.3                  0.627     0.637     0.653
                       0.6                  0.574     0.584     0.602
20%                    0                    0.716     0.724     0.737
                       0.3                  0.705     0.713     0.727
                       0.6                  0.659     0.668     0.683
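The entries of Table 21.1 can be reproduced with a short computation based on the Schmoor et al. style sample-size formula for a binary prognostic factor in survival analysis. The normal quantiles (1.96 for a two-sided α of 0.05, 0.84 for 80% power) and the use of n × event rate as the expected number of events are assumptions consistent with the stated design, not details quoted from the protocol.

```python
import math

def min_detectable_hr(n, event_rate, prevalence, correlation,
                      z_alpha=1.959964, z_beta=0.841621):
    """Minimum detectable (protective) hazard ratio for a binary prognostic
    factor, in the spirit of Schmoor et al. (Stat Med 2000):
    |log HR| = (z_alpha + z_beta) / sqrt(D * p * (1 - p) * (1 - R**2)),
    where D is the expected number of events, p the factor prevalence,
    and R its correlation with the other covariates."""
    d = n * event_rate                                        # expected events
    variance_term = prevalence * (1 - prevalence) * (1 - correlation ** 2)
    log_hr = (z_alpha + z_beta) / math.sqrt(d * variance_term)
    return math.exp(-log_hr)   # HR < 1: the factor decreases the hazard

# Reproduces the first cell of Table 21.1 (~0.542):
hr = min_detectable_hr(n=589, event_rate=0.75, prevalence=0.05, correlation=0.0)
```

Rarer factors (lower prevalence) and stronger collinearity with other covariates both shrink the effective information, so the detectable HR moves further from 1.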
Usually, the sample size required for such studies is very large. For more on this topic, see Simon and Altman, who provide a rigorous and thoughtful review of statistical aspects of prognostic factor studies in oncology (13). IDENTIFICATION OF PROGNOSTIC FACTORS Multiple strategies exist for the identification of prognostic factors. The modeling approaches described in the next sections are: logistic regression for binary endpoints (14), proportional hazards regression for time-to-event endpoints (15) (such as OS), and classification trees for both binary and time-to-event endpoints (16–17). Logistic Regression Model Logistic regression is a statistical model that can be applied when the endpoint is dichotomous in nature. It can be used to describe the relationship of the independent prognostic factors (denoted as Xi) to the binary endpoint (denoted as D). Examples of binary endpoints are objective response, recurrence-free survival at 1 year, or OS at 5 years. The logistic model is popular because the logistic function relates the expected odds of an event to the explanatory variables and provides estimates of the event probabilities (14). The logistic function can be written as the expected probability of the event of interest given X1, X2, . . ., Xk: P[D = 1 | X1, X2, . . ., Xk] = 1/[1 + e^(−z)]
where z = b0 + b1X1 + b2X2 + . . . + bkXk is known as the prognostic index or score; b0 is the intercept and the bi's are the unknown regression coefficients, i = 1, 2, . . ., k. The logit transformation is essential to the use of the logistic model and can be written as: log[P(D = 1 | X1, X2, . . ., Xk)/(1 − P(D = 1 | X1, X2, . . ., Xk))] = b0 + b1X1 + b2X2 + . . . + bkXk, where log is the natural logarithm (i.e., to the base e). The estimated regression coefficients can be used to estimate the predicted probability by inverting the regression model, as demonstrated in the following example. Example: Robain et al. used a logistic regression model to predict objective response in 1,426 women with metastatic breast cancer. Objective response was
defined as either the presence of a partial or complete response (18). Fifteen baseline covariates were examined as potential predictors of objective response: age, performance status (Karnofsky index), number of sites (1, ≥2), location of metastases (bone, lung, pleura, liver, peritoneum, skin, lymph nodes), serum lactate dehydrogenase (LDH), weight loss before treatment, menopausal status, disease-free interval from primary tumor diagnosis to metastases, year of inclusion in a metastatic trial, serum alkaline phosphatase, γ-glutamyl transferase (γGT), aspartate aminotransferase (AST), serum albumin level, and absolute lymphocyte count. Forward stepwise regression was utilized to select the prognostic factors, and the maximized log-likelihood was used to compare models at each step. Prior chemotherapy (yes vs. no), low Karnofsky index (<60 vs. ≥60), high LDH (>1 × N vs. ≤1 × N), presence of lung metastases (yes vs. no), and pleural metastases (yes vs. no) were combined to form a predictive score. The estimated score was: −1.32 + 0.54 (no prior adjuvant chemotherapy) + 0.80 (low KI) + 0.75 (elevated LDH) + 0.49 (lung metastases) + 0.51 (pleural metastases). The estimated regression coefficients can be used to compute the predicted probability by inverting the regression model. For example, the predicted probability of not achieving objective response for a woman without prior adjuvant chemotherapy (coded as 0), low Karnofsky index (coded as 1), high LDH (coded as 1), presence of lung metastases (coded as 1), and pleural metastases (coded as 1) is calculated to be: [exp(−1.32 + 0.54 × (0) + 0.80 × (1) + 0.75 × (1) + 0.49 × (1) + 0.51 × (1))]/[1 + exp(−1.32 + 0.54 × (0) + 0.80 × (1) + 0.75 × (1) + 0.49 × (1) + 0.51 × (1))] = 0.774.
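The inversion of the logistic model just illustrated can be checked directly; the coefficients below are the ones quoted in the text from the Robain et al. score.

```python
import math

def predicted_probability(covariates, coefficients, intercept):
    """Invert a fitted logistic model: P = exp(z) / (1 + exp(z)),
    where z is the linear prognostic score."""
    z = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return math.exp(z) / (1 + math.exp(z))

# Score coefficients from the text: no prior adjuvant chemotherapy,
# low Karnofsky index, elevated LDH, lung metastases, pleural metastases.
coefs = [0.54, 0.80, 0.75, 0.49, 0.51]
p = predicted_probability([0, 1, 1, 1, 1], coefs, intercept=-1.32)  # ~0.774
```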
Proportional Hazards Model Often, an investigator seeks to assess the prognostic importance of several independent variables on the time-to-event endpoint. In phase III trials, time-to-event endpoints refer to outcomes where time is measured from randomization until occurrence of an event of interest. The time variable is referred to as the failure time and is measured in years, months, weeks, or days. The event may be death, death due to a specific cause, disease progression, or the development of metastases. In general, OS is the most common time-to-event end point in phase III trials in oncology. Most time-to-event end points must consider a basic analytical element
known as censoring. Censoring arises when information about an individual failure time is incomplete; it occurs when patients either do not experience the event before the study ends or are lost during the follow-up period. Two quantitative terms are fundamental in any survival analysis: the survivor function, denoted by S(t), and the hazard function, denoted by λ(t) (19). The survivor function is the probability that a person survives longer than some specified time t. The hazard function λ(t) is the instantaneous potential per unit time for an event to occur, given that the individual has survived until time t. There is a clearly defined relationship between these two functions, but it is mathematically simpler to model the hazard function than the survival function when an investigator is interested in assessing prognostic factors of time-to-event endpoints. Perhaps the most common approach in the medical literature is the proportional hazards (PH) regression model (15). The PH model is used to analyze such time-to-event data and is a powerful method because it can incorporate both baseline and time-varying factors. The PH model specifies the hazard function as: λ(t | X1, X2, . . ., Xk) = λ0(t) exp(b1X1 + b2X2 + . . . + bkXk), where X1, X2, . . ., Xk represent the baseline covariates and the k × 1 vector b contains the regression coefficients, or log-HR parameters. λ0(t) is the baseline hazard, which is a function of time and equals the overall hazard function when all the covariate values are zero. The PH model is semi-parametric, as it does not specify the form of λ0(t). The covariates are assumed to be linearly related to the log-hazard function. The parameter vector b can be estimated by maximizing the partial likelihood function as described by Cox (15). Estimating b allows one to quantify the relative rate of failure for an individual with one set of covariates compared to an individual with another set of covariates.
From the PH model, the estimated HR for death and its 95% confidence interval are usually summarized. The PH model specifies a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates. This is known as the proportional hazards assumption: if we consider two subjects with different values for the covariates, the ratio of the hazard functions for those two subjects is independent of time. The proportional hazards model is a powerful tool, as it can be extended to include time-dependent covariates and stratification. Time-varying covariates are factors whose values change over time.
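The proportional hazards assumption described above can be illustrated numerically: under the model, the ratio of hazards for two covariate values equals exp(b × Δx) at every time point, regardless of the baseline hazard. The Weibull baseline hazard and the coefficient value below are illustrative assumptions.

```python
import math

def ph_hazard(t, x, log_hr=0.7, shape=1.5):
    """Hazard under a PH model with an (assumed) Weibull baseline:
    lambda(t | x) = lambda0(t) * exp(b * x),  lambda0(t) = shape * t**(shape-1)."""
    baseline = shape * t ** (shape - 1)
    return baseline * math.exp(log_hr * x)

# The hazard ratio between x = 1 and x = 0 does not depend on t,
# because the baseline hazard cancels in the ratio:
ratios = [ph_hazard(t, 1) / ph_hazard(t, 0) for t in (0.5, 1.0, 2.0, 5.0)]
# every entry equals exp(0.7), i.e. the hazard ratio, at all time points
```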
The proportional hazards model can be extended to allow for different strata and is written as: λs(t) = λ0s(t) exp(b1X1 + b2X2 + . . . + bkXk), where λ0s(t) is the baseline hazard function in the sth stratum, s = 1, 2, . . ., l. The stratified proportional hazards model assumes that patients within each stratum satisfy the proportional hazards assumption, but patients in different strata have different baseline hazard functions and thus are allowed to have nonproportional hazards. It is important to note that b does not depend on the strata. Both graphical and test-based methods are used for assessing the proportional hazards assumption (20–21). Graphical approaches include plotting log[−log S(t)] versus time or using the stratified Cox model. There are several formal tests of the proportional hazards assumption; these tests are based on either a specified alternative or a general alternative hypothesis. An omnibus test against a general alternative developed by Schoenfeld is a common and effective approach (21). Example: Halabi et al. identified seven prognostic factors of OS in men with castration-resistant prostate cancer using clinical data from 1,100 patients enrolled on CALGB studies (4). The goal was to have at least 20 events per variable. The data were split in two: two-thirds (n = 760) for the learning set and one-third (n = 341) for the validation set. The prognostic variables were identified a priori based on a review of the literature. Schoenfeld residuals were used to check the proportional hazards assumption. In addition, PSA, LDH, and alkaline phosphatase were modeled on the log scale, as these variables were not normally distributed. The final model included the following factors: LDH, PSA, alkaline phosphatase, Gleason sum, ECOG performance status, hemoglobin, and the presence of visceral disease.
The observed hazard ratio associated with Gleason sum can be computed as exp(0.335), which is 1.40 (see Table 3 of reference 4). This means that men with a high Gleason sum (8–10) had a 1.4-fold increased risk of death compared to men with a Gleason sum <8, after adjustment for the other covariates in the model. The fitted model was then used to plot the nomogram, which is a visual representation of the prognostic model (Fig. 21.2). Classification Trees A classification tree is another approach to classifying patients with respect to the outcome based on their covariates, with the objective of identifying distinct risk
[Figure 21.2 is a nomogram with axes for: points; visceral disease (yes/no); Gleason score (2–7, 8–10); performance status (0, 1, 2); baseline PSA; LDH; alkaline phosphatase; hemoglobin; total points; 12-month survival probability; 24-month survival probability; and median survival time (months).]
FIGURE 21.2 Pretreatment nomogram predicting probability of survival in men with CRPC. Printed with permission from JCO.
groups. Its value lies in the fact that this approach is a powerful tool for model development and validation: it can be nonparametric and can use linear combinations of factors to determine splits (16–17). Furthermore, this method estimates a regression relationship by binary recursive partitioning, in which the data are split into increasingly homogeneous subsets until it is not feasible to continue based on a prespecified set of criteria. Other advantages of recursive partitioning are that it controls the global type I error rate and can be applied to both binary and time-to-event endpoints (22). In addition, it can handle missing covariate values and is effective at modeling complex interactions. Major disadvantages of tree models include: complicated structures that make interpretation of the data difficult, ineffective use of continuous variables, instability of the tree structure, and over-fitting (23). Example: Banerjee et al. used data from 1,055 women with stage I–III breast cancer to model recurrence-free survival (24). The primary endpoint, recurrence-free survival, was defined as the time between diagnosis and documented recurrence (local, regional, or distant), excluding new primary breast cancer. The investigators considered 15 baseline variables: age, race, socioeconomic status, marital status, obesity, tumor size, number of positive lymph nodes, progesterone receptor (PR) status, estrogen receptor (ER) status, tumor differentiation, hypertension, heart disease, diabetes, cholesterol level, and
stroke. Using recursive partitioning, four distinct risk groups were identified based on the prognostic factors: race, marital status, tumor size, number of positive nodes, PR status, and tumor differentiation. Common Problems with Modeling In the next section, common pitfalls in building models are discussed so that they can be avoided. Categorizing a continuous prognostic factor as a binary variable based on the sample median is a common practice in the medical literature (1, 2). In the proportional hazards model, it is assumed that continuous variables have a log-linear relation with the hazard function. Although many researchers dichotomize a continuous variable because they are unwilling to make this assumption, doing so may result in substantial loss of information. Altman suggested an alternative: using several categories to quantify the relationship of a continuous variable to the hazard of death (25). Other approaches, such as cubic splines or fractional polynomials, have been applied to assess the relationship between continuous variables and the hazard function (26). Identification of a prognostic factor based on an optimal cut point is often applied in prognostic studies, but this approach rests on identifying the cut point that yields the minimal p-value (27). The approach is problematic, as it does not correct for the multiplicity of comparisons, and it is rightly criticized due to the
subjectivity and arbitrariness of the cut point. There are newer algorithms that adjust for multiple comparisons, but even when such an adjustment is employed the analysis should be considered only exploratory. A confirmatory study of the prognostic factor should be undertaken, and the sample size needs to be large in order to increase the precision of the estimate. As an example, exploratory statistical methods were used to find different cut points for the association between VEGF levels and OS in men with CRPC. Several cut points above and below the median showed an association between high VEGF levels and decreased duration of survival. At a cut point of 260 pg/ml, the median survival was 17 months (95% CI = 14–18) versus 11 months (95% CI = 6–13; p < 0.0005) for patients below and above the cut point, respectively (28). The multivariable HR associated with a VEGF level ≥260 pg/ml was 2.42, demonstrating the strongest association between VEGF levels and survival time (28). This analysis is deemed exploratory, and the cut point is being validated prospectively in an ongoing CALGB phase III trial. Variable selection is a critical step of model building. Some investigators have used stepwise methods for selecting prognostic factors; this type of variable selection may produce overoptimistic regression estimates that yield low predictive ability (1, 10). As with any regression method, one needs to understand and verify the assumptions of the model. If these assumptions do not hold, then interpreting the results of the fitted model may be difficult. Assessing the proportional hazards assumption in the proportional hazards model is often overlooked. Example: In Smaletz et al., a proportional hazards regression model was initially used to fit the baseline covariates (29). Several variables, however, violated the proportional hazards assumption of a constant hazard ratio over time. Consequently, an accelerated failure time model was used.
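The multiplicity problem with optimal cut-point searching discussed above can be demonstrated by simulation: even when the marker has no relationship to outcome, scanning many candidate cut points and keeping the most significant one rejects far more often than the nominal 5%. The marker distribution, sample size, and number of candidate cut points below are illustrative assumptions.

```python
import math
import random

def max_abs_z(x, y, cutpoints):
    """Largest |z| over candidate cut points for the difference in mean
    outcome between the two groups (outcome has known unit variance)."""
    best = 0.0
    for c in cutpoints:
        lo = [yi for xi, yi in zip(x, y) if xi <= c]
        hi = [yi for xi, yi in zip(x, y) if xi > c]
        if len(lo) < 5 or len(hi) < 5:
            continue  # skip degenerate splits
        z = (sum(hi) / len(hi) - sum(lo) / len(lo)) / math.sqrt(
            1 / len(lo) + 1 / len(hi))
        best = max(best, abs(z))
    return best

rng = random.Random(0)
n, n_sim = 100, 500
cutpoints = [i / 10 for i in range(1, 10)]        # nine candidate cut points
false_pos = 0
for _ in range(n_sim):
    x = [rng.random() for _ in range(n)]          # marker with no true effect
    y = [rng.gauss(0, 1) for _ in range(n)]       # outcome independent of marker
    if max_abs_z(x, y, cutpoints) > 1.96:         # nominal 5% threshold
        false_pos += 1
rate = false_pos / n_sim   # well above the nominal 5% under the null
```

With only nine candidate cut points the false-positive rate already inflates substantially; scanning every observed marker value makes it worse.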
The reader is referred to an excellent review of strategies involved in model building that is provided by Harrell (2).
MODEL VALIDATION The primary goal of a prognostic model is to minimize uncertainty in predicting outcome in new patients (2). Validation is a critical step in developing a prognostic model, and assessing predictive accuracy is the next important step (1–2). As described by Harrell (2), calibration, or reliability, refers to the extent of the bias or match
between forecast and outcome. Discrimination, on the other hand, measures a model's or predictor's ability to classify or separate patients with different responses (2). Overfitting, or overlearning, will often invalidate a model. Overfitting refers to a situation in which a model has been fit to random noise, so the associations between the covariates and the outcomes are spurious. There are several useful and frequently employed measures of the predictive accuracy of a model (2). Among them, the concordance index (c-index) is a widely used measure: it is the proportion of agreement between prediction and outcome among all evaluable patient pairs. An index of 0.5 indicates no discrimination, while a value of 1 indicates perfect discrimination. Somers' D rank correlation is computed as 2(c − 0.5). If the endpoint is binary, then the c-index is identical to the area under the receiver operating characteristic (ROC) curve. Both c and Somers' D can be obtained using standard statistical software (SAS, S-Plus, or R). There are two types of validation: external and internal. External validation is the most rigorous approach, in which the frozen model is applied to a data set independent of the development data. Ideally, investigators would have an independent data set available for validation, although one is rarely available. Types of internal validation such as split-sample, cross-validation, and bootstrapping are therefore used to obtain an unbiased estimate of predictive accuracy (30). In split-sample validation, the data set is randomly divided into two groups: a learning data set, on which the model is developed, and a testing set, on which model performance is evaluated. This process requires care, as imbalances in outcomes or predictors between the two sets may occur and produce unreliable estimates of model performance. Cross-validation is a generalization of data splitting.
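For a binary endpoint, the concordance index and Somers' D described above can be computed directly over all outcome-discordant pairs; the toy predictions below are illustrative.

```python
def c_index(predictions, outcomes):
    """Concordance index for a binary outcome: the proportion of
    (case, control) pairs in which the case (outcome 1) received the
    higher predicted score; ties in the score count 1/2."""
    pairs = concordant = 0.0
    for pi, oi in zip(predictions, outcomes):
        for pj, oj in zip(predictions, outcomes):
            if oi == 1 and oj == 0:
                pairs += 1
                if pi > pj:
                    concordant += 1
                elif pi == pj:
                    concordant += 0.5
    return concordant / pairs

preds = [0.9, 0.8, 0.6, 0.4, 0.2]
events = [1, 1, 0, 1, 0]
c = c_index(preds, events)      # equals the area under the ROC curve here
somers_d = 2 * (c - 0.5)
```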
Similar to data splitting, with this approach one fits a prognostic model on a random subsample and subsequently tests it on the subsample that was omitted. For example, in 10-fold cross-validation, 90% of the original sample is used to develop the model and 10% to test it. This procedure is repeated 10 times, so that every subject serves once in the test set. To obtain accurate estimates using cross-validation, many models (e.g., more than 200) may need to be fitted and tested, with the results averaged over the repetitions. The major advantage of cross-validation over data splitting is that it reduces variability by not relying on a single sample split. Bootstrapping is a very effective technique for deriving reliable estimates without making any assumptions about the distribution of the data (31).
The bootstrap does with a computer what the experimenter would do in practice if it were possible: repeat the experiment. In the bootstrap, observations are randomly drawn with replacement and the estimates are recomputed, mimicking the process of sampling from the underlying population. Bootstrap samples are drawn with replacement from the original sample and are of the same size as the original sample. For example, when 500 patients are available for model development, bootstrap samples also contain 500 patients, but a given patient may not be included at all, or may appear once, twice, three times, and so on. As with cross-validation, the drawing of bootstrap samples needs to be repeated many times to obtain stable estimates. In summary, prognostic studies can address important questions that are relevant to patient outcomes; however, they must be rigorously and carefully designed to ensure reliable results. Such studies should begin with a hypothesis, define the endpoint a priori, use appropriate variable selection approaches, test the robustness of the models applied, and justify the sample size. As the primary goal of such studies is to minimize uncertainty in predicting outcome in future patients, it is vital to validate such factors or models so that prognosis may be better understood.
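The bootstrap resampling procedure described above can be sketched as follows; the sample data, the choice of statistic (the median), the number of replicates, and the seed are all illustrative assumptions.

```python
import random
import statistics

def bootstrap_estimates(sample, statistic, n_boot=1000, seed=42):
    """Draw bootstrap samples (with replacement, same size as the
    original sample) and recompute the statistic on each one."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic([rng.choice(sample) for _ in range(n)])
            for _ in range(n_boot)]

# Illustrative survival times (months):
times = [3, 5, 7, 8, 11, 12, 14, 18, 22, 30]
boots = bootstrap_estimates(times, statistics.median)
boots.sort()
# Percentile-based 95% confidence interval for the median:
lower, upper = boots[24], boots[974]
```

The spread of the 1,000 recomputed medians approximates the sampling variability of the estimate without any distributional assumptions.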
References 1. Altman DG. Studies investigating prognostic factors: conduct and evaluation. In: Gospodarowicz MK, O’Sullivan B, Sobin LH, eds. Prognostic Factors in Cancer. 3rd ed. Hoboken, NJ: Wiley-Liss; 2006:39–54. 2. Harrell FE Jr, Lee KL, Califf RM, et al. Regression modelling strategies for improved prognostic prediction. Stat Med. 1984;3:143–152. 3. Eastham JA, Kelly WK, Grossfeld GD, et al. Cancer and Leukemia Group B (CALGB) 90203: a randomized phase 3 study of radical prostatectomy alone versus estramustine and docetaxel before radical prostatectomy for patients with high-risk localized disease. Urology. 2003;62(suppl 1):55–62. 4. Halabi S, Small EJ, Kantoff PW, et al. Prognostic model for predicting survival in men with hormone-refractory metastatic prostate cancer. J Clin Oncol. 2003;21:1232–1237. 5. Small EJ, Schellhammer PF, Higano CS, et al. Placebo-controlled phase III trial of immunologic therapy with sipuleucel-T (APC8015) in patients with metastatic, asymptomatic hormone refractory prostate cancer. J Clin Oncol. 2006;24:3089–3094. 6. Gospodarowicz MK, O’Sullivan B, Koh ES. Prognostic factors: principles and applications. In: Gospodarowicz MK, O’Sullivan B, Sobin LH, eds. Prognostic Factors in Cancer. 3rd ed. Hoboken, NJ: Wiley-Liss; 2006:23–34. 7. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19:453–473. 8. McShane LM, Altman DG, Sauerbrei W, et al. Reporting recommendations for tumor marker prognostic studies (REMARK). JNCI. 2005;97:1180–1184.
9. Schmoor C, Sauerbrei W, Schumacher M. Sample size considerations for the evaluation of prognostic factors in survival analysis. Stat Med. 2000;19:441–452. 10. Harrell F Jr, Lee K, Mark D. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–387. 11. Royston P, Sauerbrei W. A new approach to modeling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med. 2004;23:2509–2525. 12. Rini BI, Halabi S, Rosenberg JE, et al. Bevacizumab plus interferon-alpha versus interferon-alpha monotherapy in patients with metastatic renal cell carcinoma: results of CALGB 90206. J Clin Oncol. 2008;26:5422–5428. 13. Simon R, Altman DG. Methodological challenges in the evaluation of prognostic factors in breast cancer. Br J Cancer. 1994;69:979–985. 14. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: Wiley & Sons; 1989. 15. Cox DR. Regression models and life tables (with discussion). J Royal Stat Soc B. 1972;34:187–220. 16. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth; 1984. 17. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc. 1993;88:457–467. 18. Robain M, Pierga JY, Jouve M. Predictive factors of response to first-line chemotherapy in 1426 women with metastatic breast cancer. Eur J Cancer. 2000;36:2301–2312. 19. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: Wiley & Sons; 1980. 20. Grambsch P, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81:515–526. 21. Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika. 1982;69:239–241. 22. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comp Graph Stat. 2006;15:651–674. 23. McShane LM, Simon R.
Statistical methods for the analysis of prognostic factor studies. In: Gospodarowicz MK, Henson DE, Hutter RV, O’Sullivan B, Sobin LH, Wittekind CH, eds. Prognostic Factors in Cancer. 2nd ed. New York: Wiley-Liss; 2001:37–48. 24. Banerjee M, George J, Song EY, et al. Tree-based model for breast cancer prognostication. J Clin Oncol. 2004;22:2567–2575. 25. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127–141. 26. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med. 1989;8:551–561. 27. Hilsenbeck SG, Clark GM. Practical p-value adjustment for optimally selected cutpoints. Stat Med. 1996;15:103–112. 28. George D, Halabi S, Shepard T, et al. Prognostic significance of plasma vascular endothelial growth factor (VEGF) levels in patients with hormone refractory prostate cancer: a CALGB study. Clin Cancer Res. 2001;7:1932–1936. 29. Smaletz O, Scher HI, Small EJ, et al. A nomogram for overall survival of patients with progressive metastatic prostate cancer following castration. J Clin Oncol. 2002;20:3972–3982. 30. Steyerberg EW, Harrell FE Jr, Borsboom GJ, et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–781. 31. Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat. 1983;37:36–48.
22
Pitfalls in Oncology Clinical Trial Designs and Analysis
Stephanie Green
This chapter provides an overview of some design and analysis pitfalls—traps that lead us to faulty interpretation of results. Most pitfalls are related to bias and variability. Statisticians are annoyingly insistent about standards for clinical trials, but with good reason. Unless clinical trial results are unbiased and precise, we are led astray.
DESIGN ISSUES
Choice of Control Group

Usefulness of a new treatment is judged by comparing outcome results for a new treatment with outcome information on standard treatment. For the comparison to be a valid assessment of differences in outcome, the two groups must be as similar as possible with the exception of treatment. A randomized trial comparing a control group to an experimental treatment group is the most convincing and reliable method for minimizing bias and demonstrating effectiveness of a new treatment, but not all trials can be designed as definitive randomized comparisons.

Historical Controls

Historical Information from the Medical Literature

Historically controlled trials are usually single-arm trials without specific control groups, but rather are controlled in the sense that statistical hypotheses are based on estimates from the literature. If the response rate of standard treatment in the literature is consistently xx%, a test may be done to ascertain whether the response rate on the new treatment is significantly greater than xx%. This approach works best if historical estimates are well characterized with low variability and are stable over time. This may be true for uniformly nonresponsive disease with no effective treatment, resulting in low variability in patient outcomes.
Pitfalls

There is potential for bias of unknown magnitude or direction in all single-arm trials. Many factors related to outcome contribute to entry of patients on a trial, with resulting biases that can make the experimental treatment appear either more or less effective than it actually is. Investigators may enroll patients who are relatively healthy to a new treatment, or in other circumstances may choose patients who are too compromised for other options. In the first case, an ineffective treatment might appear to be an improvement over results reported in the literature, whereas in the second an effective treatment might appear ineffective.

Another pitfall is related to the large number of phase II trials done. It is the promising studies that are followed up with large comparative trials. Results of the comparative trials often do not validate the high
response rates observed in phase II. Bias is certainly a factor in this observation, but so is variability. Due to variability in outcome, some inactive treatments will appear to be useful in phase II. Even if patient selection is unbiased, if tests are done at the .05 level then 5% of inactive treatments will be tested further. Thus treatments that are taken to phase III are a mixture of false and true positives, with the false positives providing overestimates of response. An interesting paper by Zia et al. (1) describes 43 phase III cancer studies following phase IIs with identical treatment in the same intended patient population. The trials were identified through a search of chemotherapy studies in solid tumors published from July 1998 to June 2003. In 35 of the 43, the experimental treatment had lower response rates than in the corresponding phase IIs, with the difference averaging 13%. The rate was substantially higher in only one phase III. Considering the potential for error, in most circumstances a single-arm trial will not provide a reliable answer to the question of whether a new treatment is an improvement over standard.

An unfortunate example of the single-arm pitfall is autologous transplantation in breast cancer. Approximately 70 single-arm trials were reported in ASCO abstracts between 1992 and 1994. Nearly all of these were reported as promising or worth further study. Many people were convinced of efficacy based on these single-arm trials and demanded access. Litigation lasted for years and billions were spent (2), yet in the end randomized trials did not confirm usefulness of the approach (3).

Patient selection is not the only source of bias for a historically controlled trial. There should also be no recent treatment improvements, changes in staging definitions, improvements in diagnostic procedures, or changes in the way the primary outcome is assessed. Otherwise, results may be confounded with temporal changes.
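The variability point can be illustrated with a short simulation (a sketch; the trial size, cutoff, and response rates below are illustrative choices, not taken from the chapter): inactive agents tested at roughly the .05 level occasionally pass phase II, and the ones that pass necessarily report inflated response rates.

```python
import random

random.seed(1)

N_TRIALS = 20_000   # many hypothetical phase II trials of *inactive* agents
N_PATIENTS = 50     # illustrative single-arm phase II size
TRUE_RATE = 0.25    # true response rate equals the historical control rate
CUTOFF = 18         # "promising" if >= 18/50 respond (approx. one-sided level .05)

advanced = []       # observed response rates of agents taken on to phase III
for _ in range(N_TRIALS):
    responses = sum(random.random() < TRUE_RATE for _ in range(N_PATIENTS))
    if responses >= CUTOFF:
        advanced.append(responses / N_PATIENTS)

false_positive_rate = len(advanced) / N_TRIALS
mean_observed = sum(advanced) / len(advanced)
print(f"inactive agents advancing: {false_positive_rate:.1%}")
print(f"mean observed response rate among advancers: {mean_observed:.2f} "
      f"(true rate {TRUE_RATE})")
```

Roughly 5% of the inactive agents advance, and those selected trials necessarily report response rates well above the true .25, which is the overestimation the subsequent phase IIIs then fail to reproduce.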
In the past, the probability of response for various types of cancer was uniformly dismal, so the single-arm approach was often reasonable. However, with recent treatment and diagnostic advances old assumptions no longer hold and the approach is becoming less informative. Alternative designs are becoming more common as variability in treatment outcome increases, as previously stable historical rates improve, and as new subsets of patients are defined according to new tumor biology characteristics.

Specific Historical Control Groups

For uncommon tumors it may be infeasible to do a randomized trial. In such cases, carefully chosen specific control groups may be useful. Use of prognostic factors is helpful in the selection of controls for this type of trial. Even so, only large differences should be accepted
as evidence of improvement due to experimental therapy. Specific control groups are potentially better comparators than values chosen from the literature, but will still be subject to bias.

Pitfalls

Any particular choice of control, no matter how carefully chosen, will be systematically different from the experimental group in many (and often unknown or unmeasurable) ways. Even if matched on patient characteristics, there will be subtle differences in the reasons patients were chosen for treatment and changes over time in how patients were managed.

Examples of trials controlled with a sequence of patients treated before a new therapy is introduced are common. For example, Cheung et al. (4) compared a sequence of 36 cadaveric renal transplant patients treated with a tacrolimus-based immunosuppressive regimen plus induction daclizumab to a control group of a previous sequence of 21 patients given the immunosuppressive regimen without induction therapy. Sample size was not preplanned and potential biases were not identified. The percent of acute rejections in the control arm was higher (test not significant) and no significant differences were found in secondary efficacy and safety end points. The abstract conclusion was “adding daclizumab to a tacrolimus-based therapy is safe but cannot further improve clinical efficacy,” apparently equating nonsignificance with no difference. A very small trial with potential biases due to a nonrandomized control simply cannot support such a conclusion.

Even carefully done trials must be interpreted carefully. A trial of prophylactic cranial radiotherapy (PCRT) in children with lymphoblastic lymphoma (5) illustrates some of the issues. One hundred fifty-six patients with selected characteristics from a new trial were compared to 163 selected patients from two previous trials in the same population to determine whether PCRT could be omitted from the treatment regimen for CNS-negative patients.
This approach was chosen due to the rarity of the patient population. Decision criteria were specified in advance as to what would constitute evidence that PCRT could be omitted and the sample size was chosen to have adequate power. The paper included a description of differences in the studies (induction chemotherapy doses differed somewhat), differences in the patient groups (there were more mediastinal tumors in control) as well as a sensitivity analysis related to a potential source of bias in the patient selection. By the prespecified criteria, omission of PCRT was found to be acceptable. Appropriately, the discussion outlined points of concern in interpretation of results. In addition to the study and patient differences, one of the control studies had an unexpectedly large favorable result. Trial conclusions
would have been less clear had only this trial been chosen as the control. The abstract conclusion “treatment without PCRT may be noninferior to treatment including PCRT” is a far more suitable conclusion than that of the Cheung paper.

Use of specific groups may be particularly problematic for comparisons in settings where small absolute differences are important, such as for treatments that are particularly effective. If 90% of patients in a good-risk control group have a good outcome, then an absolute difference of 5% for an experimental treatment will appear to be substantial. Yet this could easily be explained by a few differences in the characteristics of the patients studied (e.g., an adjustment in the definition of good risk) or differences in the procedures used (e.g., a change in the method of determining marker positivity).

It should be kept in mind that design considerations are different when a retrospective control group is used instead of a fixed value from the literature or a prospectively randomized control group. Sample size considerations are a function of the results of the control group and may also account for covariate adjustment. The outcome of interest from the control group cannot be treated as a fixed population value, since the observed value is an estimate, not the true population value. On the other hand, designing as if the historical group is a concurrent comparison is also not correct, since the comparator results are already known (6–9).

Randomized Controls

Control groups in randomized studies take various forms (10). The control group may be untreated except for palliative care (increasingly rare in cancer trials) or may consist of a standard treatment. A placebo may be administered in either of these settings.

Unblinded Control

In an unblinded trial both the patient and clinician know the treatment assignment. For a typical two-arm trial, patients on the control arm and experimental arm both receive standard of care.
The control arm patients receive no additional therapy while the experimental arm patients receive the experimental agent. All other procedures, treatments, and assessments are the same. If the trial is executed as planned, differences will be attributable to the addition of the experimental agent.

Pitfalls

Patients on the control arm may feel cheated at not receiving the new therapy. There may be little motivation to return to the clinic for routine visits, so outcome information may not be carefully collected, or patients might drop out of the trial to
FIGURE 22.1 Potential bias due to drop out. (Survival curves under two extremes: all lost control arm patients dropped out to get effective alternative treatment, versus all dropped out due to deteriorating condition and no motivation to cooperate; the Kaplan-Meier estimate lies between.)
receive treatment other than the planned control treatment. Bias may be introduced if control patients drop out of a study. Figure 22.1 illustrates how reasons for drop out may influence estimates. It shows how the graph would look if all five patients who dropped out did well (solid line) and if all five patients who dropped out failed immediately (dotted line). The Kaplan and Meier (11) estimate (dashed line), which censors for drop-outs, falls between the extremes and will be biased to the extent that reasons for drop-out are related to outcome.

Assessments of subjective outcomes in an unblinded study are particularly subject to bias. Investigators enthusiastic about a new treatment may be more optimistic in assessing disease, or may be more likely to attribute adverse events (AEs) to treatment on the experimental treatment arm than on the control arm. Patients may also be more inclined to attribute benefit to experimental treatment. A proportion of patients will report improvement in subjective disease symptoms or experience of treatment side effects due to the placebo effect whether or not the treatment is active (12). A series of trials of venlafaxine in sexual dysfunction (13), hot flashes (14), panic disorder (15), generalized anxiety disorder (16), migraine pain (17), and so on all noted improvements on the placebo arms, sometimes less than the effect of venlafaxine, sometimes similar. Without the placebo control it would not have been possible to assess whether venlafaxine was effective. In an unblinded study the placebo effect occurs mainly in the experimental treatment arm, so improvement may be indicated even when the new treatment is inactive.

Blinded Control

Blinding patients and clinicians to treatment arm assignment by treating control arm patients with a placebo of the same appearance as the experimental agent reduces the potential for biases related to knowledge
of treatment assignment. Compliance is improved for patients, and objectivity of outcome assessments is improved for investigators. In addition, both arms should experience the placebo effect, so differences between the groups should be due to active treatment.

Pitfalls

Blinded placebo-controlled trials are resource intensive. Significant time and money are needed for manufacturing the placebo; labeling, shipping, and tracking coded placebos and active agents; setting up mechanisms for distribution with pharmacies; and arranging for emergency unblinding. Mix-ups occur, with corresponding complications for analysis. For trials with survival (or another objective end point) as the primary end point, the benefits of blinding may not justify the cost. For such trials to proceed without blinding, patients should have no over-the-counter access to useful treatments, and compliance problems associated with knowledge of the treatment assignment should be anticipated to be minor.

Despite best efforts, blinding may be imperfect. Distinctive side effects, slight differences in the appearance or consistency of the placebo, response to treatment, and requirements for individual patient unblinding may result in at least partial unblinding of a study. Information on the success of blinding in clinical trials is limited. Hrobjartsson et al. recently reviewed a sample of publications of 58 blinded, randomized trials in all diseases (18). They note that assessment of blinding success is rarely reported and that among those reported there is mixed success, with 14/31 (45%) reporting successful blinding. Another review by Boutron et al. (19) of 58 study publications reporting success of blinding noted inconsistent and inadequate methods of assessment of subject blinding. The most often used method was to ask subjects to guess their treatment assignment and to conclude blinding was successful if the proportion guessing correctly was not significantly different from 50%.
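The "guess your assignment" check amounts to a binomial test against 50%. A minimal exact version can be written as follows (the counts are hypothetical, chosen only to illustrate the calculation):

```python
from math import comb

def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial test p-value: total probability of all
    outcomes no more likely than the observed count k under p0."""
    pmf = lambda i: comb(n, i) * p0**i * (1 - p0)**(n - i)
    pk = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= pk + 1e-12)

# Hypothetical blinding check: 100 subjects asked to guess their assignment
correct = 62
p = binom_two_sided_p(correct, 100)
print(f"{correct}/100 correct guesses, two-sided p = {p:.3f}")
```

As the text cautions, a small p-value here does not by itself prove the blind was broken: if guesses track outcome rather than assignment and the experimental treatment is effective, the same signal appears.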
Unfortunately, a significant result using this method does not necessarily mean the study was unblinded (e.g., if guesses are related to outcome instead of treatment arm and the experimental treatment is effective, then the result may be an incorrect appearance of unblinding). In some cases it may be impossible to blind, such as when a treatment has such a distinctive side effect that it cannot be reproduced in a placebo. In other cases, placebo treatment may entail too much risk to the patient, such as a sham surgery.

Choice of Experimental Arm

A typical choice for the experimental arm of a randomized trial is standard of care plus experimental agent. Other choices for an experimental treatment
may be a combination of different agents, different ways of administering treatment, new schedules, or other variations that cannot be described simply as standard plus new. Double placebos are sometimes used to mask treatment assignment for these trials. In other cases, the second arm may just be a competing standard treatment.

Pitfalls

If the study tests efficacy of Control (C) versus C + Experimental (E), conclusions cannot be made concerning the usefulness of E alone. Due to potential treatment synergy or inhibition, improvement over standard does not prove single-agent activity, and lack of improvement over standard does not disprove single-agent activity. If any aspect of C is different in the combination arm for a trial of C versus C+E (e.g., modified dose or schedule), differences may be attributable to C rather than E.

A superiority trial of C versus E will allow for assessment of the difference between control and experimental treatment. However, any difference will likely not be attributable to a particular component of E, possibly making it difficult to build on the results of the trial. For a trial of a standard combination (A+B) versus sequential administration of the agents with higher doses (A+ → B+), for instance, if the experimental arm is superior it will not be clear whether the sequence (A→B) or the higher doses for one or both of the agents (A+ or B+) were the key components of success.

An equivalence trial of C versus E aims to show that E is as good as C. “As good as” is difficult to establish. Nonsignificance does not prove equivalence; it means only that C and E have not been shown to be different. To prove equivalence, a tight confidence interval ruling out all but a small benefit to C is necessary. In addition, results of the standard versus experimental comparison should not only demonstrate that results are similar, but also that it is reasonable to conclude that E is superior to supportive care.
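The confidence-interval point can be made concrete. In the sketch below (response counts and the 10-point margin are hypothetical), both trials give nonsignificant comparisons, yet only the larger one yields an interval tight enough to rule out more than a small advantage for C:

```python
from math import sqrt

def diff_ci(x_e, n_e, x_c, n_c, z=1.96):
    """Wald 95% CI for the response-rate difference p_E - p_C (normal approx.)."""
    pe, pc = x_e / n_e, x_c / n_c
    se = sqrt(pe * (1 - pe) / n_e + pc * (1 - pc) / n_c)
    return pe - pc - z * se, pe - pc + z * se

MARGIN = -0.10  # hypothetical margin: E may be at most 10 points worse than C

lo_small, hi_small = diff_ci(20, 50, 22, 50)        # 40% vs 44%, 50 per arm
lo_large, hi_large = diff_ci(400, 1000, 440, 1000)  # same rates, 1000 per arm

print(f"small trial: CI ({lo_small:+.3f}, {hi_small:+.3f}), "
      f"noninferior: {lo_small > MARGIN}")
print(f"large trial: CI ({lo_large:+.3f}, {hi_large:+.3f}), "
      f"noninferior: {lo_large > MARGIN}")
```

Both intervals contain zero (both trials are "nonsignificant"), but only the large trial's interval excludes differences worse than the margin, which is exactly the distinction between nonsignificance and demonstrated equivalence.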
Design considerations for such trials are challenging (20–22), particularly since accurate estimates of benefit over supportive care are not readily available.

Structural Issues

Although randomization is necessary for attributing differences to treatment, it may not be sufficient. Care must be taken in designing the trial so that all patients are assessed and followed in the same way. Bias may be introduced if the trial is not structured in a way that ensures equivalent information is submitted on both trial arms.
FIGURE 22.2 Bias due to different assessment schedules. (Kaplan-Meier progression curves for two groups with identical times-to-progression, one assessed every 3 weeks and the other every 8 weeks; the less frequently assessed group appears to have the superior outcome.)
Pitfalls

A poorly conducted randomized trial may still result in substantial bias. For instance, if the outcome assessment schedules on the two arms are not the same, results will artificially favor the arm with the less frequent schedule. Figure 22.2 shows Kaplan-Meier curves for six patients with identical times-to-progression, but for one group disease is assessed every 3 weeks whereas the other is assessed every 8. The delay in documentation of progression in the 8-week arm is enough to make it appear that this group has a superior outcome.

Methods of assessment should also be the same. If the method of assessment for progression in one group is more sensitive, then results will be biased against this group. For instance, additional intensive follow-up on the experimental arm for AEs will also create additional chances for identifying progressive disease, leading to the unfortunate circumstance of time-to-progression being biased early compared to control and toxicity incidence being biased high.

Criteria for inclusion and exclusion must also be the same. If information is available in one group but not the other, then this information should not be used. For example, disease information in an experimental surgery group might identify patients who should not be eligible for the trial, but these cannot be excluded from the study since similarly unsuitable patients from the control group are not excluded.

Alternative Designs

Randomized phase III trials are expensive, take a long time, and the success rate in cancer is not high. Various approaches are used with aims such as improving design of early studies for better prediction of phase III success, improving efficiency, or answering more questions in a single study.
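The assessment-schedule artifact is easy to reproduce in a few lines: give two groups identical true progression times and record progression only at scheduled visits (the event-time distribution below is an arbitrary illustration):

```python
import math
import random

random.seed(2)

def documented_time(true_time, interval_weeks):
    """Progression is documented at the first scheduled assessment
    on or after the true progression time."""
    return math.ceil(true_time / interval_weeks) * interval_weeks

# Identical true times-to-progression in both groups (median about 24 weeks)
true_times = [random.expovariate(math.log(2) / 24) for _ in range(1000)]

mean_q3 = sum(documented_time(t, 3) for t in true_times) / len(true_times)
mean_q8 = sum(documented_time(t, 8) for t in true_times) / len(true_times)
print(f"mean documented TTP, every 3 weeks: {mean_q3:.1f} weeks")
print(f"mean documented TTP, every 8 weeks: {mean_q8:.1f} weeks")
```

The 8-week group appears to progress later even though the underlying times are identical, purely from the delay in documentation.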
Better Prediction

Randomized Phase II

As described above, Zia et al. (1) reported that response rates in phase III trials were not adequately predicted by phase II trial results. Worse, only 12 of the 43 (28%) phase IIIs were positive. Twenty-eight percent may represent a higher success rate than for agents not taken to phase III (presumably due to insufficient activity in phase II), but it is certainly not an encouraging success rate. As well, this may be an overestimate due to publication bias in favor of positive results. Interest in randomized phase IIs has been increasing, seemingly with the idea that a small randomized trial will be an accurate predictor of phase III success because it is randomized.

Pitfalls

Unfortunately, randomization provides only a false sense of security. With the small sample sizes of a randomized phase II, there are three choices: poor power, poor level, or both. Add in the fact that the phase III end point of survival generally isn’t used in the phase II, and the result is that phase II prediction will be wrong more often than not. Noncomparative trials are often recommended, but in the end treatment arms will be compared and judgments are made concerning relative effectiveness. Noncomparative really means informal comparison with decision by intuition instead of specified criteria, making description of design properties impossible.

Whether or not a randomized phase II is more useful than a single-arm trial will depend on how accurately the control results are known and on the relative importance of finding more active agents versus maximizing the phase III success rate. A simplified example of five approaches illustrates this.
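For the single-arm case the arithmetic can be checked exactly with binomial tail probabilities. The sketch below assumes a 50-patient trial that is "positive" with 18 or more responses, an approximate level .05 test of .25 versus .45; the design constants are illustrative choices consistent with, but not taken verbatim from, the example:

```python
from math import comb

def prob_positive(n, cutoff, p):
    """P(at least `cutoff` responses among n patients | true response rate p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(cutoff, n + 1))

N, CUTOFF = 50, 18

level = prob_positive(N, CUTOFF, 0.25)   # false-positive rate if control rate is .25
power = prob_positive(N, CUTOFF, 0.45)   # power if the new agent adds .2

# False-positive rate if the true control rate is actually .15 in 20% of
# trials, .25 in 40%, and .35 in 40%:
mixed_level = (0.2 * prob_positive(N, CUTOFF, 0.15)
               + 0.4 * prob_positive(N, CUTOFF, 0.25)
               + 0.4 * prob_positive(N, CUTOFF, 0.35))

print(f"level = {level:.3f}, power = {power:.3f}, "
      f"level under control variability = {mixed_level:.2f}")
```

The nominal .05 level roughly quadruples once uncertainty in the historical control rate is acknowledged, in line with the single-arm column of Table 22.1.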
Consider (1) a single-arm trial of 50 patients testing .25 versus .45 with level approximately .05 and power .9, (2) a small 74-patient randomized trial testing at the .05 level (with correspondingly low power), (3) a small 74-patient randomized trial with power approximately .9 (with correspondingly high level), (4) a larger 200-patient randomized trial tested at level .05, and (5) a hybrid approach using a 50-patient single-arm trial followed by a 74-patient randomized phase II if the single-arm trial isn’t convincingly positive or negative.

If the assumption of .25 is accurate, the best approach is to use a single-arm phase II. As indicated in Table 22.1, the single-arm trial outperforms the other approaches and has the smallest sample size. Of course, .25 will not be correct in every study. Assume, for example, that .25 is correct for 40% of trials, that it is .35 in 40% of trials, and .15 in the remaining 20% of trials, due to variations in the types of patients accrued. Table 22.1 shows the probability of further development
for the five designs when there is no improvement in response and when there is an improvement of .2.

TABLE 22.1
Probability of a Positive Trial Result Under the Null and Alternative Hypothesis Using Different Phase II Strategies.

                                        SINGLE-ARM   SMALL RANDOMIZED,   SMALL RANDOMIZED,   HYBRID         LARGE RANDOMIZED
                                        N=50         GOOD LEVEL, N=74    GOOD POWER, N=74    N=50 OR 124    N=200
No variability in control
(response = .25 for 100% of trials)
  Null: E=C                             .05          .05                 .36                 .01            .05
  Alternative: E=C+.2                   .91          .46                 .88                 .82            .88
Variability in control
(response = .15 in 20%, .25 in 40%, .35 in 40%)
  Null: E=C                             .21          .05                 .34                 .09            .05
  Alternative: E=C+.2                   .86          .46                 .93                 .84            .88

It is hard to conclude that the small randomized designs are better than a single-arm design with this degree of uncertainty in the control response rate. Improvement comes only with substantially larger sample sizes.

The probability of a positive phase II is only part of the story, as we are also interested in the probability of phase III success. To address this, information is needed on the percent of new agents that are active and how well response predicts survival. For example, assume 20% of new treatments are active with respect to response, and half of these produce a survival benefit. In addition, assume a few agents inactive with respect to response will produce a survival benefit. Under these assumptions there will be 14 potential positive phase IIIs for every 100 agents tested. Table 22.2 shows how the five approaches from Table 22.1 do in phase III. From the point of view of identifying good agents, the small randomized phase II with good level and poor power
does the worst—only 5 of the 14 are identified. On the other hand, this is the best with respect to the number of phase III trials done—only 13. Not surprisingly, the small randomized trial with good power is best at identifying agents—but this is at the expense of doing the most phase IIIs, with the lowest chance of success in phase III. The large randomized phase II approach yields the highest success rate in phase III, but this is at the expense of doing phase IIs that are large enough to be termed phase III. Single-arm trials yield a good number of positive phase IIIs. More phase IIIs are done than for large randomized phase IIs, but this is offset by the fact that the single-arm trials require one-fourth of the patients and are completed more quickly. The hybrid approach does reasonably well, but has the disadvantage of delay if the second phase II needs to be done.

In settings where there is limited historical information, a large randomized phase II may be justified, as improved estimates for control patients will be important for the design of future studies. Otherwise,
TABLE 22.2
Identification of Useful Agents and Chance of Phase III Success Using Different Phase II Strategies.

DESIGN                          NO. PHASE IIIS   NO. POSITIVE   PERCENT OF USEFUL    PERCENT OF PHASE IIIS
                                DONE             PHASE IIIS     AGENTS IDENTIFIED    THAT ARE POSITIVE
Single-arm                      34               9              64%                  26%
Small randomized, good level    13               5              36%                  38%
Small randomized, good power    46               11             79%                  24%
Hybrid                          24               9              64%                  38%
Large randomized                22               9              64%                  41%
it is not so clear that use of randomized phase IIs will improve decision making.

Better Efficiency

The phase I, phase II, phase III process of drug development is costly and takes a long time. Designs combining phase I and II or phase II and III have been proposed to get more efficiently from early assessment to a definitive comparison. Phase IIIs are particularly expensive and time consuming. To make better use of resources, designs to answer multiple treatment questions within a single trial are sometimes used.

Phase I/II

In some sense, all phase I trials have phase II aspects and all phase II trials have phase I aspects. Although the primary aim in phase I is to identify a suitable dose, we look for hints of activity as well. And although the primary aim in phase II is assessment of activity, toxicity is an important secondary end point, in part because the maximum tolerated dose from phase I is so poorly estimated that it may need to be modified with further experience in phase II. A phase I/II design assesses both toxicity and efficacy. Ideally, the aim is to select a dose with high efficacy among dose levels with acceptable toxicity or, alternatively, the dose with the best efficacy-toxicity profile. Typical phase I sample sizes of 3 to 6 per dose are too small to support a joint assessment of both efficacy and toxicity. On the other hand, an accurate assessment of both efficacy and toxicity would require a very large sample size. Proposed phase I/II designs generally have larger but still modest sample sizes for the dose groups. The basic idea is to identify a region with both acceptable activity and toxicity. For most proposals it is assumed that tolerability decreases with dose and efficacy increases (Fig. 22.3).

Pitfalls

A simplistic approach to phase I/II would be to use the patients accrued to the dose selected in a phase I as the first several patients of a phase II. A risk in using this approach is bias in the phase II.
FIGURE 22.3 Doses with both acceptable efficacy and toxicity. (The percent of patients with efficacy rises with dose while the percent who tolerate treatment falls; the feasible dose range is where both are acceptable.)
Patients offered a phase I study likely represent a different population from those offered a phase II, with potential for bias in either direction. Furthermore, by design, the subjects on the selected dose have a relatively low level of toxicity. If toxicity and efficacy are related, then this group may have a different efficacy profile than the target phase II population. In addition, if efficacy results in the final phase I cohort look unpromising, the rest of the phase II might not be done. Since the phase I assessment amounts to an unspecified informal interim analysis, it becomes impossible to describe the design characteristics of the trial. A sounder approach is joint assessment of toxicity and efficacy with specific criteria.

An issue in phase I/II design is that the typical phase I population, with eligibility encompassing all solid tumors, generally will not be suitable for a phase I/II trial. If the efficacy end point has widely different distributions among subgroups of eligible patients, then there will not be a single feasible dose range. At best, results will be difficult to interpret. At worst, an inappropriate dose may be chosen due to patient differences in the dosing cohorts.

A second issue is that the assumption of increasing efficacy with dose may not always be reasonable. If efficacy first improves with dose and then decreases, and if toxicity is mild, then an inappropriately high dose may be chosen for further study. This is a particular concern for targeted and immunotherapeutic agents. Recent proposals from Thall et al. (20) and Bekele et al. (21) allow for more complicated dose-efficacy relationships.

Third, various proposals employ Bayesian methods, meaning assumptions about toxicity and efficacy probabilities are updated after each patient cohort is observed for the end points. Dose for the next cohort is based on the current model of efficacy-toxicity tradeoffs.
Model-based designs must be implemented carefully so that doses are not escalated too aggressively, since early estimates of the best dose are inaccurate and skipping dose levels carries too high a risk of severe toxicity.

Another drawback to the phase I/II approach is the effort it takes to design the studies and the time it takes to conduct them. Extensive discussions concerning design parameters are needed, followed by extensive simulations to characterize the properties of the design. It may take several iterations to identify an appropriate design. During conduct of the study, extra time is required to assess both toxicity and efficacy before information is available to determine the dose for the next cohort. Despite the efficiency motivation, combining phase I and phase II may not, in fact, be an efficient strategy.
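The basic selection idea, choosing the most efficacious dose among those with acceptable toxicity, can be sketched on hypothetical true dose-response curves (every number below is invented for illustration):

```python
doses      = [10, 20, 30, 40]          # hypothetical dose levels
p_toxicity = [0.05, 0.15, 0.30, 0.55]  # tolerability worsens with dose
p_efficacy = [0.15, 0.35, 0.40, 0.38]  # efficacy need not keep rising
TOX_LIMIT  = 0.33                      # maximum acceptable toxicity rate

# Phase I/II rule: among doses with acceptable toxicity, take the most
# efficacious one.
feasible = [i for i, t in enumerate(p_toxicity) if t <= TOX_LIMIT]
best = max(feasible, key=lambda i: p_efficacy[i])
print(f"joint efficacy-toxicity choice: dose {doses[best]}")

# If efficacy peaked earlier, a toxicity-only rule (highest tolerable dose)
# would overshoot the peak:
p_efficacy_early_peak = [0.15, 0.40, 0.35, 0.30]
best_joint = max(feasible, key=lambda i: p_efficacy_early_peak[i])
best_tox_only = max(feasible)
print(f"early-peak curves: joint rule picks dose {doses[best_joint]}, "
      f"toxicity-only rule picks dose {doses[best_tox_only]}")
```

In practice these curves must be estimated from small cohorts rather than read off as above, which is precisely why the model-based versions require the careful calibration and simulation described in the text.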
An area where phase I/II designs might be worth the extra complications is in proof-of-concept assessment. For example, a joint assessment of toxicity and immune response for an immunologic agent could provide valuable information. Phase I/II approaches could potentially lead to better choices of dose for further development than approaches considering only toxicity, especially in cases where the dose-toxicity relationship is not strong. However, the complications and risks of a phase I/II generally argue against the use of this strategy.

Phase II/III

There are two main types of phase II/III proposals. The most common type addresses the problem of screening several candidate treatments within a randomized controlled trial. Unpromising treatments are eliminated in the first (phase II) stage of the trial, and randomization is continued for a phase III assessment of the remaining treatments (22–24). The second type of phase II/III jointly assesses early and final end points, with the aim of more efficient decision making (25).

Pitfalls

Although it is possible to do a two-arm trial with a phase II assessment based only on the experimental arm, this is risky. The percent of phase II agents that turn out to be useful is not high. The resources expended on planning for a phase III trial and accruing to a control arm are wasted if the phase II is negative. Further, resources are wasted if randomization and accrual continue while an ultimately negative phase II end point is being assessed. There are also ethical issues associated with this approach, both in launching a phase III without adequate phase II assessment and in continuing to accrue patients while an inconclusive phase II assessment is ongoing.

When there are several phase II candidates and limited resources, an initial selection based on relatively small samples can be a reasonable approach.
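A small simulation (arm sizes, response rates, and threshold below are illustrative) shows why a stage-1 selection struggles to pick the truly best arm when two experimental arms differ only moderately:

```python
import random

random.seed(4)

def simulate(p_arms, n=30, threshold=0.05, reps=5000):
    """Stage-1 sketch: arm 0 is control; pick the experimental arm with the
    highest observed response rate, and continue to stage 2 only if it
    exceeds control by `threshold`. Returns P(continue) and
    P(truly best arm was picked | continue)."""
    best_true = max(range(1, len(p_arms)), key=lambda i: p_arms[i])
    cont = correct = 0
    for _ in range(reps):
        obs = [sum(random.random() < p for _ in range(n)) / n for p in p_arms]
        pick = max(range(1, len(p_arms)), key=lambda i: obs[i])
        if obs[pick] >= obs[0] + threshold:
            cont += 1
            correct += (pick == best_true)
    return cont / reps, correct / cont

# Control rate .30; both experimental arms active, true rates .45 and .50
p_continue, p_best_picked = simulate([0.30, 0.45, 0.50])
print(f"P(continue) = {p_continue:.2f}, "
      f"P(best arm chosen | continue) = {p_best_picked:.2f}")
```

With only 30 patients per arm at stage 1, the trial nearly always continues, but the better of the two active arms is chosen well under 100% of the time; the design mostly guarantees only that the selected arm is not much worse than the one left behind.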
Generally, the best of the experimental arms is chosen for continued testing provided the difference above control meets a specified threshold. The threshold is chosen such that if at least one of the experimental arms is better by a specified amount, then the trial is likely to continue. A significance test of the chosen arm versus the control arm is done at the end of the trial adjusting for the initial test. A pitfall of this approach is that if more than one experimental arm is effective, the procedures will not do a good job of selecting the best one, since the initial stage sample sizes are too small to distinguish moderate differences among arms. Usually the best that can be said is that the arm chosen is at most not too much worse than the arms not
chosen. If, in addition, the initial screening is based on an early end point, then risks of choosing the wrong arm for continued testing increase. The probability of choosing an inferior arm may become unacceptably high. Screening does not come for free. Sample size requirements per arm increase for the first stage of accrual as the number of screened arms increases. This is because the best arm must beat out more competition to be chosen for the second stage. Furthermore, the total sample size for the second stage comparison increases with the number of arms and will be somewhat larger than the sample size for a standard phase III. This is because the final test must be adjusted for the initial testing at the selection stage. The second approach to phase II/III assessment is joint assessment of an early end point along with the final end point for the initial phase II decision using information from both arms. For instance, if survival is the primary end point and the new treatment is expected to decrease death through influence on local control of disease, then an early decision might use information on both local control and death. Such methods generally require model assumptions that may or may not be true and are usually difficult to verify with small sample sizes. If the model assumptions are wrong, conclusions may be wrong. As well, the relationship between the early and primary end points needs to be strong for best use of this approach. If, for instance, local control is not the primary mediator of survival benefit, then a useful agent might be rejected. Either type of phase II/III design requires more effort to set up than a standard phase II followed by a standard phase III. As for phase I/II designs, establishing appropriate design parameters may require substantial programming and several iterations to arrive at desired design properties. 
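The kind of simulation work mentioned above can be sketched in a few lines. The toy example below (hypothetical response rates and sample sizes, not from any cited trial) estimates how often a pick-the-winner first stage selects the truly best of three experimental arms when stage samples are small and two arms are both effective:

```python
import random

def prob_correct_selection(p_arms, n_per_arm, n_sims=20000, seed=1):
    """Estimate how often the arm with the highest true response rate
    also has the highest observed response count in a small first stage.
    Ties are broken at random, as a selection rule might in practice."""
    random.seed(seed)
    best = p_arms.index(max(p_arms))
    correct = 0
    for _ in range(n_sims):
        # Simulate binomial response counts for each arm.
        counts = [sum(random.random() < p for _ in range(n_per_arm))
                  for p in p_arms]
        top = max(counts)
        winners = [i for i, c in enumerate(counts) if c == top]
        if random.choice(winners) == best:
            correct += 1
    return correct / n_sims

# Three experimental arms with true response rates of 20%, 40%, and 50%.
# With only 25 patients per arm, the truly best arm is chosen well under
# 100% of the time: small stages cannot reliably separate two moderately
# different effective arms.
selected = prob_correct_selection([0.20, 0.40, 0.50], n_per_arm=25)
print(selected)
```

The specific rates and the stage size of 25 are illustrative only; a real design exercise would vary them over plausible ranges, which is exactly why several iterations of simulation are usually needed.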
Considering the extra time to design, possible temporary closure of the trial after the phase II accrual is done, and additional delays due to expansion for the phase III, the phase II/III approach may not actually save any time. Screening several candidate agents and continuing with the best is probably the best setting for use of phase II/III designs, considering that resources for phase III are limited.

Factorial Designs

The simplest factorial design, a 2×2 design, tests two interventions at once. Patients are randomized to C versus D versus E versus DE, where C is the control arm, arm D adds an experimental agent d, arm E adds experimental agent e, and DE adds both. Efficacy of d is tested by comparing arms without d (C and E) to arms with d (D and DE), whereas efficacy of e is tested by
22 PITFALLS IN ONCOLOGY CLINICAL TRIAL DESIGNS AND ANALYSIS
comparing (C and D) to (E and DE). Provided the effect of e is the same with or without d (and similarly for d), this may provide an efficient way to test two agents in the same trial.

Pitfalls

In many cases a 2×2 factorial study is more accurately viewed as a four-arm study with the aim of identifying the best treatment arm rather than a study addressing two independent questions. Viewed this way, some of the design issues become apparent. A typical approach of doing both the d versus not-d and e versus not-e tests at the .05 level will result in a 10% chance of identifying an arm other than C as superior instead of 5%. As well, if power for each comparison is .9 and both d and e are efficacious, power for identifying DE as the best arm is only about .8, not .9. These problems are easily solved by increasing the sample size. Not so easily solved is how to address the possibility that the assumption of equal effect of each agent in the presence of the other may be incorrect. The usual approach is to test for interaction (i.e., testing whether the C vs. E treatment difference is different from the D vs. DE difference). Unfortunately, testing for a difference of differences has very poor power. Consider a factorial trial with a survival end point and power .9 to detect a hazard ratio (HR) of 1.33 for the test of d versus not-d. A possible testing strategy is to first test for interaction at the .1 level, then test d versus not-d and e versus not-e, each at level .05 if the interaction is not significant. If the interaction is significant, then do pair-wise testing of the four arms. Observations from a simulation of survival trials (26) indicated power for detecting an interaction of the same magnitude as the treatment difference is less than .5 despite testing at the .1 level. As well, if there are no differences among the arms, there is a 13% chance C will not be identified as the best arm, up from the ~10% when no interaction test is done.
If D is effective and E is not and there is no interaction, then the chance D will be identified as the best arm is 81%, down from the planned 90%. If both D and E are effective, the chance that DE is identified is 75%, down from the ~80% if interaction is not tested. The reason the type I error increases when there are no interactions is the extra testing, which results in additional errors. The reason power decreases is that subset testing (when the interaction test is significant) has poor power. So if there is no interaction, it is better not to test for it. Of course, we don’t know if there is an interaction, so we test anyway. Unfortunately, testing does not reliably lead to the choice of the best treatment arm when there are interactions. If there is an interaction, with D and E both superior but DE results in no additional improvement, then the chance of identifying D
and E but not DE is 35%. If there is an interaction of a different type, with D superior but E and DE are the same as C, then the chance of identifying D is 60%. Or if only DE is superior, the chance of identifying it is 42%. A factorial design example is a Children’s Oncology Group study of control chemotherapy (C) versus C plus MTP versus C plus ifosphamide versus C plus MTP plus ifosphamide (27) in patients with osteosarcoma. The planned analysis was to test MTP versus no MTP and ifosphamide versus no ifosphamide. Using this analysis for event-free survival, C plus MTP would have been concluded to be the preferred treatment. However, inspection of all four arms suggested an interaction: C plus ifosphamide was the worst arm, C plus ifosphamide plus MTP was the best, with the other two arms intermediate and about the same. This led to subset comparisons (C vs. each of the other arms), which were not significant. The paper concluded ifosphamide was not useful, and that MTP might be useful, but further investigation of the potential interaction was needed. Since the assessment of interaction was unplanned, the analysis and conclusion were criticized (28). Further follow-up (29) resulted in reduced evidence of an interaction, and a conclusion suggesting only MTP was useful based on a significant survival comparison. But this still resulted in controversy (30, 31)—this time due to the fact that an interaction could not be ruled out and usefulness of ifosphamide had not been disproved. All three letters to the editor concluded additional study was needed.
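The inflation of the error rate from doing both factorial tests at the .05 level can be checked with a toy simulation. This sketch uses binary outcomes and normal-approximation tests rather than the survival setting of the cited simulation (26); all numbers are illustrative:

```python
import math
import random

def two_prop_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def familywise_error(n_per_arm=200, p=0.5, n_sims=5000, seed=2):
    """Chance that at least one of the two factorial tests (d vs. not-d,
    e vs. not-e) is 'significant' at .05 when no arm truly differs."""
    random.seed(seed)
    hits = 0
    for _ in range(n_sims):
        # Arms C, D, E, DE, all with the same true success probability p.
        c, d, e, de = (sum(random.random() < p for _ in range(n_per_arm))
                       for _ in range(4))
        p_d = two_prop_z(d + de, 2 * n_per_arm, c + e, 2 * n_per_arm)
        p_e = two_prop_z(e + de, 2 * n_per_arm, c + d, 2 * n_per_arm)
        if min(p_d, p_e) < 0.05:
            hits += 1
    return hits / n_sims

fwer = familywise_error()
print(fwer)  # close to 0.10 rather than the nominal 0.05
```

Because the two factorial contrasts are orthogonal under the null, the chance of at least one false positive is roughly 1 − 0.95² ≈ 10%, matching the figure quoted in the text.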
ANALYSIS PITFALLS

Testing after Randomization

The choice of randomization scheme informs the choice of primary analysis. In particular, a trial with stratified randomization generally should employ a stratified test for the primary analysis, while a trial with simple randomization should not. Usually a model-based approach to analysis is used. Binary outcomes such as response are assumed to have a binomial distribution, whereas proportional hazards modeling is typically used for time-to-event end points. The modeling approach assumes the patients studied constitute a random sample from some population. Significance of the outcome is assessed against the hypothesized model of no difference. If results are unusual under the model, the model is rejected and treatment arms are concluded to be different.
Occasionally, randomization tests are done; these tests have the advantage that model assumptions are not required. Tests for rank statistics are examples. Outcomes are ordered and analysis is based on rankings rather than the actual outcome values. If the rankings in arm A are enough higher than in arm B, A is concluded superior in the group of patients studied. “Enough higher” is determined by assigning ranks to A and B in all possible ways given the design, and ascertaining whether the trial outcome is rare (an assignment of all the highest ranks to A would be rare). Validity of this sort of test depends on equal probability for the possible orderings of A and B. Fisher’s exact test and the randomization version of log-rank tests are justified by simple randomization.

Pitfalls

Randomization testing has strong advocates, but various considerations make the choice unpopular. Generating all possible orderings for a randomization test may be resource intensive. Due to the requirement for equal probability of orderings of A and B, the method does not allow tests of alternatives or generation of confidence intervals—so model-based methods are needed in any case. In addition, the tests are based on the distribution of treatments given the outcome. This is not what we are interested in—the point of doing the trials is to determine and compare outcomes given the treatment arms. Model-based tests, on the other hand, are based on the distributions of outcome given treatment and do allow for estimation and confidence intervals. But there are a number of pitfalls. One is that most trials do not consist of a random sample of a population, so it is rarely clear how generalizable results might be. We know that estimates from arms of the trial probably do not reflect population outcomes since patients who agree to clinical trials are systematically different from those who don’t.
What we do assume is that the relative effect of treatment in the trial will reasonably reflect the relative effect of treatment within the target population. This may or may not be true. A second pitfall is that inference may not be valid if the model is wrong. A particularly challenging example is quality of life. In order to compare treatment arms, model assumptions must be made about the information missing due to patient death or noncompliance. The missing-at-random assumption is nearly always incorrect, as are the assumptions necessary for a last-observation-carried-forward approach, since missing observations are most frequently due to bad outcome. Other methods have been developed to address nonrandom missingness, but assumptions are unverifiable and inference may still be invalid. Another example of the second pitfall is that logistic regression and
TABLE 22.3
Influence of Stratification on Testing

                                                      LOG-RANK          STRATIFIED LOG-RANK
                                                      (LEVEL, POWER)    (LEVEL, POWER)
Testing after stratified randomization
  No effect of factors                                .025, .92         .025, .91
  Large effect of factors                             .005, .58         .026, .93
Testing after randomization without stratification
  No effect of factors                                .023, .92         .026, .90
  Large effect of factors                             .025, .58         .026, .92
proportional hazards models do not yield unbiased estimators when the model is correct but important factors are omitted. If we guess wrong and don’t stratify on the factors with the strongest association with the primary end point, our estimates of treatment effect may be biased and power reduced (32) (also see Table 22.3). Other pitfalls arise when the principle of analysis reflecting design is not followed. Simple log-rank tests are often done after stratified randomization instead of or in addition to stratified tests. Although this seems harmless, the pitfall is loss of efficiency. Stratification reduces variability in outcome, but the simple log-rank is calculated according to the variability expected with simple randomization. The result is a conservative test with realized level lower than the nominal level specified by the design. Anderson et al. (33) simulated results for 400-patient studies with 8 strata under various assumptions. Analysis was done by log-rank or stratified methods. Table 22.3 shows that a log-rank test after randomization has appropriate level if either no stratification is done or if factors have no effect. But if a log-rank is done after stratified randomization and there is a large effect of factors, level may become very conservative, with correspondingly poor power. The table also illustrates reduced power when important factors are left out of a proportional hazards (PH) model (the log-rank is equivalent to a PH model with treatment but no covariates). For simple randomization, tests of balance in patient characteristics are routinely presented in manuscripts, with adjusted analyses done if significant imbalances are detected. One of the problems with this approach is that lack of a statistically significant difference in covariates does not mean that leaving a covariate out of a model will have no effect. As noted
above, leaving an important covariate out of a model may have an effect whether or not the covariate is balanced. Another problem is that a significant imbalance either means the trial wasn’t properly randomized (in which case there are bigger problems to worry about than whether or not to put a covariate in a model), or that the imbalance is just a random event, which happens in randomized trials and is not of much concern. Permutt (34) demonstrated that testing for covariate imbalance in a trial without stratification and adjusting based on the results has only a small effect on power compared to never adjusting. Not surprisingly, adjusting for the covariate most correlated with the outcome had the biggest effect. The primary analysis chosen must be specified clearly in the protocol prior to any results being known on the study. Covariates and methods of adjustment for covariates must be stated unambiguously; otherwise, whatever analysis is presented can be criticized as a possibly biased choice. If multiple tests are planned (log-rank, adjustment for stratification factors, adjustment for other covariates), then a suitable adjustment for multiple testing must also be planned.

Vanishing Denominator

Phase II analyses are often restricted to response-evaluable patients. Start out with all patients enrolled (N=40); omit patients who didn’t get at least two cycles of treatment, since you can’t expect the treatment to work with only one cycle (now down to N=32). Among these, omit patients without at least one on-study response assessment (N=25), since you don’t have enough scans to tell what the response was. Finally, omit patients who died within 3 months of enrollment, because obviously they didn’t meet the expected survival criteria for eligibility (N=20).

Pitfalls

This should be obvious.
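The shrinking denominators in the walkthrough above can be made concrete. Supposing, as in the example discussed in the text, that eight responses are observed, all among the fully "evaluable" patients:

```python
# Hypothetical counts matching the walkthrough: 40 enrolled, 32 received
# >= 2 cycles, 25 had an on-study assessment, 20 lived past 3 months,
# with 8 responders among the last group.
responders = 8
denominators = {
    "all enrolled": 40,
    "received >= 2 cycles": 32,
    ">= 1 on-study assessment": 25,
    "alive at 3 months": 20,
}
rates = {label: responders / n for label, n in denominators.items()}
for label, rate in rates.items():
    print(f"{label:26s} {responders}/{denominators[label]} = {rate:.0%}")
# The same 8 responses yield a "response rate" anywhere from 20% to 40%,
# depending on which patients are quietly dropped from the denominator.
```

The arithmetic is trivial, which is exactly the point: no new information is added at any step, yet the reported rate doubles.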
Patients who stop treatment early, who do not comply with on-study response assessments, or who die early typically go off treatment and stop coming for follow-up because of clinical progression, side effects outweighing the potential benefit of treatment, or death due to disease. Since early discontinuation is rarely due to the patient doing well, such patients are most likely nonresponders and need to be included in the analysis. For the example, suppose there are eight responders on study, all in the group of evaluable patients. The response estimates using all patients and the evaluable patients are 20% and 40%, respectively. Which makes the most sense? From the perspective of what might be told to a patient being offered this treatment, is it more useful to say that 20% of
patients who started treatment on this trial had a response, or that 40% of patients who made it through at least two cycles of treatment, who came back for their scans, and who lived 3 months had a response? ASCO abstracts can be relied on to provide examples. An ASCO 2008 abstract (35) reports modest clinical benefit (response or stable disease) in 5 of 19 patients (26%, 2 PR and 3 stable). If you add back in the six patients who either progressed within two cycles or withdrew consent, then you get a more realistic estimate of 20%. And if the four patients early in their therapy are included, the estimate drops to 17%. The activity of the agent is even more modest than reported. A further reduction in the denominator might be done by omitting patients with treatment deviations. This is done as if the omission will result in an unbiased estimate of the true effect of treatment—if treatment is effective, patients with reduced doses would not be expected to get the full effect of treatment. But if we believe this is legitimate, then we should also conclude that patients receiving placebo will have the same outcome regardless of compliance—placebo has no effect on outcome, so the amount of placebo should not have an effect on outcome. Not so. There are several nice examples in the literature demonstrating that patients who comply with placebo administration live longer than those who do not. The Coronary Drug Research Project Group, for instance, reported 15% mortality for patients taking ≥80% of their placebo compared to 28% mortality for patients receiving <80% (36). An extreme example is a trial of 46,486 men randomized to control versus prostate cancer screening (38). The authors reported 74 deaths from prostate cancer among 14,231 unscreened controls with only 10 deaths from prostate cancer in the screened group of 7,348 men during the first 11 years following randomization, and an estimated reduction of 62% from a proportional hazards model.
Sounds great until you question what happened to the other 24,907 randomized subjects. In fact, 23,785 of the 31,133 men randomized to screening did not agree to be screened. In the other arm, 1,122 of the 15,353 men randomized to no screening did get screened. The authors chose to do the primary analysis by omitting all of these subjects (second publication, 38) and by analyzing according to the received arm using all patients (first publication, 37), but not according to randomization. Serious flaws are pointed out in commentary (39–41) to these articles. Bias is introduced by the fact that subjects who were diagnosed with prostate cancer after randomization but before their first screening were not analyzed on the screened arm. More bias is introduced
by starting screened patients at time of first screening (median 3 years after randomization), but starting unscreened patients at time of randomization. These choices, plus the observation that rates of death due to prostate cancer increase over time for several years after a disease-free assessment, would suggest bias in favor of the screening arm. Another source of bias is that the characteristics of subjects who agree to be screened on the screening arm, and of those who choose to be screened on the nonscreening arm, will not be the same as the characteristics of subjects who do not. In fact, the observed rate of prostate cancer death is somewhat higher in those who did not get screened on the screening arm than in those in the nonscreened arm, again suggesting potential bias in favor of the screening arm. When deaths due to prostate cancer are compared according to randomized assignment, there is little suggestion of benefit—153 deaths in the screening group and 75 deaths in the control group. Basically, the trial provides little useful information concerning the benefit of prostate cancer screening.

Subsets

Clinical trials allow us to conclude usefulness of new treatments on average in a population of patients. The clinician’s dilemma is that most patients are not average, but have a unique set of prognostic factors, genetic factors, disease characteristics, and comorbidities, making it unclear whether a particular trial is applicable. Phase III analysis usually includes a series of subset comparisons with the aim of exploring differences in effect of treatment in subsets. In a positive trial of a toxic agent, it is tempting to search for subsets of patients who might not need treatment, whereas in a negative trial the search is for a subset of patients for whom the treatment provides benefit.

Pitfalls

Trials are usually designed to have adequate power for differences when all patients are included in the analysis.
Thus, subset analyses will have inadequate power due to the reduced sample sizes. At the same time, many tests are done. The more tests done, the more chance of a false positive result; so testing at the usual .05 level results in more than a 5% error rate. Consider a simple case of a 600-patient two-arm trial with two-sided level .05 testing and power .9 for a 33% improvement in survival. If there are two markers with four subsets (+/+, +/−, −/+, −/−), with 50, 100, 150, and 300 patients in the subsets, then there is a 19% chance of a false positive result (significance in at least one of the four subsets when there is no difference in any), and powers for the four subsets are 15%, 26%,
37%, and 63%, respectively, if there is a 33% improvement in each. Since the aim of subset analysis is to identify differential effects of treatment by subgroup, the appropriate statistical test to use is a test for interaction (i.e., a test comparing the treatment difference in one subset vs. the treatment difference in the other subset). Unfortunately, testing for interaction requires a lot of patients. Similar to the factorial design example, for our sample trial of 600 patients, power to detect a marker-treatment interaction is only .39 when there is no difference in marker-negative patients and a 33% improvement in marker-positive patients. Given bad power and bad level, and likely imbalances in other patient characteristics, subset results are highly unreliable whether positive or negative. Rothwell (42) discusses subset analysis issues, including the famous ISIS-2 example (43, 44) in which Libra and Gemini patients did not benefit from aspirin treatment, while patients with other astrological signs did. The paper also cites the initial Early Breast Cancer Trialists’ Collaborative Group breast cancer overview indicating younger women did not benefit from tamoxifen, while later overviews indicate they do (45, 46). If large overviews with prespecified subset analyses can be unreliable, what hope is there for subsets of small trials? Recommendations for subset analysis in the Rothwell paper include, among others: (1) define subsets of interest up front since there are no statistically reliable tests for unplanned subset analyses, (2) limit planned subset analyses to a few clearly defined key factors, (3) state the expected direction of difference, (4) stratify on factors for which a differential effect of treatment is anticipated, (5) don’t report p-values for the individual subsets, rather report the tests for interaction, (6) report all tests done, not just the significant ones, (7) adjust for multiple testing, and (8) focus on the primary outcome.
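The 19% family-wise error figure quoted earlier for the four-subset example can be verified directly. Since the four subsets are disjoint, the tests are independent, and the calculation below (illustrative, with a simulation cross-check) follows:

```python
import random

# Chance of at least one false positive among four subset tests, each
# two-sided at level .05, when treatment truly has no effect anywhere.
analytic = 1 - 0.95 ** 4
print(f"analytic: {analytic:.3f}")  # about .185, the ~19% in the text

# Simulation check using four independent standard normal z-statistics,
# each declared "significant" when |z| > 1.96.
random.seed(4)
n_sims = 20000
hits = sum(
    any(abs(random.gauss(0, 1)) > 1.96 for _ in range(4))
    for _ in range(n_sims)
)
simulated = hits / n_sims
print(f"simulated: {simulated:.3f}")
```

The same calculation scales in the obvious way: with ten subset tests the chance of at least one spurious "finding" is about 40%.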
Even with all this, Rothwell concludes results must still be validated in other trials before they can be accepted. Wang et al. (47) assessed the completeness and quality of subset analyses in 97 papers reporting randomized trials published in the New England Journal of Medicine between July 2005 and June 2006. Only 21 of the 59 trials reporting subset results included subset analysis in the Methods section; the number of subgroup analyses done was not reported in nine articles, whether any analyses were prespecified was not reported in 40, no interaction tests were reported in 32, at least 17 tested more than eight subsets, only 25 reported treatment results consistently in all subsets examined, and of 15 trials claiming heterogeneous
results, only 2 papers had cautions about interpretation due to multiplicity. Quality of reporting was not ideal.

Competing Risks

Not infrequently, the primary end point of a trial is a composite of different types of end points. For instance, disease-free survival (DFS) is a composite of regionally recurrent disease, distantly recurrent disease, second primary disease, death due to disease, and death due to other causes. If overall differences are not significant, it is tempting to examine each component to see if treatment had an effect on any of the components.

Pitfalls

A common but inappropriate method used to analyze a component of a composite (e.g., distant relapse) end point is to censor at the time of other components if they occur first. As noted previously, this would be appropriate only if the censoring mechanism were independent of the outcome. This is clearly not the case in this setting. Patients with a regional relapse may be more likely to develop distant disease than those without. Death due to disease may imply occult metastases. There may be factors related to development of metastases in common with development of other cancers or noncancer death, such as sensitivity to treatment. Figure 22.1, illustrating potential bias due to drop-out, applies equally well here. The top curve would represent the case where other events are related to a lower rate of distant metastases, the bottom represents the case of an increased rate, and the middle curve is the potentially biased Kaplan-Meier estimate. Consider that if there is no overall difference in a composite end point, then any difference in one of the components must be offset by a difference in the other direction in another component. The trade-off may not be a good one. The 2000 EBCTCG overview of radiation therapy trials in breast cancer (48), for instance, indicates a large reduction in local relapses due to RT (9% vs. 27% at 10 years) with a modest reduction in breast cancer deaths.
Good news for RT, except there was also an increase in noncancer deaths, resulting in no significant overall improvement in survival at 20 years (37% vs. 36%). The trial of prophylactic CNS radiation therapy in childhood lymphoblastic lymphoma (5), described above, provides another example of competing risks. Five types of failures were of interest: isolated CNS relapse, combined CNS relapse, non-CNS relapse, second primaries, and unrelated death. The primary end point was overall event-free survival at 2 years. The result for the control group was 91% compared to 86% in the group with no PCRT. The lower
bound for the confidence interval of the difference was −11%. Considering the serious risk of late effects due to PCRT and the potential for salvage in the event of CNS relapse, this was considered acceptable according to prespecified criteria. Various additional analyses related to the different types of failures were done to glean additional insight. Comparisons using just lymphoma relapses were consistent with the overall failure results (88% for no PCRT and 91% for control). However, the comparisons of cumulative incidences of each type of failure were not so straightforward, with the curious observation of similar CNS relapse but higher non-CNS relapse in the no PCRT group. The multiple testing problem applies here, as it does for subset analysis. If several components are tested, there are several opportunities for a false positive result—and the overall probability that at least one comparison will be positive, even when there are no differences, will be inflated. With six different comparisons of subtypes of the overall end point in this study (plus an overall survival comparison and sensitivity analyses), the chances of seeing an interesting difference were high. Whether the increase in non-CNS relapse in the no PCRT arm was due to chance, PCRT, the difference in induction regimen, or another reason cannot be ascertained from the trial. There is also a problem analogous to the small sample problem for subsets. Power for detecting differences with respect to a time-to-event end point is related to the number of events. If a study is designed to have adequate power for the composite event, it will not have adequate power for the subtypes of events, since the number of events of a composite end point is the sum of the numbers of subtype events. In this study the observed similarity of CNS relapse may be more related to the low rate of CNS events (three observed on no PCRT and one on control) than to a true lack of difference.
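The dependence of power on the number of events can be quantified with a standard approximation (Schoenfeld's formula) for the events needed by a two-sided level-α log-rank test with 1:1 randomization: D = 4(z₁₋α/₂ + z₁₋β)² / (ln HR)². A quick sketch, with an illustrative hazard ratio of 1.33:

```python
import math

def normal_quantile(p):
    """Inverse standard normal CDF via bisection on erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def events_needed(hr, alpha=0.05, power=0.9):
    """Schoenfeld approximation: number of events required for a
    two-sided level-alpha log-rank test with 1:1 randomization."""
    za = normal_quantile(1 - alpha / 2)
    zb = normal_quantile(power)
    return 4 * (za + zb) ** 2 / math.log(hr) ** 2

# Roughly 500+ events are needed for the composite end point; a
# component subtype contributing only a fraction of those events
# falls far short of this.
d = events_needed(1.33)
print(round(d))
```

Since a component failure type contributes only a fraction of the composite's events, its comparison is correspondingly underpowered, which is the point made in the text.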
There are two basic ways to analyze competing risks data, each with a different interpretation: the subdistribution (also known as cumulative incidence) approach and the cause-specific hazards approach. For the subdistribution approach, the distribution of time to each subtype as a first event is estimated. The sum of these subdistributions will be equal to the distribution of the composite end point. Figure 22.4 shows a recurrence-free survival distribution along with its component subdistributions of local/regional recurrence, distant metastases, and death without disease. A comparison of distant recurrence subdistributions for two treatments (A and B) tests whether or not the probabilities of distant recurrence as a first event are different.
[Figure 22.4 appears here: cumulative failure probability plotted against time (0 to 20), showing the overall failure (disease-free survival) curve together with its component subdistribution curves: distant recurrence as first event, local/regional recurrence as first event, and death without disease as first event.]

FIGURE 22.4 Example of subdistributions for a composite end point.
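Both competing-risks summaries, the subdistribution probabilities just described and the cause-specific rates discussed next, can be contrasted numerically with a small hypothetical progression-free survival example (the same numbers are worked through in the text below):

```python
# Hypothetical PFS data: 20 patients per arm; before the first scan at
# time T, 10 patients die on arm A and 5 on arm B; the scan then shows
# progression in 5 patients on each arm.
n = 20
deaths = {"A": 10, "B": 5}
progressions = {"A": 5, "B": 5}

for arm in ("A", "B"):
    # Subdistribution (cumulative incidence): progressions as a first
    # event out of everyone who started on the arm.
    cum_inc = progressions[arm] / n
    # Cause-specific rate at time T: progressions out of patients still
    # at risk (i.e., not already dead) when the scan is done.
    at_risk = n - deaths[arm]
    rate = progressions[arm] / at_risk
    print(f"arm {arm}: cumulative incidence {cum_inc:.2f}, "
          f"cause-specific rate at T {rate:.2f}")
# The cumulative incidences are equal (0.25 vs. 0.25) while the
# cause-specific rates differ (0.50 vs. ~0.33): the two approaches
# answer different questions about the same data.
```

The numbers are illustrative; the design choice is which question the analysis should answer, not which method is universally correct.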
The second way to analyze is using cause-specific proportional hazards models. For this analysis the relative rate of failure is assessed rather than a comparison of the number of failures. At each distant recurrence time the number of patients still at risk on each treatment arm is calculated. If k of the N patients still at risk are from arm A, then the chance the failure is from arm A should be k/N. If there are more failures than expected on arm A, the failure rate is concluded higher on A than on B. To illustrate, consider a simple progression-free survival (PFS) example, with 20 patients on each arm and the first scan done at time T. Suppose 10 patients on A and 5 patients on B die before the scan, and the scan confirms progression in 5 patients on each arm. The estimated probability of progression as the first event by time T is equal on each arm, 5/20. In contrast, the rate of progression at time T is based on the number of patients still at risk, 10 on A and 15 on B. For this approach, results are not equal: 5/10 on arm A and 5/15 on arm B. If analyses of specific types of failure are planned, choose the approach that best reflects the question of interest.

Outcome-by-Outcome Analysis

The classic oncology example of outcome-by-outcome analysis is comparison of survival according to response. It used to be common to see plots accompanied by inappropriate statistical analysis; now these particular analyses are done much less frequently, with more suitable statistical methods, and with appropriate caveats concerning interpretation (49). But although these have declined, the lessons learned have
not always been applied to different outcome-by-outcome analyses.

Pitfalls

Patients must live long enough to have a successful outcome, so there is a built-in correlation between survival and any other outcome that requires time to occur. For a study with a 6-week tumor assessment interval, patients must live at least 3 months to have a confirmed response. Everyone who dies before 3 months is defined as a nonresponder, so it should not be a surprise that nonresponders don’t live as long as responders; the 3-month lag time almost assures it. It might be that the early deaths occurred because the patients didn’t respond; then again, perhaps the patients didn’t respond because they died too soon. If the lag time is very short this is less of an issue, but it does not eliminate problems in interpretation. A better way to compare is to consider response status at 3 months among patients still alive and see if survival after 3 months is related to status. Although this approach reduces bias, a difference will still not be convincing evidence that response confers a survival advantage. Factors related to both outcomes may result in a correlation even if there is no causal relationship. For instance, patients with lower tumor burden or fewer prior treatments may be more likely to have a tumor response and may also be more likely to live longer independent of tumor response. Common prognostic factors can never be ruled out. This type of analysis has been used to try to demonstrate surrogacy. The claim that response is a surrogate end point is based in part on the reasoning that since responders live longer than nonresponders, then increasing the number of responders must increase
22 PITFALLS IN ONCOLOGY CLINICAL TRIAL DESIGNS AND ANALYSIS
survival. As explained in the above discussion, the reasoning is faulty. We cannot necessarily conclude that response confers a survival advantage. But even if we could establish a strong relationship, the observation would not be sufficient to conclude effectiveness of treatment. In addition to a difference in survival between responders and nonresponders, we would also need the magnitude of the difference in percent responding between two arms to predict the magnitude of the difference in survival between the arms. Otherwise, incorrect conclusions may be drawn. For instance, a treatment with a 50% response rate and a short increase in survival among responders will not be superior to a treatment with a 40% response rate and a long increase in survival, despite the association of response with survival and the higher response rate on the first arm.

Another long-standing example of outcome-by-outcome analysis is that of survival by dose received. Such analyses have been used to suggest that high-dose treatment is better than low-dose. A classic paper by Redmond et al. (50) demonstrates that with the same data you can prove that either high or low doses are better just by picking a different analysis. If high dose is defined as percent of planned total dose received, then patients who fail before the end of treatment are automatically low-dose patients. They weren't on trial long enough to be high-dose patients. In fact, low doses are due to failure in this analysis, not the other way around. On the other hand, if high dose is defined as percent of planned dose until discontinuation, then patients who fail right away have the highest doses because there was little opportunity for dose adjustment or delay. For this analysis, lower doses will appear to confer benefit. The most striking observation in this paper is the apparent dose response on the placebo arm, which remained even when more appropriate analyses were done.
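The circularity described by Redmond et al. can be reproduced in a few lines of simulation. The sketch below uses entirely invented numbers (no real trial data): failure times are generated independently of dose by construction, yet the first definition of dose received makes early failures look like low-dose patients, while the second makes them look like high-dose patients.

```python
import random

random.seed(1)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Simulate a trial in which delivered dose has NO true effect on outcome.
# Planned treatment: 12 cycles; failure times are generated independently of dose.
patients = []
for _ in range(200):
    fail_cycle = random.randint(1, 24)         # failure time, independent of dose
    cycles_given = min(fail_cycle, 12)         # treatment stops at failure or plan end
    # Dose reductions accumulate the longer a patient stays on treatment.
    delivered = sum(1.0 - 0.02 * c * random.random() for c in range(cycles_given))
    pct_of_planned_total = delivered / 12.0    # definition 1: % of planned TOTAL dose
    pct_until_stop = delivered / cycles_given  # definition 2: % of planned dose until discontinuation
    patients.append((fail_cycle, pct_of_planned_total, pct_until_stop))

early = [p for p in patients if p[0] <= 6]     # fail well before planned end of treatment
late = [p for p in patients if p[0] > 12]      # complete all planned treatment

# Definition 1 labels early failures "low dose"; definition 2 labels them "high dose".
print(round(mean(p[1] for p in early), 2), round(mean(p[1] for p in late), 2))
print(round(mean(p[2] for p in early), 2), round(mean(p[2] for p in late), 2))
```

Either "dose effect" here is pure artifact, since dose never entered the failure-time generator.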
Perhaps patients with the ability to stay on a treatment regimen have better general health and prognosis.

Outcome-by-outcome analyses are still being done despite the pitfalls, with various reports in the literature examining the association of response or survival with toxicities, such as the association of rash with outcome in patients treated with cetuximab, or with changes in biomarkers, such as the association of survival with FDG-PET response. Vermorken et al. (51) report an analysis of efficacy by skin reactions in 103 advanced head and neck cancer patients treated with cetuximab. Associations of most end points with rash were not demonstrated. However, for disease control rate (CR, PR, or stable) the authors note no association with early acne-like rash (25/51 disease control for no rash, 22/52 for grade 1–2 rash) but a trend toward higher disease control rate for any acne-like rash (7/28
with disease control for no rash, 39/74 for grade 1–2, 1/1 for grade 3–4). Here we have the classic pitfall. Twenty-three additional patients developed rashes after additional follow-up. Late rashes would occur more often in patients who attained at least stable disease, whether or not rashes were related to efficacy. Cox time-dependent modeling is a better approach to analysis, as the model incorporates the time of occurrence of the rash. This approach was used in Burtness et al. (52) in a 117-patient head and neck cancer trial of cisplatin plus placebo versus cisplatin plus cetuximab. The analysis demonstrated a strong association of skin toxicity with survival on the cetuximab arm. Seventy-seven percent of patients on the cetuximab arm had skin toxicity versus 24% on placebo—yet overall survival was not shown to improve due to cetuximab. Based on consistency of association of rash with survival in single-arm trials, Herbst et al. (53) and other authors have suggested that rash may be an important marker for efficacy related to the mechanism of action of cetuximab. The Burtness results cast doubt on the hypothesized importance.

Censoring when Data are Missing

There is a perception that censoring cures all problems with time-to-event data. Consider time-to-progression. Patient died before progression? Censor. Data are submitted late? Censor. Patient lost to follow-up? Censor. Patient crossed over to a new treatment before progression? Censor.

Pitfalls

As noted before, the difficulty with this approach is that censorship requires the assumption that the censoring mechanism is independent of the end point under study. This is a reasonable assumption if the cause of censorship is that the patient still doesn't have the event at the time of analysis. It is not a reasonable assumption if there are factors related to censorship that are also related to outcome. Association can go either way. Consider censorship for death without progression.
Compromised immune status may make both progression and other causes of death more likely. Or treatment-sensitive patients may have a reduced rate of progression but an increased rate of toxic death. It is also not a reasonable assumption that late data are unrelated to outcome. Deaths typically are reported more promptly than last contact dates. The sequence of graphs due to Benner (54) in Figure 22.5 is a nice illustration of the effect of sloppy reporting. The first graph (Fig. 22.5(A)) shows simulated data from a randomized trial of 800 patients enrolled on the same day with complete follow-up on
ONCOLOGY CLINICAL TRIALS
FIGURE 22.5(A)
Survival results at 30 months from a simulated two-arm trial with all patients followed for 30 months. (Kaplan-Meier plot of the survival distribution function versus time in months for treatment arms grp = 0 and grp = 1, with censored observations marked.)

FIGURE 22.5(B)
Survival results at 30 months with not all patients followed for 30 months, but follow-up complete at the time of analysis for all patients. (Same axes as Fig. 22.5(A).)
all patients through 30 months after study start. With complete follow-up, there are 697 deaths and the log-rank test p-value is .002. The second graph (Fig. 22.5(B)) shows the estimate resulting from the same data set if accrual takes place over 18 months with 12 months of additional follow-up. Here all deaths have been reported and all follow-up on living patients is up-to-date at the time of analysis. Results in this case look much like results from the complete data set. There are fewer deaths (624) due to less follow-up time, but the correct difference is evident, with test p=.01.
The third graph (Fig. 22.5(C)) shows what happens to the same data set when some patients are lost and data for others are submitted late. Only 444 of the 624 actual deaths have been reported. The estimate becomes much less precise, some bias may have been introduced, and although there might still be a trend, the difference is no longer significant, with p=.12. The final panel (Fig. 22.5(D)) illustrates a particularly bad case. It is common for follow-up to be considerably less frequent after a patient goes off treatment. Here we have the result when all 624 deaths are eventually reported, last contact dates are
FIGURE 22.5(C)
Survival results at 30 months with additional censorship due to late data submission and loss to follow-up. (Same axes as Fig. 22.5(A).)

FIGURE 22.5(D)
Survival results at 30 months with additional censorship because deaths, but not last contact dates, are reported after discontinuation of treatment. (Same axes as Fig. 22.5(A).)
reported for patients still on treatment, but last contact date is not submitted after the patient goes off treatment. In this case, substantial bias is introduced with even less suggestion of a possible difference (p=.36). The assumption of independence of events and the censoring mechanism is not correct in this case. Patients go off treatment mainly due to progression, and risk of death for patients who have progressed is elevated. The main message from this sequence of curves is that there are no easy analysis fixes when data are seriously flawed. Data quality is of utmost importance for reliable study results.
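The bias from informative censoring can be made concrete with a small simulation. The sketch below is illustrative only (a hand-rolled Kaplan-Meier and invented exponential rates, not the Benner data): when follow-up effectively stops at progression, patients are censored just before their deaths, and the survival estimate is pushed upward.

```python
import random

random.seed(7)

def km_at(data, t):
    """Kaplan-Meier estimate of S(t) from (time, event) pairs; event=True means death."""
    n = len(data)
    s = 1.0
    for time, event in sorted(data):
        if time > t:
            break
        if event:
            s *= (n - 1) / n
        n -= 1
    return s

full, lossy = [], []
for _ in range(2000):
    t_prog = random.expovariate(1 / 6.0)            # time to progression
    t_death = t_prog + random.expovariate(1 / 2.0)  # death follows soon after progression
    full.append((t_death, True))                    # complete follow-up: death observed
    if random.random() < 0.7:
        # Last contact is never updated once the patient goes off treatment,
        # so the analysis only "sees" this patient up to progression.
        lossy.append((t_prog, False))
    else:
        lossy.append((t_death, True))

print(round(km_at(full, 10.0), 2))   # near the true S(10)
print(round(km_at(lossy, 10.0), 2))  # biased upward: censoring depends on prognosis
```

The censored patients carry an imminent death hazard, but the Kaplan-Meier calculation treats them as exchangeable with the average patient still at risk, including those who have not yet progressed.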
CONCLUSION

Clinical research is fraught with pitfalls. Patients are not passive interchangeable units in precise experiments. Investigators are not unbiased observers of results. Statisticians do not have infallible methods. Patients don't comply, data aren't submitted correctly, and flaws in execution bias results. An appropriate control can be hard to come by, design short cuts often aren't, and trials don't answer all of our questions. Models are wrong, biased analyses lead to incorrect conclusions, and many tests lead to many errors. Yet clinical trials remain our best tool for assessing the usefulness of treatments. The more pitfalls we anticipate and manage, the more reliable our studies will be. We owe the patients who agree to participate on our trials quality research and correct answers.
References

1. Zia M, Siu L, Pond G, Chen E. Comparison of outcomes of phase II studies and subsequent randomized control studies using identical chemotherapy regimens. J Clin Oncol. 2005;23:6982–6991.
2. Jacobson C, Rettig R, Aubry W. Litigating the science of breast cancer treatment. J Health Policy Law. 2007;32:785–818.
3. Farquar C, Marjoribanks J, Lethaby A, Basser R. High dose chemotherapy for poor prognosis breast cancer: systematic review and meta-analysis. Cancer Treat Reviews. 2007;4:325–337.
4. Cheung C, Liu Y, Wong K, et al. Can daclizumab reduce acute rejection and improve long-term renal function in tacrolimus-based primary renal transplant recipients? Nephrology. 2008;13:251–255.
5. Burkhardt B, Woessmann W, Zimmermann M, et al. Impact of cranial radiotherapy on central nervous system prophylaxis in children and adolescents with central nervous system—negative stage III or IV lymphoblastic lymphoma. J Clin Oncol. 2006;24:491–499.
6. Makuch R, Simon R. Sample size consideration for nonrandomized comparative studies. J Chron Dis. 1980;33:175–181.
7. Dixon D, Simon R. Sample size consideration for studies comparing survival curves using historical controls. J Clin Epidemiol. 1988;41:1209–1213.
8. O'Malley J, Normand SL, Kuntz R. Sample size calculation for a historically controlled clinical trial with adjustment for covariates. J Biopharm Stat. 2002;12:227–247.
9. Lee J, Tseng C. Uniform power method for sample size calculation in historical control studies with binary response. Cont Clin Trials. 2001;22:390–400.
10. USDHHS, Food and Drug Administration. Guidance for industry: E10 choice of control group and related issues in clinical trials. Office of Training and Communication, Center for Drug Evaluation and Research, Rockville, MD; 2001.
11. Kaplan E, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481.
12. Shapiro A, Shapiro K. The Powerful Placebo: From Ancient Priest to Modern Physician. Baltimore, MD: Johns Hopkins University Press; 1997.
13. Kilic S, Ergin H, Baydinc Y. Venlafaxine extended release for the treatment of patients with premature ejaculation: a pilot, single-blind, placebo-controlled, fixed-dose crossover study on short-term administration of an antidepressant drug. Int J Androl. 2005;28:47–52.
14. Evans M, Pritts E, Vittinghoff E, McClish K, Morgan K, Jaffe R. Management of postmenopausal hot flushes with venlafaxine hydrochloride: a randomized, controlled trial. Obstetrics & Gynecology. 2005;105:161–166.
15. Pollack M, Lepola U, Koponen H, et al. A double-blind study of the efficacy of venlafaxine extended-release, paroxetine, and placebo in the treatment of panic disorder. Depress Anxiety. 2006; Aug 7, Epub ahead of print.
16. Gelenberg A, Lydiard RB, Rudolph R, Aguiar L, Haskins JT, Salinas E. Efficacy of venlafaxine extended-release capsules in nondepressed outpatients with generalized anxiety disorder: a 6-month randomized controlled trial. J Am Med Assoc. 2000;283:3082–3088.
17. Ozyalcin S, Talu GK, Kiziltan E, Yucel B, Ertas M, Disci R. The efficacy and safety of venlafaxine in the prophylaxis of migraine. Headache. 2005;45:144–152.
18. Hróbjartsson A, Forfang E, Haahr M, Als-Nielsen B, Brorson S. Blinded trials taken to the test: an analysis of randomized clinical trials that report tests for the success of blinding. Int J Epidemiol. 2007;36:654–653.
19. Boutron I, Estellat C, Ravaud P. A review of blinding in randomized controlled trials found results inconsistent and questionable. J Clin Epidemiol. 2005;58:1220–1226.
20. Thall P, Cook J. Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004;60:684–693.
21. Bekele BN, Shen Y. A Bayesian approach to jointly modeling toxicity and biomarker expression in a phase I/II dose-finding trial. Biometrics. 2005;61:344–354.
22. Thall P, Simon R, Ellenberg S. Two-stage selection and testing designs for comparative clinical trials. Biometrika. 1988;75:303–310.
23. Schaid D, Wieand S, Therneau T. Optimal two-stage screening designs for survival comparisons. Biometrika. 1990;77:507–513.
24. Liu Q, Pledger G. Phase 2 and combination designs to accelerate drug development. J Am Stat Assoc. 2005;100:493–502.
25. Inoue L, Thall P, Berry D. Seamlessly expanding a randomized phase II trial to phase III. Biometrics. 2002;58:823–831.
26. Green S, Liu PY, O'Sullivan J. Factorial design considerations. J Clin Oncol. 2002;20:3424–3430.
27. Meyer P, Schwartz C, Krailo M, et al. Osteosarcoma: a randomized prospective trial of the addition of ifosfamide and/or muramyl tripeptide to cisplatin, doxorubicin, and high-dose methotrexate. J Clin Oncol. 2005;23:2004–2211.
28. Romet-Lemonne J, Mills B, Fridman W, Munsell M. Correspondence. Prospectively planned analysis of data from a phase III study of liposomal muramyl tripeptide phosphatidylethanolamine in the treatment of osteosarcoma. J Clin Oncol. 2005;23:6437–6438.
29. Meyers P, Schwartz C, Krailo M, et al.
Osteosarcoma: the addition of muramyl tripeptide to chemotherapy improves overall survival—a report from the Children's Oncology Group. J Clin Oncol. 2008;26:633–638.
30. Bielack S, Marina N, Ferrari S, Helman L. Osteosarcoma: correspondence. The same old drugs or more? J Clin Oncol. 2008;26:3102–3103.
31. Hunsberger S, Freidlin B, Smith M. Correspondence. Complexities in interpretation of osteosarcoma clinical trial results. J Clin Oncol. 2008;26:3103–3104.
32. Anderson G, Fleming T. Model misspecification in proportional hazards regression. Biometrika. 1995;82:527.
33. Anderson G, LeBlanc M, Liu P, Crowley J. On use of covariates in randomization and analysis of clinical trials. In: Crowley J, Ankerst D, eds. Handbook of Statistics in Clinical Oncology. Boca Raton, FL: Chapman and Hall/CRC; 2006.
34. Permutt T. Testing for imbalance of covariates in controlled experiments. Stat Med. 1990;9:1455–1462.
35. Lestingi T, Tolzien K, Galvez A, et al. A phase II study with single agent erlotinib in chemotherapy-naïve androgen independent prostate cancer (AIPC). Annual Meeting Proceedings [abstract 16104]. Am Soc Clin Oncol; 2008.
36. Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the coronary drug project. N Engl J Med. 1980;303:1038–1041.
37. Labrie F, Candas B, Dupont A, et al. Screening decreases prostate cancer death: first analysis of the 1988 Quebec prospective randomized controlled trial. Prostate. 1999;38:83–91.
38. Labrie F, Candas B, Dupont A, et al. Screening decreases prostate cancer mortality: 11 year follow-up of the 1988 Quebec prospective randomized controlled trial. Prostate. 2004;59:311–318.
39. Boer R, Schröder F. Letter to the editor. Quebec randomized controlled trial on prostate cancer screening shows no evidence of mortality reduction. Prostate. 1999;40:130–131.
40. Pinsky P. Letters to the editor. Labrie et al. Prostate. 2004;61:371.
41. Elwood M. A misleading paper on prostate cancer screening. Prostate. 2004;61:372.
42. Rothwell P. Subgroup analysis in randomized controlled trials: importance, indications and interpretation. Lancet. 2005;365:176–186.
43. ISIS-2 Collaborative Group. Randomized trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988;2:349–360.
44. Collins R, MacMahon S. Reliable assessment of the effects of treatment on mortality and major morbidity. I: Clinical trials. Lancet. 2001;357:373–380.
45. Early Breast Cancer Trialists' Collaborative Group. Effects of adjuvant tamoxifen and of cytotoxic therapy on mortality in early breast cancer. An overview of 61 randomized trials among 28,896 women. N Engl J Med. 1988;319:1681–1692.
46. Early Breast Cancer Trialists' Collaborative Group. Tamoxifen for early breast cancer (Cochrane Review). Cochrane Database Syst Rev. 2001;1:CD000486.
47. Wang R, Lagakos S, Ware J. Statistics in medicine—reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357:2189–2194.
48. Early Breast Cancer Trialists' Collaborative Group. Favourable and unfavourable effects of radiotherapy for early breast cancer: an overview of the randomized trials. Lancet. 2000;355:1757–1770.
49. Anderson J, Cain K, Gelber R. Analysis of survival by tumor response and other comparisons of time-to-event by outcome variables. J Clin Oncol. 2008;26:3913–3915.
50. Redmond C, Fisher B, Wieand H. The methodologic dilemma in retrospectively correlating the amount of chemotherapy received in adjuvant therapy protocols with disease-free survival. Cancer Treat Reports. 1983;67:519–526.
51. Vermorken J, Trigo R, Koralweski P, et al. Open-label, uncontrolled, multicenter phase II study to evaluate the efficacy and toxicity of cetuximab as a single agent in patients with recurrent and/or metastatic squamous cell carcinoma of the head and neck who failed to respond to platinum-based therapy. J Clin Oncol. 2007;25:2171–2177.
52. Burtness B, Goldwasser M, Flood W, et al. Phase III randomized trial of cisplatin plus placebo compared with cisplatin plus cetuximab in metastatic/recurrent head and neck cancer: an Eastern Cooperative Oncology Group study. J Clin Oncol. 2005;23:8646–8654.
53. Herbst R, Arquette M, Shin D, et al. Phase II multicenter study of the epidermal growth factor receptor antibody cetuximab and cisplatin for recurrent and refractory squamous cell carcinoma of the head and neck. J Clin Oncol. 2005;24:5578–5587.
54. Benner R. Personal communication; 2008.
23
Biomarkers and Surrogate End Points in Clinical Trials
Marc Buyse and Stefan Michiels
Biomarkers will become important in the clinic over the years to come, for several reasons. First, an increasing number of new drugs have a well-defined mechanism of action at the molecular level, allowing drug developers to measure the effect of these drugs on the relevant biomarkers, as well as select patients likely to respond to these drugs. Second, there is increasing public pressure for new, promising drugs to be approved for marketing as rapidly as possible, and such approval will often have to rely on biomarkers rather than on some long-term clinical end point. Last, if the approval process is shortened, there will be a corresponding need for earlier detection of safety signals that could point to toxic problems with new drugs. It is likely, therefore, that the evaluation of tomorrow's drugs will be based primarily on biomarkers, rather than on the longer-term, harder clinical end points that have dominated the development of cancer drugs until now. The purpose of this chapter is to review different types of biomarkers, and to discuss their identification, validation, and use in clinical trials.
DEFINITIONS

A biomarker can be formally defined as "a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention" (1). Biomarkers can include biochemical markers, cellular markers, cytokines, genetic markers, gene expression profiles, imaging markers, physiological markers, or any other patient or tumor measurements. They can be measured once before a treatment is administered, or repeatedly before, during, and after the treatment is administered, in which case interest focuses on changes in the biomarkers over time. In contrast to biomarkers, clinical end points directly measure "how a patient feels, functions, or survives" (2). It will be convenient to categorize biomarkers into three broad groups:
· Prognostic biomarkers, which affect the outcome of patients in terms of a clinical end point.
· Predictive biomarkers, which modify the effect of a specific treatment on a clinical end point.
· Surrogate biomarkers, which may replace a clinical end point in clinical trials carried out to evaluate the effect of a specific treatment.

STATISTICAL CRITERIA FOR BIOMARKERS
Figure 23.1 schematically represents the generic situation of interest in this chapter. Patients are randomized to receive either a standard (Std) or an experimental (Exp) treatment. A biomarker and a clinical end point are observed for each patient. Let us
FIGURE 23.1
Generic situation of interest, depicting relationships between treatment, biomarker, and clinical end point. (Diagram: patients receive the Std or Exp treatment; the biomarker is observed as BS or BX, with value − or +, and the corresponding clinical end points are CS−, CS+, CX−, and CX+.)
assume that the biomarker is the realization of a dichotomous variable denoted B (with values denoted + or −), and that the clinical end point is the realization of a continuous variable denoted C. Let subscripts S and X denote patients randomized to receive the standard or the experimental treatment, respectively.

Prognostic and Predictive Biomarkers

A prognostic biomarker affects the clinical outcome regardless of treatment. With the notations of Figure 23.1, this requires establishing that the alternative hypothesis HA1 is true:

HA1: C+ − C− ≠ 0   (Eqn. 23-1)

as opposed to the corresponding null hypothesis that the clinical outcome is the same in the biomarker-negative and biomarker-positive groups. A predictive biomarker modifies the effect of treatment on the clinical outcome, which justifies the more common and preferable appellation outside the oncology field of treatment effect modifier. This requires establishing that the alternative hypothesis HA2 is true:

HA2: (CX+ − CS+) − (CX− − CS−) ≠ 0   (Eqn. 23-2)

as opposed to the corresponding null hypothesis that the effect of treatment on the clinical outcome is the same whether the biomarker is positive or negative. Note that alternative hypothesis HA2 can be rewritten as

HA2: (CX+ − CX−) − (CS+ − CS−) ≠ 0   (Eqn. 23-3)

which also shows that a predictive biomarker is one that has a different prognostic impact in patients
receiving the experimental treatment as compared to patients receiving the standard treatment. The distinction between prognostic and predictive biomarkers is illustrated graphically in Figure 23.2, which shows the mean value of the clinical end point as a function of the biomarker status and the treatment received (3). Figure 23.2 can be interpreted as follows:
· Panel A shows a biomarker that would be of no utility, since its status does not modify patient outcome, regardless of treatment.
· Panel B shows a purely prognostic biomarker, for which the clinical outcome of the patients depends on their biomarker status. In this figure, it has been assumed that the clinical outcome tends to be better when the biomarker is positive than when it is negative. Prognostic biomarkers are quite common: for example, in patients with early breast cancer, lack of nodal involvement is a factor of good prognosis regardless of treatment.
· Panel C shows a purely predictive biomarker, for which the clinical outcome of the patients who receive a certain treatment depends on the biomarker status. In this figure, it has been assumed that the clinical end point tends to be better when the biomarker is positive when patients receive the Exp treatment, but not when they receive the Std treatment. Predictive biomarkers have been known for a long time in patients with early breast cancer, for whom the hormonal receptor status predicts the effect of endocrine therapies such as tamoxifen or aromatase inhibitors. A biomarker can be predictive of efficacy or safety for a given treatment or class of treatments and not for another. For instance, hormonal receptor status is not predictive of the effect of chemotherapy in early breast cancer patients.
FIGURE 23.2
Distinction between prognostic and predictive biomarkers. Patients can be biomarker positive (full line) or negative (dotted line), and can receive a Std treatment or an Exp treatment. (Four panels plot the clinical end point, worse to better, against treatment, Std versus Exp: (A) absence of biomarker; (B) prognostic biomarker; (C) predictive biomarker; (D) prognostic and predictive biomarker.)
· Panel D shows a biomarker that is both prognostic and predictive. In this figure, it has been assumed that the clinical end point tends to be better when the biomarker is positive than when it is negative, but the difference in clinical outcome is larger for patients receiving the Exp treatment than for those receiving the Std treatment. Overexpression of the HER2-neu gene in patients with early breast cancer provides an example of a biomarker that has both prognostic value (patients with HER2-neu overexpression having a worse prognosis) and predictive value for herceptin (patients with HER2-neu overexpression deriving a benefit from this treatment).

Surrogate Biomarkers

A surrogate biomarker must be able to replace a clinical end point for the purposes of evaluating the effect of an Exp treatment as compared with a Std treatment. The requirements for a biomarker to be considered a valid surrogate have been a theme of intense debate in the statistical literature in recent years. In early discussions about surrogate biomarkers, a common misconception was that it was sufficient for a biomarker
to be prognostic for the clinical end point to establish surrogacy (alternative hypothesis HA1 above). It was later argued that the presence of an association between the biomarker and the clinical end point, no matter how strong, does not imply surrogacy. Formal validation criteria were proposed by Prentice (4), with a key criterion being that the effect of treatment on the clinical end point be completely captured by the biomarker. This could potentially be tested by establishing the truth of alternative hypotheses HA3 and HA4:

HA3: CX − CS ≠ 0   (Eqn. 23-4)

which states that treatment has an effect on the clinical end point in the absence of biomarker information, and

HA4: CX+ − CS+ = CX− − CS− = 0   (Eqn. 23-5)

which states that there is no residual effect of treatment on the clinical end point after accounting for the surrogate. Establishing that HA4 is true raises a conceptual problem since it is not possible, in a finite sample, to show the complete absence of an effect (5). Several authors have therefore proposed to replace
hypothesis testing by an assessment of the following associations (6):
· The association between the biomarker (B) and the clinical end point (C). This is called the individual-level association because it is estimated using data on the biomarker and the clinical end point from individual patients. If there is a strong individual-level association, the clinical end point can be reliably estimated from the biomarker in individual patients. Note that the condition of a strong individual-level association is akin to showing HA1 to be true in a hypothesis-testing framework.
· The association between the effects of treatment on the biomarker (BX − BS) and on the clinical end point (CX − CS). This is called the trial-level association because it must be estimated using data from several randomized trials (or other groups of patients receiving either the Std or the Exp treatment). If there is a strong trial-level association, the effect of treatment on the clinical end point can be reliably estimated from the effect of treatment on the biomarker. Note that the condition of a strong trial-level association is akin to (but less strict than) showing HA4 to be true in a hypothesis-testing framework.

To be accepted, a surrogate that has biological plausibility would need to be validated at both the individual level and the trial level. The appropriate statistics to quantify the strength of the associations, both at the individual level and at the trial level, are a matter of debate. Standard correlation coefficients are currently most popular, but new measures derived from information theory have recently been proposed (7). For illustration, Figure 23.3 shows a linear regression line used to predict treatment effects on the true end point from the observed treatment effects on the surrogate biomarker.
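The kind of trial-level regression shown in Figure 23.3 can be sketched numerically. The code below uses invented treatment effects (log hazard ratios) for eight hypothetical trials, and weights each trial simply by its size; this is a simplification of the meta-analytic two-stage or mixed models actually used in surrogacy evaluation.

```python
# Hypothetical trial-level data: (n_patients, effect on surrogate, effect on
# clinical end point), where effects are log hazard ratios (negative = benefit).
trials = [
    (200, -0.50, -0.40), (150, -0.30, -0.22), (300, -0.10, -0.05),
    (250, -0.60, -0.55), (100,  0.05,  0.02), (220, -0.25, -0.20),
    (180, -0.45, -0.33), (260, -0.15, -0.12),
]

# Weighted least squares with weights proportional to trial size.
w = sum(n for n, _, _ in trials)
mx = sum(n * x for n, x, _ in trials) / w
my = sum(n * y for n, _, y in trials) / w
sxx = sum(n * (x - mx) ** 2 for n, x, _ in trials)
sxy = sum(n * (x - mx) * (y - my) for n, x, y in trials)
syy = sum(n * (y - my) ** 2 for n, _, y in trials)

slope = sxy / sxx
intercept = my - slope * mx
r2_trial = sxy ** 2 / (sxx * syy)  # trial-level R^2: quality of surrogate-based prediction

# Predict the clinical-end-point effect for a new trial from its surrogate effect.
predicted = intercept + slope * (-0.35)
print(round(slope, 2), round(r2_trial, 2), round(predicted, 2))
```

A trial-level R² near 1 would mean the surrogate effect is a reliable predictor of the clinical-end-point effect across trials; the numbers here are constructed, not estimates from any real surrogacy analysis.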
It is important to underline that the strength of the association between a biomarker and a clinical end point does not in general provide information about the relationship between the effects of treatment on the biomarker and the clinical end point (i.e., between treatment-induced changes in the biomarker and corresponding changes in the risk of the clinical end point). To see the independence of the two levels of association, consider two hypothetical situations:
· One extreme situation in which a biomarker is perfectly correlated with a clinical end point, but the treatment exerts some of its effects through a pathway that does not involve the biomarker. In this situation, the trial-level correlation could be low if the effect of treatment on the clinical end point was mostly mediated by factors other than the biomarker of interest.
· The other extreme situation in which there is no individual-level correlation between the biomarker and the clinical end point, yet there is a perfect trial-level correlation because of a confounding factor that leads to a spurious correlation between the treatment effects on the biomarker and the clinical end point.

FIGURE 23.3
Correlation between treatment effects on the surrogate biomarker and the true end point across a set of trials (circles). Treatment effects could be, for example, log odds ratios for binary end points or log hazard ratios (HR) for time-to-event end points. The symbol size is proportional to the number of patients in each trial. The full line corresponds to the linear regression line; the dotted lines, to no treatment effect on the biomarker and on the clinical end point.

In most situations of clinical interest, one would not expect to encounter situations as extreme as those just described. The treatment would be expected to shift patients from a high-biomarker-risk group to a low-biomarker-risk group, and the difference in prognosis between the two biomarker-risk groups would be expected to reasonably predict the impact of treatment on the clinical end point of interest. For example, there is some empirical evidence that treatment-induced cholesterol reductions lead to reductions in major cardiovascular events that parallel predictions made from population-based epidemiological studies. It remains to be seen whether such predictions can successfully be identified in oncology.
USES OF BIOMARKERS Biomarkers are already in common use in patient management: prostate-specific antigen (PSA) is used to monitor progression of disease in prostate cancer,
carcinoembryonic antigen (CEA) in colorectal cancer, and so on. This does not automatically imply that these biomarkers are useful for clinical research purposes. In terms of clinical development of new treatments, biomarkers have the potential of increasing the sensitivity (statistical power) of clinical trials and of reducing the time required to carry them out. Specifically, biomarkers can be used to select patients eligible for clinical trials, to stratify patients at entry in clinical trials, to monitor patients and guide treatment decisions, or to substitute for a clinical end point in the evaluation of the effects of new treatments. These purposes are quite different from each other, and imply different conditions for the biomarkers. Prognostic and Predictive Biomarkers Prognostic biomarkers may be used to adjust the therapy for individual patients, with patients of poor prognosis being treated more aggressively than patients of good prognosis. A prognostic biomarker could be particularly helpful (8) for (1) understanding the natural history of the disease; (2) using the prognostic biomarker as a stratification factor in a future clinical trial; (3) conducting a trial in a high-risk population with a new drug, in order to get a rapid readout of the drug’s efficacy. If the drug turns out to have efficacy in high risk patients, it does not automatically follow that it will have efficacy in low risk patients, and therefore a confirmatory trial in low risk patients may still be warranted. Although prognostic biomarkers can be useful, it is mostly predictive biomarkers that have the potential of changing oncology practice in the near future, as the latter allow one to select a particular treatment among different candidates for a group of patients on the basis of their biomarker values. 
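The distinction between prognostic and predictive biomarkers can be illustrated with a small simulation (Python; all effect sizes and the outcome scale below are arbitrary assumptions). A prognostic marker shifts outcomes regardless of treatment, whereas a predictive marker changes the size of the treatment effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000  # synthetic patients

# One prognostic marker (worsens the risk score in both arms) and one
# predictive marker (modifies the treatment effect only); all effect
# sizes are arbitrary assumptions. Lower risk_score is better, so a
# negative treatment effect means benefit from treatment.
treated = rng.integers(0, 2, n)
prognostic = rng.integers(0, 2, n)
predictive = rng.integers(0, 2, n)
risk_score = (0.8 * prognostic                 # prognostic: shifts outcome in both arms
              - 0.6 * treated * predictive    # predictive: benefit only when marker +
              + rng.normal(0.0, 1.0, n))

def effect(mask):
    """Average treatment effect (treated minus control) in a subgroup."""
    return (risk_score[mask & (treated == 1)].mean()
            - risk_score[mask & (treated == 0)].mean())

print(f"effect when predictive marker +: {effect(predictive == 1):+.2f}")  # clear benefit
print(f"effect when predictive marker -: {effect(predictive == 0):+.2f}")  # roughly none
print(f"effect when prognostic marker +: {effect(prognostic == 1):+.2f}")  # similar in
print(f"effect when prognostic marker -: {effect(prognostic == 0):+.2f}")  # both strata
```

The prognostic marker separates good-risk from poor-risk patients but leaves the treatment effect unchanged across its strata; only the predictive marker identifies who benefits from treatment.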
The presence of unknown predictive factors in a patient population can profoundly affect the statistical power of trials aimed at showing the benefit of new treatments for these patients (9). Conversely, if a biomarker could reliably identify a subset of patients who derive the most benefit (or the least toxicity) from a new treatment, then clinical trials could be restricted to this subset, and the treatment of future patients could be targeted at this subpopulation only. For example, it has recently been established that KRAS mutation is a predictive factor for the effect of anti-epidermal growth factor receptor (EGFR) monoclonal antibodies such as cetuximab or panitumumab for patients with advanced colorectal cancer (10–13). No responses were seen in several series of patients with mutant KRAS tumors, and there is now compelling evidence that only patients with wild-type KRAS tumors should be treated with EGFR-targeted agents.
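The impact of an unknown predictive factor on statistical power can be shown by simulation. The sketch below (Python, with arbitrary assumed effect sizes and sample sizes, not taken from the trials cited above) shows how the power of a conventional two-arm comparison collapses as the fraction of patients who actually benefit shrinks:

```python
import numpy as np

rng = np.random.default_rng(42)

def power_two_sample(benefit_fraction, n_per_arm=200, effect=0.5, n_sim=2000):
    """Empirical power of a two-sample z-test when only an (unknown)
    fraction of treated patients actually benefits. The outcome is a
    continuous score; 'effect' is the benefit, in standard-deviation
    units, among the responsive subset (assumed values)."""
    hits = 0
    for _ in range(n_sim):
        control = rng.normal(0.0, 1.0, n_per_arm)
        responsive = rng.random(n_per_arm) < benefit_fraction
        treated = rng.normal(0.0, 1.0, n_per_arm) + effect * responsive
        z = (treated.mean() - control.mean()) / np.sqrt(
            treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
        hits += z > 1.96  # one-sided 2.5% test in favor of treatment
    return hits / n_sim

for frac in (1.0, 0.5, 0.25):
    print(f"fraction benefiting {frac:.2f}: power ~ {power_two_sample(frac):.2f}")
```

With these assumed numbers, a trial that is essentially certain to detect a uniform benefit retains only modest power when half the patients benefit, and very little when a quarter do, which is exactly the motivation for restricting trials to a biomarker-defined subset when a reliable predictive biomarker exists.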
Surrogate Biomarkers The search for prognostic or predictive biomarkers has been intense over recent years, but the real long-term hope is to identify a surrogate biomarker, i.e., a biomarker capable, in and of itself, of predicting that a new treatment will have a desired effect on the ultimate end point of interest. Currently, no such biomarkers have been identified in oncology. One of the most intensely studied cancer biomarkers is PSA, a glycoprotein found almost exclusively in normal and neoplastic prostate cells. Changes in PSA often antedate changes in bone scan, and they have long been used as an indicator of response in patients with androgen-independent prostate cancer (14, 15). Notwithstanding the usefulness of PSA for patient management, it was recently shown that PSA could not be considered a surrogate for long-term clinical end points (16, 17). Similarly, tumor shrinkage was shown not to be an acceptable surrogate end point for overall survival in advanced colorectal cancer (18). Even so, biomarkers (such as tumor response) have often been used for drug approval in situations where the number of patients and/or the follow-up required to show the treatment effect on the long-term clinical end points would have led to unrealistic trials. IDENTIFICATION AND VALIDATION OF BIOMARKERS The clinical literature is replete with examples of biomarkers that have not been properly identified and validated, resulting in a large number of false claims for putative biomarkers that turn out to be of limited usefulness either for patient management or for assessing the efficacy and safety of new therapies (19). The REMARK publication provides useful guidance for the conduct and reporting of prognostic biomarker studies (20). Basic methodological requirements for the identification and use of biomarkers are that their measurements be unbiased, accurate, standardized, and reproducible (21, 22). Biomarkers must be validated for their intended clinical use (3).
Finally, fully validated biomarkers must fulfill the statistical criteria outlined above. Table 23.1 shows the three categories of biomarkers discussed above, and the minimum requirements for the statistical validation of biomarkers in each category.

TABLE 23-1
Different Types of Biomarkers and Minimum Requirements for Their Statistical Validation

Prognostic biomarker
  Key feature: biomarker predicts clinical outcome
  Study design: case-control or cohort study
  Sample size: > 100 patients

Predictive biomarker
  Key feature: biomarker predicts treatment effect on clinical outcome
  Study design: large randomized trial
  Sample size: > 500 patients

Surrogate biomarker
  Key feature: treatment effect on biomarker predicts treatment effect on clinical outcome
  Study design: several randomized trials, or a large trial with several units of analysis (e.g., countries)
  Sample size: > 10 units of analysis; > 1,000 patients

Prognostic Biomarkers
For a prognostic biomarker, the baseline value of the biomarker, or changes in the biomarker over time, should be correlated with the clinical end point in untreated or in treated patients. This condition is straightforward to establish statistically, and does not require any particular study design: the prognostic impact of a biomarker can be investigated retrospectively in any series of patients provided a sufficient number of clinical events are available. In addition, a prognostic biomarker will be of clinical interest only if its impact on the clinical end point of interest is large enough; hence the number of patients required will often be smaller than that required to establish or confirm a treatment benefit. Nowadays, many research efforts in clinical oncology aim at deriving prognostic biomarkers using gene expression profiles of RNAs extracted from retrospective tumor tissue collections (23). One such example of a prognostic microarray biomarker in early breast cancer is a 70-gene profile or signature used to assess a patient's risk for distant metastasis, which was originally developed on a retrospective series of only 78 patients, and was later validated on a larger, independent series (24). The ideal setting to identify and validate a prognostic biomarker is a randomized therapeutic trial, in which all procedures to measure the biomarker are specified in the protocol (biological material collected, methods of preservation, assays and quantification methods used, quality control procedures, reproducibility assessments, etc.) and in which all patients fulfill well-defined characteristics and are treated and followed according to a predetermined schedule. One issue that has not received sufficient attention is the choice of appropriate statistic(s) to quantify the impact of a prognostic biomarker on the clinical outcome of interest. The p-value of a test comparing the clinical outcome in biomarker positive vs. biomarker
negative patients provides insufficient evidence that the biomarker is of any clinical use. Indeed, the p-value depends on the sample size, so that in large series of patients, factors having a small (perhaps negligible) impact on patient prognosis may well reach statistical significance. Measures that quantify the magnitude of effect of the biomarker on the clinical end point, such as the odds ratio (for tumor response and other dichotomous end points) or the hazard ratio (for survival time and other time-related end points), along with their confidence limits, are far more informative. However, these statistics are not sufficient to gauge the performance of the prognostic biomarker (25). Measures of predictive accuracy are needed as well, and these are usually provided by the sensitivity and specificity of the biomarker, or equivalently by its positive and negative predictive values. More specific statistical measures of predictive accuracy, such as Receiver Operating Characteristic (ROC) curves and measures of explained variation, can appropriately be used. A biomarker is of interest only if it provides additional prognostic value, over and above that of all easily measured clinical and pathological characteristics of the patients. The gain in predictive accuracy of the classifier as compared to established clinical prognostic factors should therefore be quantified (24, 26, 27). Predictive Biomarkers For a predictive biomarker, the baseline value of the biomarker, or changes in the biomarker over time, should be correlated with the effect of treatment on the clinical end point of interest. This condition is far more difficult to establish than a mere prognostic impact of the biomarker on the end point. It is a common
misconception that a biomarker having prognostic value in a group of treated patients is predictive, even if the biomarker does not have prognostic value in another group of untreated patients. Indeed, the lack of impact in one group and the impact in another may be due to different selection criteria and other confounding factors that make such an indirect comparison untenable. The most reliable way to formally identify and validate a predictive factor is through a comparative trial as shown in Figure 23.1, where patients are randomized either to a Std treatment or to an Exp treatment for which a predictive biomarker is thought to exist, since such a design makes treated and untreated patients comparable. The statistical evidence required to establish that a biomarker is truly predictive is an open question. The most convincing situation would be one in which an interaction test between the effect of treatment and the biomarker status reaches statistical significance in a particular model. If the null hypothesis that the effect of treatment is the same whether the biomarker is positive or negative is rejected, one can safely conclude that the biomarker status modulates the effect of treatment—which is exactly equivalent to concluding that the biomarker is predictive. The major problem of interaction tests is that they lack statistical power, so that very large trials would be required for interaction tests to reach significance. As a rule of thumb, the sample size required to show an interaction is at least four times larger than that required to detect main effects of the same magnitude (28). The absence of treatment effect in a particular subgroup may sometimes be sufficiently convincing even if the interaction test fails to provide formal confirmation of this observation. Often biological considerations will alleviate the need for definite statistical evidence. 
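The four-fold rule of thumb quoted above for interaction tests can be checked with a short simulation. In the sketch below (Python, synthetic normally distributed outcomes with unit standard deviation, arbitrary cell sizes), the standard error of the interaction estimate in a 2x2 treatment-by-biomarker layout is compared with that of the pooled main effect of treatment:

```python
import numpy as np

rng = np.random.default_rng(1)

# Four cells of a 2x2 layout: treatment (Std/Exp) x biomarker (-/+),
# with n patients per cell and an outcome of unit standard deviation.
def estimate_ses(n_per_cell=100, n_sim=4000):
    main, inter = [], []
    for _ in range(n_sim):
        std_neg, std_pos, exp_neg, exp_pos = \
            rng.normal(0, 1, (4, n_per_cell)).mean(axis=1)
        # Main effect of treatment, pooled over biomarker strata.
        main.append((exp_neg + exp_pos) / 2 - (std_neg + std_pos) / 2)
        # Interaction: treatment effect in biomarker+ minus in biomarker-.
        inter.append((exp_pos - std_pos) - (exp_neg - std_neg))
    return np.std(main), np.std(inter)

se_main, se_inter = estimate_ses()
print(f"SE(main effect)  ~ {se_main:.3f}")
print(f"SE(interaction)  ~ {se_inter:.3f}")
# The interaction SE is about twice the main-effect SE; since required
# sample size scales with the square of the SE, detecting an interaction
# of the same magnitude needs roughly four times as many patients (28).
print(f"variance ratio   ~ {(se_inter / se_main) ** 2:.1f}")
```

The factor of four arises because the interaction is a difference of differences, accumulating the sampling variance of all four cells rather than two pooled arms.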
For instance, it is acceptable to consider HER2-neu status as a predictive biomarker for trastuzumab (Herceptin), even though no study was carried out to formally test for an interaction between the biomarker and the targeted agent. Surrogate Biomarkers As discussed above, for a surrogate to be valid, it must show a strong association with the true end point at both the individual level (association between biomarker and clinical end point) and the trial level (association between treatment effects on biomarker and clinical end point). The best theoretical setting to test both conditions consists of several large-scale randomized clinical trials in the context of a meta-analysis, or a large randomized trial that can be broken down into smaller, clinically meaningful units, such as countries or clinical sites.
Meta-analyses have been used, for example, to show that disease-free survival (DFS) is an acceptable surrogate end point for overall survival (OS) in resectable colorectal cancer (29), and progression-free survival (PFS) in advanced (metastatic) colorectal cancer (30). Admittedly, DFS or PFS are clinical end points and not biomarkers, but the statistical issues in validating potential surrogate biomarkers are exactly the same as in validating surrogate end points (31). One must bear in mind that if a biomarker has been validated as a surrogate end point for a particular class of treatments in a given disease, this does not automatically carry over to another treatment or disease because the mechanism of action of the treatments or the natural course of the diseases may differ. In the past, many important lessons have been learned about risks involved in accepting surrogate end points (32). Even commonly used surrogate end points are unlikely to capture all the therapeutic benefits and potential adverse effects or safety issues a drug will have in a diverse patient population. In order for a surrogate biomarker to be used in the clinical development of a new drug, surrogacy would have to be demonstrated across a range of treatments, for instance using data from randomized trials testing multiple drugs, or one new drug compared with multiple established treatments. This again suggests recourse to a meta-analysis of several randomized trials (33).
BIOMARKER-BASED TRIAL DESIGNS This section briefly discusses trial designs appropriate to use and validate the clinical usefulness of prognostic and predictive biomarkers available at the start of the trial. Prognostic Biomarkers When a prognostic biomarker has been identified, it can be used to drive treatment decisions in a randomized fashion. For instance, it is typical for patients with low risk to require only standard therapy (possibly watchful follow-up), and for patients with high risk to require experimental therapy. However, for patients with intermediate risk (based on the biomarker and/or clinicopathological factors), uncertainty may prevail as to whether these patients should be treated aggressively or not; hence these patients could be randomized to either Std or Exp treatment as shown in Figure 23.4 (panel A). Two examples of such randomized trials in early breast cancer are currently being conducted in order to demonstrate the clinical usefulness of a prognostic gene profile.
FIGURE 23.4 Clinical trial designs for prospective biomarker validation.
(A) Randomized design for intermediate risk: patients with low risk require only standard therapy, patients with high risk require experimental therapy, and patients with intermediate risk are randomized to either standard or experimental treatment.
(B) Stratified randomized design: patients are stratified according to biomarker status (positive or negative) and randomized to either standard or experimental treatment.
(C) Biomarker-based strategy design: patients are randomly assigned to have their treatment determined by their biomarker status or to receive standard treatment.
(D) Modified biomarker-based strategy design: patients are randomly assigned to have their treatment determined by their biomarker status or to be randomized again to either standard or experimental treatment.
(E) Biomarker-based follow-up design: patients are randomized to routine follow-up with standard treatment or to biomarker follow-up, with treatment switched upon a change in the biomarker.
The first example is provided by the MINDACT trial (Microarray In Node-negative Disease May Avoid Chemotherapy Trial), a clinical trial for women with early breast cancer carried out under the auspices of the European Organization for Research and Treatment of Cancer (EORTC) (34). The key question in this trial is whether the biomarker in question (the 70-gene profile) has better prognostic value than the clinical and pathological characteristics of the patients, and as such should be used to identify high risk patients who need chemotherapy (assumed here to be the Exp treatment). The trial design consists of randomizing patients whose
risk classification is discrepant, whether it is based on traditional clinicopathological characteristics or on the biomarker, to chemotherapy or not. Patients whose risk assessment is the same using traditional clinicopathological characteristics as with the biomarker receive the treatment adapted to their risk as done in current clinical practice. A similar trial known as TAILORx (Trial Assigning Individualized Options for Treatment[Rx]) is carried out in the same patient population under the auspices of the U.S. National Cancer Institute’s Program for the Assessment of Clinical Cancer Tests to
demonstrate the clinical usefulness of the 21-gene signature Oncotype DX (35). In this trial, patients with an intermediate risk score as determined by the biomarker (21-gene profile) are randomized to receive chemotherapy or not. Patients whose biomarker-risk assessment is low or high receive standard-of-care treatment. The primary objective of this trial is to show formal noninferiority of the intermediate biomarker-risk patients not treated with chemotherapy as compared with those treated. Predictive Biomarkers In practice, prognostic biomarkers are often validated in the context of a randomized trial, in which case their predictive ability can also be investigated. Assuming a randomized trial is contemplated to compare an Exp treatment with a Std one, the simplest design randomizes all patients to receive either treatment without taking any account of the biomarker (a "completely randomized design"), or after stratification for the biomarker (a "stratified randomized design"), as shown in Figure 23.4 (panel B). The latter design is almost always preferable to the former, since it balances treatment allocation in patients with biomarker positive and negative values, but neither design uses the biomarker status to modify the treatment allocation. An imbalanced allocation could be used in order to assign a larger proportion of patients to either treatment depending on the biomarker status. At the cost of a small power loss, such an imbalanced allocation would assign more biomarker-positive patients to the experimental group (thought to be beneficial for these patients) and more biomarker-negative patients to the standard group. An example of a stratified randomized trial is the p53 trial, run by the Breast International Group and the EORTC, and powered to test the biological hypothesis that p53-mutated tumors (biomarker positive) are resistant to anthracyclines but retain sensitivity to taxanes (36).
In the subset of biomarker positive patients, the benefit in DFS from taxanes is expected to be four times larger than that for the biomarker negative patients. Typically, when developing a new therapy (e.g., a therapy targeting the action of a specific gene) one does not know in advance whether or not the therapy will be effective in the entire population or only in a particular subgroup defined by positivity of a specific candidate predictive biomarker. One could design a trial to test the null hypothesis that there is no treatment effect both in the overall population and in the subset of biomarker positive patients (37). This design provides a win-win scenario for a pharmaceutical company when a predictive biomarker is suspected
but one would first want to test the drug in the overall population. This design will need far fewer patients than the stratified randomized design, but will not have the statistical power to formally test an interaction between the candidate predictive biomarker and the treatment effect. For biomarkers that are already well documented but still await clinical validation, various more complex designs have been suggested. These designs aim at validating the biomarker while taking advantage of its predictive character, thus maximizing the potential for patient benefit. The first design, shown in Figure 23.4 (panel C), is called the "biomarker-based strategy design." Patients are randomly assigned to have their treatment determined by their biomarker status or to receive treatment independent of their biomarker status (38, 39). In this design, patients in the biomarker-based group receive the Exp treatment if they are biomarker positive and the Std treatment otherwise, while all patients in the nonbiomarker-based group receive the Std treatment (i.e., the treatment they would have received outside of the trial). This design suffers from the fact that the biomarker-based group will do better if the Exp treatment has an effect in all patients, regardless of biomarker status. The treatment effect comparison is also diluted because the biomarker-negative patients from the nonbiomarker-based group get exactly the same treatment as the biomarker-negative patients in the biomarker-based group. Another design called the "modified biomarker-based strategy design" reduces this problem by randomizing patients in the nonbiomarker-based group to receive either the Exp treatment or the Std treatment, in a ratio equal to the ratio of the incidences of biomarker-positive and biomarker-negative patients (38, 39). This design is shown in Figure 23.4 (panel D).
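The dilution in the biomarker-based strategy design can be quantified with hypothetical numbers (the prevalence and effect size below are illustrative assumptions, not estimates from any trial). If Exp benefits only biomarker-positive patients, the expected between-arm difference is the within-subset benefit multiplied by the biomarker prevalence:

```python
# Hypothetical numbers quantifying the dilution in the biomarker-based
# strategy design (Figure 23.4, panel C): only biomarker-positive
# patients receive Exp in the strategy arm, so any benefit is diluted
# by the biomarker prevalence.
prevalence = 0.3  # assumed fraction of biomarker-positive patients
delta = 0.5       # assumed benefit of Exp over Std among biomarker-positives,
                  # in standard-deviation units

observed = prevalence * delta  # expected arm difference in the strategy design
print(f"effect seen in strategy design: {observed:.2f} "
      f"(vs {delta:.2f} in a trial restricted to biomarker-positives)")

# Required sample size scales with 1/effect**2, so the inflation is:
inflation = (delta / observed) ** 2
print(f"sample-size inflation factor: {inflation:.1f}x")
```

With a prevalence of 30%, the observable effect shrinks to 30% of the subset benefit, and the sample size needed to detect it grows by roughly an order of magnitude, which is why the modified design of panel D, or a design restricted to biomarker-positive patients, is often preferred.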
If one is studying a biomarker whose measures can evolve over time, the “biomarker-based follow-up design” could be of interest (panel E): in this design, patients are randomized to routine follow-up with Std treatment or to biomarker follow-up with treatment switched upon a change in the biomarker. The biomarker needs to be measured repeatedly over time in this setting. Biomarker-Adaptive Designs An adaptive design allows modifications to some aspects of the trial design after its initiation without introducing bias or undermining its validity and integrity. It can be expected that adaptive designs will increasingly be used to take advantage of biomarker knowledge that emerges during patient accrual, so as
to facilitate a rapid and efficient evaluation of new cancer treatments. A biomarker-adaptive design could be used to (1) select the right patient population (e.g., an enrichment process for selection of a better target patient population), (2) identify the natural course of disease, (3) detect disease progression earlier, and (4) help in developing personalized medicine (40). Such a design could be particularly useful when a predictive biomarker is not known at the start of the trial. Recently, a two-stage randomized design has been proposed to identify a predictive gene expression profile and to validate it in a single prospective trial (41). In the first stage, a gene expression profile is identified that predicts, through an interaction test, whether a patient is more likely to benefit from the Exp treatment than from the Std one. The gene expression profile is then prospectively applied to identify the subset of sensitive patients among the patients accrued in the second stage, rather than to restrict the entry of second-stage patients. The final analysis of the trial consists of a comparison of the Exp treatment with the Std treatment in the whole trial population, as well as in the subset of sensitive second-stage patients (41), with proper adjustment of the significance level of each test to keep the overall significance level under an acceptable value such as 0.05. CONCLUSION The number of candidate biomarkers is increasing dramatically, as a result of better molecular characterization of tumors and the advent of genomic technologies. Several biomarkers are getting close to introduction into daily oncology practice or into trial designs for the evaluation of new drugs. The road to implementation of biomarkers is steep, and each candidate biomarker needs to be rigorously evaluated for its intended use, depending on whether it is expected to be a prognostic or a predictive biomarker, or even (in the best case) a potential surrogate end point for a long-term clinical outcome.
References
1. Biomarker Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther. 2001 Mar;69(3):89–95.
2. Temple RJ. A regulatory authority's opinion about surrogate endpoints. In: Nimmo WS, Tucker GT, eds. Clinical Measurement in Drug Evaluation. New York: Wiley & Sons; 1995:3–22.
3. Hayes DF, Trock B, Harris AL. Assessing the clinical impact of prognostic factors: when is "statistically significant" clinically useful? Breast Cancer Res Treat. 1998;52(1–3):305–319.
4. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989 Apr;8(4):431–440.
5. Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics. 1998 Sep;54(3):1014–1029.
6. Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000 Mar;1(1):49–67.
7. Alonso A, Molenberghs G, Geys H, Buyse M, Vangeneugden T. A unifying approach for surrogate marker validation based on Prentice's criteria. Stat Med. 2006 Jan 30;25(2):205–221.
8. Windeler J. Prognosis—what does the clinician associate with this notion? Stat Med. 2000 Feb 29;19(4):425–430.
9. Betensky RA, Louis DN, Cairncross JG. Influence of unrecognized molecular heterogeneity on randomized clinical trials. J Clin Oncol. 2002 May 15;20(10):2495–2499.
10. De Roock W, Piessevaux H, De Schutter J, et al. KRAS wild-type state predicts survival and is associated to early radiological response in metastatic colorectal cancer treated with cetuximab. Ann Oncol. 2008;19:508–515.
11. Di Fiore F, Blanchard F, Charbonnier F, et al. Clinical relevance of KRAS mutation detection in metastatic colorectal cancer treated by cetuximab plus chemotherapy. Br J Cancer. 2007;96:1166–1169.
12. Lièvre A, Bachet JB, Boige V, et al. KRAS mutations as an independent prognostic factor in patients with advanced colorectal cancer treated with cetuximab. J Clin Oncol. 2008 Jun 23;26:374–379.
13. Amado RG, Wolf M, Peeters M, et al. Wild-type KRAS is required for panitumumab efficacy in patients with metastatic colorectal cancer. J Clin Oncol. 2008;26:1626–1634.
14. Smith DC, Dunn RL, Strawderman MS, Pienta KJ. Change in serum prostate-specific antigen as a marker of response to cytotoxic therapy for hormone-refractory prostate cancer. J Clin Oncol. 1998 May;16(5):1835–1843.
15. Sridhara R, Eisenberger MA, Sinibaldi VJ, Reyno LM, Egorin MJ. Evaluation of prostate-specific antigen as a surrogate marker for response of hormone-refractory prostate cancer to suramin therapy. J Clin Oncol. 1995 Dec;13(12):2944–2953.
16. Collette L, Burzykowski T, Carroll KJ, Newling D, Morris T, Schroder FH. Is prostate-specific antigen a valid surrogate end point for survival in hormonally treated patients with metastatic prostate cancer? Joint research of the European Organisation for Research and Treatment of Cancer, the Limburgs Universitair Centrum, and AstraZeneca Pharmaceuticals. J Clin Oncol. 2005 Sep 1;23(25):6139–6148.
17. Bloom JC, Dean RA. Biomarkers in Clinical Drug Development. New York: Marcel Dekker; 2003.
18. Buyse M, Thirion P, Carlson RW, Burzykowski T, Molenberghs G, Piedbois P. Relation between tumour response to first-line chemotherapy and survival in advanced colorectal cancer: a meta-analysis. Meta-Analysis Group in Cancer. Lancet. 2000 Jul 29;356(9227):373–378.
19. Hammond ME, Taube SE. Issues and barriers to development of clinically useful tumor markers: a development pathway proposal. Semin Oncol. 2002 Jun;29(3):213–221.
20. McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM. Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst. 2005 Aug 17;97(16):1180–1184.
21. Ransohoff DF. How to improve reliability and efficiency of research about molecular markers: roles of phases, guidelines, and study design. J Clin Epidemiol. 2007 Dec;60(12):1205–1219.
22. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer. 2004 Apr;4(4):309–314.
23. Michiels S, Koscielny S, Hill C. Interpretation of microarray data in cancer. Br J Cancer. 2007 Apr 23;96(8):1155–1158.
24. Buyse M, Loi S, van't Veer L, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst. 2006 Sep 6;98(17):1183–1192.
25. Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004 May 1;159(9):882–890.
26. Dunkler D, Michiels S, Schemper M. Gene expression profiling: does it add predictive accuracy to clinical characteristics in cancer prognosis? Eur J Cancer. 2007 Mar;43(4):745–751.
27. Kattan MW. Judging new markers by their ability to improve predictive accuracy. J Natl Cancer Inst. 2003 May 7;95(9):634–635.
28. Peterson B, George SL. Sample-size requirements and length of study for testing interaction in a 2 x k factorial design when time-to-failure is the outcome. Cont Clin Trials. 1993;14(6):511–522.
29. Sargent DJ, Wieand HS, Haller DG, et al. Disease-free survival versus overall survival as a primary end point for adjuvant colon cancer studies: individual patient data from 20,898 patients on 18 randomized trials. J Clin Oncol. 2005 Dec 1;23(34):8664–8670.
30. Buyse M, Burzykowski T, Carroll K, et al. Progression-free survival is a surrogate for survival in advanced colorectal cancer. J Clin Oncol. 2007 Nov 20;25(33):5218–5224.
31. Burzykowski T, Molenberghs G, Buyse M. The Evaluation of Surrogate Endpoints. Statistics for Biology and Health. New York: Springer; 2005.
32. Lesko LJ, Atkinson AJ Jr. Use of biomarkers and surrogate endpoints in drug development and regulatory decision making: criteria, validation, strategies. Annu Rev Pharmacol Toxicol. 2001;41:347–366.
33. Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Stat Med. 1997 Sep 15;16(17):1965–1982.
34. Bogaerts J, Cardoso F, Buyse M, et al. Gene signature evaluation as a prognostic tool: challenges in the design of the MINDACT trial. Nat Clin Pract Oncol. 2006 Oct;3(10):540–551.
35. Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004 Dec 30;351(27):2817–2826.
36. Therasse P, Carbonnelle S, Bogaerts J. Clinical trials design and treatment tailoring: general principles applied to breast cancer research. Crit Rev Oncol Hematol. 2006 Aug;59(2):98–105.
37. Wang SJ, O'Neill RT, Hung HM. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharm Stat. 2007 Jul–Sep;6(3):227–244.
38. Sargent D, Allegra C. Issues in clinical trial design for tumor marker studies. Semin Oncol. 2002 Jun;29(3):222–230.
39. Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. J Clin Oncol. 2005 Mar 20;23(9):2020–2027.
40. Chow SC, Chang M. Adaptive design methods in clinical trials—a review. Orphanet J Rare Dis. 2008;3:11.
41. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clin Cancer Res. 2005 Nov 1;11(21):7872–7878.
This page intentionally left blank
24
Use of Genomics in Therapeutic Clinical Trials
Richard Simon
Effective phase III therapeutic clinical trials ask important questions and get reliable answers. The key components include having: (i) a new treatment with a strong biological basis, (ii) an appropriate control regimen, (iii) a well-defined target population of patients, (iv) randomized assignment of patients to the new treatment or control groups, (v) an endpoint that reflects patient welfare, (vi) a focused analysis plan that limits the chance of false positive conclusions, and (vii) an adequate sample size that limits the chance of false negative conclusions. Developments in cancer genomics are having a major impact on all of these key features.
MOLECULARLY TARGETED THERAPY

Most new oncology drugs are intended to inhibit defined molecular targets. This targeting is the basis of most translational research—translating a basic research finding of the importance of a gene in oncogenesis or tumor pathogenesis into a drug that inhibits the deregulated activity of that gene. The pharmaceutical and biotechnology industries are often adept at developing inhibitors of over-expressed genes; it is difficult, however, to distinguish genes that are of key importance for tumor development and progression from those whose activation represents secondary effects. The rate-limiting steps to accelerating the development of curative cancer treatments are the identification of
the key molecular targets, the development of methods of individualizing the right treatment for the right patient, and improving detection strategies to initiate treatment early. In spite of the advances in tumor biology, identification of the key oncogenic mutations and accurate mapping of the structure and dynamics of mammalian signaling pathways remain quite imperfect. Although the complexity and difficulty of clinical translational research has frequently been emphasized, the success of translational research is often limited by the lack of knowledge of tumor biology to translate (1). In developing an inhibitor of a deregulated oncogene, the key principles of phase III clinical trial design apply. The development of molecularly targeted drugs not only provides improved opportunities for success and patient benefit, but also poses many new challenges. Two of the most important findings of the past decades of research in tumor biology are the near universality of intratumor and intertumor genomic heterogeneity. Intratumor heterogeneity indicates the importance of targeting key oncogenic mutations that dominate all components of the tumor and of treating early. Intertumor heterogeneity suggests that a molecular target will only be important for some of the tumors selected based on conventional criteria such as primary site, stage, and histopathology. Most patients who receive conventional chemotherapy do not benefit. Because cancer is life-threatening and because it has not been possible to identify which patients will benefit from
a treatment, oncologists have had to be content with treating the many for the benefit of the few. With molecularly targeted drugs, the minority may be even smaller. One of the important challenges in the development of such drugs is the development of companion diagnostics to identify the patients whose tumors are driven by deregulated pathways that are targeted by the drug, and to evaluate these diagnostics in a prospectively planned, reliable manner, not in the post hoc way that correlative studies are often conducted.
PHASE I STUDIES

Phase III trials should reliably determine whether a new treatment provides medical benefit, relative to a control regimen, for a prospectively defined patient population. The purpose of phase I and phase II trials is to determine whether a new treatment warrants a phase III evaluation and to provide the information needed to design an effective phase III trial. The purpose of a phase I oncology trial has generally been to identify a dose schedule and mode of administration to be used in conducting later trials. Since most cancer drugs are toxic to normal cells and since there is little evidence that lower doses inhibit cancer cells more than higher doses, phase I trials have been based on attempting to find a maximum tolerated dose (MTD) for repeated administration. A variety of clinical trial designs have been used for this purpose (2, 3). Most molecularly targeted drugs are also toxic to normal mammalian cells. Although in the future cancer drug development may focus more on drugs that are specific for the mutated forms of oncogenes, today’s drugs are rarely so specific. It is difficult to design highly specific kinase inhibitors, and some companies seem to prefer drugs with multiple targets. Consequently, standard phase I trial designs developed previously are widely used for molecularly targeted drugs (4). The assumption that a dose below the MTD is unlikely to be more effective than the MTD seems more tenuous for drugs that modulate key signaling pathways, but it is difficult to perform phase I trials in which the pharmacodynamic effects of a drug are adequately studied as a function of dose. Such studies require the development of sensitive and reproducible assays as well as adequate numbers of patients whose tumors can be biopsied before and after drug administration at each dose level studied. The phase 0 study concept is an attempt to perform a limited proof-of-concept evaluation at a very early stage of clinical development.
The study is performed at a very low dose even before preclinical toxicology is completed to determine whether the drug is inhibiting its intended target (5).
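The chapter cites a variety of MTD-finding designs (2, 3) without detailing any of them. As one concrete, widely used example, the classic 3+3 escalation rule can be sketched as below; this specific rule is an illustrative assumption, not a design endorsed by the text:

```python
def next_action(n_treated: int, n_dlt: int) -> str:
    """Classic 3+3 decision after observing n_dlt dose-limiting
    toxicities (DLTs) among n_treated patients at the current dose level."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"   # 0/3 DLTs: move to the next higher dose
        if n_dlt == 1:
            return "expand"     # 1/3 DLTs: treat 3 more at the same dose
        return "stop"           # >=2/3 DLTs: MTD is the previous dose
    if n_treated == 6:
        # after the expanded cohort: <=1/6 DLTs permits escalation
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("3+3 cohorts are evaluated at 3 or 6 patients")
```

For example, `next_action(3, 1)` returns `"expand"`, reflecting the ambiguous 1/3 outcome that triggers an additional three-patient cohort.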
An alternative approach to early clinical development is: (i) first conduct a phase I trial to determine an MTD suitable for repeated drug administration; (ii) then conduct a two-arm randomized study evaluating pharmacodynamic end points at two dose levels—one just below the MTD and the other at a lower level. It may be more appropriate to think of such a study as a phase II study, because a larger number of patients will be required than the three patients per dose level typical of phase I studies and because the patients should all have the type of tumor being targeted by the drug. Use of surrogate tissue introduces additional complexities of interpretation. The requirements of sensitive and reproducible assays and the availability of patients with accessible tissue remain formidable challenges for such studies. For some types of cancer, the most suitable patient population for these pharmacodynamic studies may be patients who can be treated with one or two doses of the drug in the neoadjuvant setting.
PHASE II STUDIES

The effectiveness of phase II trials in oncology has often been problematic. Phase II studies have typically been single-arm studies that evaluate a new drug as a single agent or in combination with established drugs for patients with advanced metastatic disease. Single-agent phase II studies generally attempt to determine whether the drug can cause tumor shrinkage that qualifies as a partial response or better. Since tumors rarely shrink on their own, it is feasible to use a single-arm design to determine whether a new single agent can cause partial responses in a relatively small number of patients with advanced metastatic disease (6). As a screen for further development of the drug, however, such a study is imperfect. Drugs that cause some partial responses when given as single agents may not contribute to the survival of patients when administered with standard drugs. Drugs that do not cause partial responses in patients with advanced metastatic disease might be effective for patients with earlier stages of cancer. Also, focusing on the overall response rate for tumors of a primary site is less acceptable for molecularly targeted drugs because of the heterogeneity of the tumors occurring in many primary sites. Ratain points out that phase II oncology trials of standard agents have a high negative predictive value but a limited positive predictive value (7). If, however, both the positive and negative predictive values were high, there would be little need for phase III trials. Single-arm phase II studies of a new drug given in combination with standard agents are particularly problematic because it is difficult to reliably determine
whether the new drug adds to the antitumor effect of the standard agents without an adequate control group. It is questionable whether single-arm phase II studies of drug combinations are worthwhile, except in cases where there is a body of data on outcomes for the standard agents that can be used to construct a control group with adequate adjustment for prognostic factors and for which the schedules and methods of outcome assessment are adequately standardized (8). Unfortunately, phase II studies are rarely conducted in that manner. Because molecularly targeted drugs often function by inhibiting deregulated pathways rather than by directly causing DNA damage, alternatives to tumor regression have been sought (9, 10). Ratain points out several well-documented examples of drugs that were demonstrated to delay tumor progression in randomized clinical trials but that had very low objective response rates; for example, sorafenib and bevacizumab in renal cell cancer. The argument for using a less stringent threshold for activity should perhaps be limited to drugs that are sufficiently nontoxic to normal cells that they could be administered for long periods of time to hold the tumor in check. Even so, such a treatment strategy may be likely to select for resistant tumor clones. In experimental tumor systems, inhibiting a deregulated oncogene to which the tumor is addicted leads (in some cases) to apoptosis of the cell and shrinkage of the tumor. Human metastatic tumors, however, may rarely be addicted to single oncogenes, and a drug that is cytostatic in metastatic disease may have much greater effects in earlier-stage patients. Reliably using tumor growth stabilization or growth-rate reduction as a phase II end point generally requires a randomized control arm (9, 11, 12). Whereas partial response can generally be attributed to drug effect, the same is not true for stabilization of growth rate.
The process of tumor measurement is imprecise, and tumors grow at widely varying rates. To meaningfully use tumor stabilization as an end point in a single-arm phase II study, the required duration of stabilization should be sufficiently long relative to the time-to-progression distribution seen in the absence of effective treatment. The procedures used for detecting progression and the frequency of follow-up should also be comparable between the phase II trial patients and the historical controls. Simon et al. (13, 14), Rosner et al. (15), Rubinstein et al. (11), and Karrison et al. (12) have described randomized phase II designs for comparing a new regimen to a control. Such designs can be used to evaluate whether the time until tumor progression is delayed by administration of the new regimen. Although time-to-progression may not be an accepted measure of patient benefit or a validated surrogate of survival, it may be
useful as a phase II end point as a measure of antitumor effect. It is unrealistic to expect that a phase II end point be validated as a surrogate for survival; if it were so validated, it would be an acceptable phase III end point. It is not widely appreciated how difficult it is to properly validate an end point as a surrogate of survival (16–19). Conducting a randomized phase II study in the neoadjuvant setting is also an option. The neoadjuvant phase II design ensures the availability of tumor tissue for evaluating pharmacodynamic end points. Establishing that a drug inhibits its target is not the same as establishing that the drug inhibits tumor growth. With the neoadjuvant phase II design, one can also assay the surgical specimen for apoptosis of tumor cells. By comparing the apoptosis end point between a control group not receiving the drug and groups receiving the drug at one or more dose levels, an assessment of the antitumor effect of the drug can be obtained. An important advantage of the neoadjuvant phase II design is that the drug is evaluated in patients with less advanced disease. This can be extremely important for identifying the potential of a new drug. Even drugs such as trastuzumab and imatinib (Gleevec) are much more effective when administered earlier.
PREDICTIVE BIOMARKERS

The term biomarker has a variety of different meanings. Traditionally, a biomarker was thought of as a biological measurement which tracks the pace of a disease—increasing as the disease progresses, decreasing as the disease regresses. There are few examples of measurements which really satisfy that definition. We rarely understand the biology of a disease well enough to know whether a measurement qualifies for such a definition, and there are important examples where the medical profession was led astray in evaluating new treatments by relying on measurements which seemed to have strong biological credentials as biomarkers. It is a mistake to propose criteria for validation of biomarkers without focus on the intended use of the biomarker. Validation or qualification of a biomarker has meaning only as “fitness for purpose,” and so the focus should be on purpose. The phrase predictive biomarker has come to mean a biological measurement made before treatment whose purpose is to enable prediction of whether the patient will benefit from a particular treatment. For patients with metastatic breast cancer, estrogen receptor content of the tumor is a predictive biomarker for tamoxifen treatment, and amplification of the Her-2 gene is a predictive biomarker for trastuzumab treatment.
A predictive biomarker can substantially enhance the effectiveness of drug development. For example, approval of trastuzumab for treatment of metastatic breast cancer was based on a phase III randomized clinical trial which restricted entry to patients with 2+ or 3+ expression of the Her-2 protein. Approximately 25% of screened patients had such over-expression. The benefit of trastuzumab in the randomized trial of 469 patients was statistically significant, but not overwhelming (20). Had the biomarker not been measured and used to restrict eligibility, it is almost certain that the difference in outcome for the group as a whole would have been so diluted by the lack of effectiveness of trastuzumab in patients who do not over-express Her-2 that the trial would have been negative. How does one develop, use, and validate a predictive biomarker? We will address these issues here and in subsequent sections. Usually the predictive biomarker will be a measurement of a gene or protein product related to the target of the drug. For a monoclonal antibody, expression of the ligand in tumor cells is a natural candidate. Expression of the ligand is generally a necessary but not sufficient condition for antitumor effectiveness of the antibody. If the molecular target is deregulated by a genomic alteration, it is important to know the nature of that alteration (e.g., somatic point mutation or gene amplification). Such knowledge will facilitate development of an assay for distinguishing patients whose tumors are driven by deregulation of the drug target from other tumors. In cases for which the nature of the genomic alteration of the target is unknown or for which there is uncertainty about the important target of the drug, there will be several candidate predictive biomarkers to be evaluated during phase II development.
The evaluation involves measurement of all candidate predictors on pretreatment tumor tissue and determination of which candidate is most correlated with response to treatment. This differs from attempting to measure the candidates pre- and posttreatment and determine which ones are most changed by treatment. This latter strategy might be appropriate for determining which measure might serve best as a pharmacodynamic end point, but not for determining which is the best predictive biomarker. In evaluating candidate predictive biomarkers, it is desirable to have at least 10 responders for whom the pretreatment measurement has been made, and at least as many nonresponders. Dobbin et al. suggested a minimum of 20 to 30 responders for developing a predictive classifier based on genome-wide transcript-expression profiling (21). The evaluation can be based on any end point that reflects antitumor effect, not just objective response. If one has conducted a randomized
phase II study of the new drug versus a control group not receiving the drug, candidate biomarkers can be evaluated as those for which the effect of treatment on time-to-progression depends on the level of the pretreatment biomarker measurement. In any case, the development of a predictive biomarker during phase II development will generally necessitate increasing the size or number of the phase II trials conducted. In general, the statistical power for evaluating the predictive biomarker will be limited by the number of patients who respond to the drug and for whom a pretreatment measurement is available. In the best situation, a predictive biomarker will be developed during phase II development, and a test for that biomarker will be ready for use in the phase III trial that will provide a definitive evaluation of the new drug. The phrase “be ready for use” means be analytically and pre-analytically validated. Analytical validation means that the assay is standardized to be reproducible and robust to intralaboratory and interlaboratory sources of variation. In the rare cases in which there is a gold standard measurement to which the assay result can be compared, analytical validation means that the assay is accurate, sensitive, and specific relative to the gold standard measurement. Pre-analytical validation means that the assay is robust to real-world variations in tissue procurement and sample handling prior to assay performance. If the predictive biomarker is to be used as an eligibility criterion in the phase III clinical trial, then the test must be developed and analytically validated prior to the trial. If, however, the assay is going to be used for analysis of the trial results, but not as an eligibility criterion, then it might be acceptable if assay development is not completed until the time that the phase III trial will be analyzed.
In the latter case, availability of tumor tissue should be an eligibility criterion for patients in the phase III trial, but the assay may be performed on archived tissue. In this case, the analytical validation of the assay should establish that the results of the assay are robust to archiving of the tissue.
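A toy numeric sketch of screening a candidate predictive biomarker in a randomized phase II trial, as described above: compare treatment-versus-control odds ratios of response within marker-defined strata. All counts below are hypothetical, and a real analysis would use a formal interaction test (or a Cox model for time-to-progression) rather than raw odds ratios:

```python
def odds_ratio(resp_t, n_t, resp_c, n_c):
    """Odds ratio for response, new treatment vs. control,
    from responder counts and arm sizes."""
    return (resp_t / (n_t - resp_t)) / (resp_c / (n_c - resp_c))

# Hypothetical responder counts (responders, arm size) per stratum
or_pos = odds_ratio(18, 30, 6, 30)   # marker-positive stratum: OR = 6.0
or_neg = odds_ratio(7, 30, 6, 30)    # marker-negative stratum: OR ~ 1.2

# A ratio of odds ratios well above 1 suggests the treatment effect
# depends on the marker, i.e., the marker is predictive, not merely prognostic
ratio_of_ors = or_pos / or_neg
```

Here the drug appears to help mainly marker-positive patients; the marker would merit evaluation with the pre-specified designs discussed later in the chapter.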
PREDICTIVE CLASSIFIERS

A predictive classifier is a predictive biomarker to which cut-points have been introduced in order to obtain a discrete classification (22). A binary classifier uses two classes; for example, the class representing those patients with tumors most likely to benefit from the treatment versus those less likely to benefit from the treatment. An immunohistochemical grading system of 0+, 1+, 2+, 3+ may be viewed as a classifier with four classes. In the phase III trial of trastuzumab described
in a previous section, a binary classifier of Her-2 expression of <2+ vs. ≥2+ was used. There are several advantages to using classifiers rather than continuous predictive indices. Having a classifier with a predefined cut-point makes it possible to prospectively use the classifier as an eligibility criterion. Even if the predictive biomarker is not going to be used to limit eligibility, using a binary classifier forces a more focused and less exploratory analysis, which is of fundamental importance. Perhaps the fundamental principle for using predictive biomarkers in therapeutics is to separate the data used to develop the biomarker from the data used to evaluate the treatment in subsets determined by the biomarker. This is much easier to accomplish if the cut-point is established prior to the phase III trial, although specialized methods are available for analyzing a phase III trial using a predictive index without a predefined cut-point (23).
MULTIVARIATE GENE-EXPRESSION-BASED CLASSIFIERS

Predictive classifiers can be developed that combine the information from several biomarkers. Such classifiers are frequently developed based on genome-wide transcript-expression profiling and combine contributions from dozens of genes. For example, Ghadimi et al. (24) developed a classifier for predicting response to preoperative chemoradiotherapy for patients with rectal cancer, and Hess et al. (25) have developed such a classifier for predicting response of patients with breast cancer to preoperative chemotherapy. Many algorithms have been used effectively with DNA microarray data for predicting a binary outcome, for example, response versus nonresponse. Dudoit et al. (26) compared several algorithms using several publicly available data sets. A linear discriminant is a function

l(x) = ∑i∈F wi xi    (Eqn. 24-1)
where xi denotes the logarithm of the expression measurement for the ith gene, wi is the weight given to that gene, and the summation is over the set F of features (genes) selected for inclusion in the class predictor. For a two-class problem, there is a threshold value d, and a sample with expression profile defined by a vector x of values is predicted to be in class 1 or class 2 depending on whether l(x) as computed from equation (24-1) is less than the threshold d or greater than d, respectively. Many types of classifiers are based on linear discriminants of the form shown in equation (24-1). They
differ with regard to how the weights are determined. Diagonal linear discriminant analysis is a special case of Fisher linear discriminant analysis in which the correlation among genes is ignored. By ignoring such correlations, one avoids over-fitting the data and obtains a method which performs better when the number of samples is small. Golub’s weighted voting method (27) and the compound covariate predictor of Radmacher et al. (28) are similar to diagonal linear discriminant analysis. They compute the weights based on the univariate prediction strength of individual genes and ignore correlations among the genes. Support vector machines (SVMs) are very popular in the machine learning literature. Although they sound very exotic, linear kernel SVMs do class prediction using a predictor of the form of equation (24-1). The weights are determined by optimizing a misclassification rate criterion, however, instead of a least-squares criterion as in linear discriminant analysis (29). Although there are more complex forms of SVMs, they appear to be inferior to linear kernel SVMs for class prediction with large numbers of genes (30). In the study of Dudoit et al. (26), the simplest methods, diagonal linear discriminant analysis and nearest neighbor classification, generally performed as well or better than the more complex methods. Tibshirani et al. (31) developed a type of nearest neighbor algorithm called shrunken centroid classification which combines the gene selection and nearest centroid classification components. It is important to distinguish the studies that develop predictive classifiers from those that use previously developed classifiers in phase III studies evaluating a new drug regimen. The vast majority of published prognostic marker studies are developmental. 
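As a concrete sketch of a linear discriminant of the form in equation (24-1), diagonal linear discriminant analysis computes each weight from per-gene class means and a pooled per-gene variance, ignoring gene-gene correlations as described above. The function names and the midpoint threshold rule are illustrative choices:

```python
import numpy as np

def dlda_fit(X, y):
    """Diagonal linear discriminant analysis on an (n samples x p genes)
    matrix X of log-expression values and 0/1 class labels y.
    Returns weights w and threshold d for the rule l(x) = sum_i w_i x_i > d."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    v = (X[y == 0].var(axis=0, ddof=1) + X[y == 1].var(axis=0, ddof=1)) / 2
    w = (m1 - m0) / v              # larger, more reliable separations get larger weights
    d = w @ (m0 + m1) / 2          # threshold: midpoint of the projected class means
    return w, d

def dlda_predict(x, w, d):
    """Class 1 if the discriminant l(x) exceeds the threshold d, else class 0."""
    return int(x @ w > d)
```

Because the per-gene variances enter only on the diagonal, no gene-gene covariance matrix is estimated, which is what makes the method stable when samples are few and genes are many.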
Perhaps the most fundamental principle in the validation of predictive biomarkers is that the data used for developing the predictive biomarker or predictive classifier should be separated from the data used for evaluating treatments in subsets determined by the predictive classifier. Developmental studies are analogous to phase II clinical trials. They should include an indication of whether the pharmacogenomic classifier is promising and worthy of being used in a phase III trial. There are special problems, however, in evaluating whether classifiers based on high-dimensional genomic assays are promising. The difficulty derives from the fact that the number of candidate genes available for use in the classifier is much larger than the number of cases available for analysis. In such situations, it is always possible to find classifiers that accurately classify the data on which they were developed even if there is no true relationship between expression of any
of the genes and outcome (28). Consequently, even in developmental studies, some kind of validation on data not used for developing the model is necessary. This internal validation is usually accomplished either by splitting the data into two portions—one used for training the model and the other for testing the model—or by some form of cross-validation based on repeated model development and testing on random data partitions. This internal validation should not, however, be confused with external, truly independent validation of the classifier. The most straightforward method of estimating the prediction accuracy is the split-sample method of partitioning the set of samples into a training set and a test set. Rosenwald et al. (32) used this approach successfully in their international study of prognostic prediction for large B cell lymphoma. They used two-thirds of their samples as a training set. Multiple kinds of predictors were studied on the training set. When the collaborators of that study agreed on a single fully specified prediction model, they accessed the test set for the first time. On the test set there was no adjustment of the model or fitting of parameters. They merely used the samples in the test set to evaluate the predictions of the model that was completely specified using only the training data. In addition to estimating the overall error rate on the test set, one can also estimate other important operating characteristics of the test, such as sensitivity and specificity. The split-sample method is often used with too few samples in the test set, however. One can evaluate the adequacy of the size of the test set by computing the statistical significance of the classification error rate on the test set or by computing a confidence interval for the test-set error rate. Since the test set is separate from the training set, the number of errors on the test set has a binomial distribution. Michiels et al.
(33) suggested that multiple training-test partitions be used, rather than just one. The split-sample approach is mostly useful, however, when one does not have a well-defined algorithm for developing the classifier. When there is a single training set–test set partition, one can perform numerous unplanned analyses on the training set to develop a classifier and then test that selected single classifier on the test set. With multiple training-test partitions, however, that type of flexible approach to model development cannot be used. If one has an algorithm for classifier development, it is generally better to use one of the cross-validation or bootstrap resampling approaches to estimating error rate (see below), because the split-sample approach does not provide as efficient a use of the available data, as was shown by Molinaro et al. (34).
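Because test-set errors are binomial, the adequacy of the test-set size can be gauged with a standard binomial confidence interval on the error rate. The sketch below uses the Wilson interval, one common choice; the chapter itself does not prescribe a particular interval:

```python
from math import sqrt

def wilson_ci(errors, n, z=1.96):
    """Wilson score confidence interval (default ~95%) for a binomial
    proportion, here the test-set misclassification rate."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g., 10 errors in a 40-sample test set: the interval spans roughly
# 14% to 40%, signaling that the test set is too small to pin down accuracy
lo, hi = wilson_ci(10, 40)
```

A wide interval is exactly the warning the text describes: the split-sample estimate looks precise but the test set cannot support the claimed accuracy.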
Cross-validation is an alternative to the split-sample method of estimating prediction accuracy (28). Molinaro et al. describe and evaluate many variants of cross-validation and bootstrap resampling for classification problems where the number of candidate predictors vastly exceeds the number of cases (34). The cross-validated prediction error is an estimate of the prediction error associated with application of the algorithm for model building to the entire dataset. The so-called resubstitution estimate, although commonly used, is very biased. With the resubstitution estimate, you use all the samples to develop a model and then you predict the class of each sample using that model. The predicted class labels are compared to the true class labels and the errors are totaled. It is well-known that the resubstitution estimate of error is highly biased for small data sets, and the simulation of Simon et al. (35) confirmed that, with 98.2% of the simulated data sets resulting in zero misclassifications even when no true underlying difference existed between the two groups. Simon et al. (35) also showed that cross-validating the prediction rule after selection of differentially expressed genes from the full data set does little to correct the bias of the resubstitution estimator: 90.2% of simulated data sets with no true relationship between expression data and class still result in zero misclassifications. When gene selection was also redone in each cross-validated training set, however, appropriate estimates of misclassification error were obtained; the median estimated misclassification rate was approximately 50%. Although whole-genome transcript-expression profiling is a powerful technology for developing prognostic and predictive classifiers, many studies do not use adequate statistical methods and present claims that may not be justified. Dupuy et al. (36) reviewed 90 studies relating gene-expression profiling to cancer outcomes.
They found at least one serious problem in 50% of the publications and developed a list of “Do’s and Don’ts” for such studies. Simon also published a checklist for evaluating publications dealing with the development of prognostic and predictive gene expression-based classifiers (37). The BRB-ArrayTools software provides extensive resources for development of a wide range of prognostic and predictive classifiers based on gene expression data for binary response or survival end points (38). It was developed for use by biomedical scientists. It provides an environment for developing a classifier on a training set and estimating the accuracy of the model on a test set of data or for using a wide range of valid complete cross-validation and bootstrap resampling methods for estimating the predictive accuracy of the model. BRB-ArrayTools is available for downloading online at http://brb.nci.nih.gov.
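The contrast described above, between cross-validation that leaves gene selection outside the loop and full cross-validation that repeats selection in every training set, can be illustrated on null data in which class labels are unrelated to expression. The sample sizes and the nearest-centroid classifier below are illustrative choices, not the published simulation of Simon et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 1000, 10                  # samples, candidate genes, genes selected
X = rng.standard_normal((n, p))         # pure noise: no true class signal
y = np.arange(n) % 2

def select_genes(Xtr, ytr):
    """Pick the k genes with the largest class-mean difference."""
    diff = np.abs(Xtr[ytr == 0].mean(0) - Xtr[ytr == 1].mean(0))
    return np.argsort(diff)[-k:]

def predict(Xtr, ytr, genes, x):
    """Nearest-centroid classification restricted to the selected genes."""
    d0 = np.sum((x[genes] - Xtr[ytr == 0][:, genes].mean(0)) ** 2)
    d1 = np.sum((x[genes] - Xtr[ytr == 1][:, genes].mean(0)) ** 2)
    return 0 if d0 < d1 else 1

genes_all = select_genes(X, y)          # biased shortcut: selection on ALL the data
err_partial = err_full = 0
for i in range(n):                      # leave-one-out cross-validation
    m = np.arange(n) != i
    err_partial += predict(X[m], y[m], genes_all, X[i]) != y[i]
    err_full += predict(X[m], y[m], select_genes(X[m], y[m]), X[i]) != y[i]

# err_partial / n is optimistically low even though the data are pure noise;
# err_full / n, with selection redone in each fold, hovers near the correct 50%
```

The biased estimate looks impressive only because the left-out sample already influenced which genes were chosen, which is the flaw the checklist of Simon (37) warns against.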
The patients included in a developmental study of a predictive classifier should be appropriate to enable identification of patients who are most likely to benefit from the new drug in a phase III study. For example, a phase III study of patients with advanced disease who have failed first-line chemotherapy might involve comparing survival for patients receiving the new drug with survival for patients receiving palliative care. Patients from single-arm phase II trials of the new drug could be used to develop a predictive classifier of those patients likely to respond to the new drug. Dobbin et al. (39) have studied sample size considerations for developmental studies of predictive binary classifiers based on whole-genome transcript-expression profiling, recommending generally at least 20 to 30 cases in each class, and have made available a Web-based computer program for planning the development of such classifiers (http://brb.nci.nih.gov). Having expression profiles of 20 to 30 responders would generally require a larger phase II developmental program than is conventional. Pusztai et al. (40) described a simulation example in which whole-genome transcript-expression profiling based on too small a number of responders was unsuccessful in discovering a known predictive biomarker, and they recommended focusing on candidate genes. A predictive classifier of the patients whose tumors are responsive to a new drug given as a single agent may be useful for a phase III trial of the drug in which it is not administered as a single agent and for which the end point is not tumor response. It may be difficult, however, to create a phase II database with sufficient numbers of responders to the new drug as a single agent for development of a predictive classifier based on whole-genome transcript-expression profiling.
In many ways the best resource for developing a predictive classifier for use in a pivotal trial is a collection of pretreatment tumor specimens from patients enrolled in such a pivotal trial. For example, archived material from a failed pivotal trial of the drug can be used to develop a biomarker classifier of patients most likely to benefit from the drug compared to the control. The classifier can be based on the actual end point used in the clinical trial or upon an intermediate end point such as progression-free survival for which there may be more events available. By failed pivotal trial, I mean a trial for the same target population of patients which did not establish a statistically significant benefit for the drug for the randomized patients as a whole. The classifier developed based on archived material in a failed pivotal trial should be considered to have the same status as a classifier based on a phase II database. That is, the classifier should be used for designing a new pivotal trial that establishes the clinical benefit of the drug
in a prospectively specified subset of patients. Using the same pivotal trial to develop a predictive classifier and to test treatment effects in subsets determined by the classifier is generally problematic. Freidlin et al. (41) have shown how one pivotal trial can potentially be used for both purposes, however, if the set of patients used to develop the classifier is kept distinct from the set of patients used to evaluate treatment benefit. Generally, however, the studies should be kept separate. Developmental studies are exploratory, though they should result in completely specified binary classifiers. Studies on which claims of drug benefit are based should be nonexploratory, but should instead test prospectively defined hypotheses about treatment effect in a predefined patient population.

PREDICTIVE CLASSIFIERS IN PROSPECTIVE PHASE III CLINICAL TRIALS

Enrichment Design

The objective of a phase III pivotal clinical trial is to evaluate whether a new drug, given in a defined manner, has medical utility for a defined set of patients. Pivotal trials test prespecified hypotheses about treatment effectiveness in specified patient population groups. The role of a predictive classifier is to prospectively specify the population of patients in whom the new treatment will be evaluated. By prospectively specifying the patient population in a manner defined in the protocol, one assures that adequate numbers of such patients are available, and one avoids the problems of post hoc subset analysis. The process of classifier development may be exploratory and subjective based on data collected prior to the phase III trial, but the use of the classifier in the phase III trial should not be. Figure 24.1 shows a design in which a classifier is used to define eligibility for a randomized clinical trial comparing a new drug to a control regimen.

FIGURE 24.1 Enrichment design. (Patients are evaluated with the diagnostic test; test positive patients are randomized between the new treatment and control; test negative patients are off study.)

This approach was used for the development of
trastuzumab. Patients with metastatic breast cancer whose tumors expressed HER2 at a 2+ or 3+ level based on an immunohistochemistry test were eligible for randomization (20). The clinical trial randomized 469 patients, but the number whose tumors were tested was not stated. If approximately 75% of patients had available specimens and adequate tests and 25% of patients with adequate tests were HER2 positive, then about 2,500 patients would have to be screened to obtain 469 eligible for randomization. Simon et al. (42–44) studied the efficiency of this approach relative to the standard approach of randomizing all patients without measuring the diagnostic. They found that the efficiency of the enrichment design depended on the prevalence of test positive patients and on the effectiveness of the new treatment in test negative patients. For binary endpoint trials, they showed that the ratio of the number of patients to be randomized for the standard trial (nS) to the number randomized in the enrichment trial (nE) is approximately

nS/nE ≈ f / [prev + (1 − prev)(d−/d+)]²   (Eqn. 24-2)
where prev is the proportion of patients who are test positive, d− is the treatment effectiveness for test negative patients, and d+ is the treatment effect for test positive patients. The parameter f is a constant that does not depend on the prevalence or treatment effects; it is generally close in value to 1 unless the control response rate is very low. In cases where the new treatment is completely ineffective in test negative patients, the formula above simplifies to approximately 1/prev². Often, however, it is unrealistic to expect that the treatment will be completely ineffective for test negative patients. The treatment may have some effectiveness for test negative patients either because the assay is imperfect for measuring deregulation of the putative molecular target or because the drug has off-target effects. Since these alternatives cannot generally be distinguished, there is little value in decomposing d− into these components. However, because of the limited specificity of the test, d− may not be zero. If the new treatment is half as effective in test negative patients as in test positive patients, then the right-hand side of expression (24-2) simplifies to approximately 4/(prev + 1)². This equals about 2.56 when 25% of the patients are test positive, indicating that the enrichment design reduces the number of patients required for randomization by a factor of 2.56. In order to obtain nE test positive patients for randomization, one must screen approximately nE/prev patients. Simon et al. (42–44) also compared the enrichment design to the standard design with regard to the number of screened patients. These methods of sample-size planning for the design of enrichment trials and other designs described by Simon (45) that do not exclude patients based on the predictive biomarker are available online (http://linus.nci.nih.gov/brb). The Web-based programs are available for binary and survival/disease-free survival (DFS) endpoints. The planning takes into account the performance characteristics of the tests and the specificity of the treatment effects. The programs are easy to use and provide comparisons to standard designs that do not use predictive classifiers, based on the number of randomized patients required and the number of patients needed for screening. When fewer than half of the patients are test positive and the new treatment is relatively ineffective in test negative patients, the number of randomized patients required for an enrichment design is often dramatically smaller than the number required for a standard design. This was the case for trastuzumab. The enrichment design that led to approval of trastuzumab was conducted in 469 patients with metastatic breast cancer whose tumors over-expressed HER2 based on immunohistochemical analysis in a central laboratory. The results were highly significant with regard to several end points, including 1-year survival rate (78% vs. 67%). The trial of 469 patients provides 90% power for detecting a 13.5% improvement in the 1-year survival rate above a baseline of 67% with a two-sided 5% significance level. Assuming that HER2 negative patients would not benefit from trastuzumab, if the trial had not been restricted to HER2 positive patients, then the overall improvement in 1-year survival rate would have been only 3.375%, and a total of about 8,050 patients would have been required for 90% power to detect such a small effect. This is 17.2 times as many patients as for the enrichment design, in good agreement with the ratio of 16 computed from the approximate form of equation (24-2) with f = 1.
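The approximation in expression (24-2) is easy to check numerically. The following sketch (taking f = 1; it is not the authors' Web-based software) reproduces the two special cases discussed above:

```python
def enrichment_ratio(prev, effect_ratio, f=1.0):
    """Approximate n_standard / n_enrichment from Eqn. 24-2.

    prev         -- proportion of patients who are test positive
    effect_ratio -- d-/d+, the treatment effect in test negative patients
                    relative to the effect in test positive patients
    f            -- constant close to 1 unless the control response
                    rate is very low
    """
    return f / (prev + (1.0 - prev) * effect_ratio) ** 2

# Treatment completely ineffective in test negative patients: 1/prev^2.
print(enrichment_ratio(0.25, 0.0))   # → 16.0
# Treatment half as effective in test negative patients: 4/(prev+1)^2.
print(enrichment_ratio(0.25, 0.5))   # → 2.56
# Patients screened per randomized patient is roughly 1/prev = 4 here.
```

With 25% prevalence these give the ratios of 16 and 2.56 quoted in the text.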
If the test negative patients benefit half as much as the test positive patients, then 1,254 total patients would be required for 90% power with the standard design. This is 2.7 times as many as for the enrichment design, but is much less than the number required for screening with the enrichment design. Focusing initial development on test positive patients can lead to clarity in determining who benefits from the drug. If the enrichment design establishes that the drug is effective in test positive patients, the drug could later be developed in test negative patients. This is preferable to testing new drugs in heterogeneous populations, which risks false negative results for the overall population. The enrichment design is particularly appropriate for contexts where there is such a strong biological basis for believing that test negative patients will not
benefit from the new drug that including them would be unethical. If the treatment is shown to be effective in test positive patients and if there is a robust assay for the test, then it could be argued that medical utility has been demonstrated for the new treatment and for the test. If, however, the test requires approval as a medical device, then either biological or empirical evidence that the new drug is not effective in test negative patients will be required. In some cases this might be achieved using data from phase II single-arm studies that did not restrict entry based on the classifier.

Including Both Test Positive and Test Negative Patients

Instead of using the predictive classifier as an exclusion criterion, both test positive and test negative patients may be included and randomized between the new treatment and control group, as indicated in Figure 24.2 (46–48). It is essential that an analysis plan be predefined in the protocol for how the predictive classifier will be used in the analysis. It is not sufficient to just stratify the randomization with regard to the classifier without specifying a complete analysis plan. In fact, for many statisticians stratification of the randomization is not essential for inference; its main importance is that it assures that only patients with adequate specimens and interpretable test results will enter the trial. It is important to recognize that the purpose of this design is to evaluate the new treatment in the subsets determined by the prespecified classifier. The purpose is not to modify the classifier. If the classifier is a composite gene expression based classifier, the purpose of the design is not to reexamine the contributions of each component. If one makes any such changes, then an additional phase III trial may be needed to evaluate treatment benefit in subsets determined by the new classifier. In moving from post hoc correlative science to
FIGURE 24.2 Designs that include test negative and test positive patients. (Patients are evaluated with the diagnostic test; test positive and test negative patients are each randomized between the new treatment and control.)
reliable predictive medicine, it is important to separate the data used for developing classifiers from the data used for testing treatment effects in subsets determined by those classifiers. Only by honoring this principle can reliable conclusions be achieved. Simon (49) described several specific analysis plans for phase III trials that utilize a predefined predictive classifier. Here we shall describe one approach based on the traditional statistical approach to the analysis of data in which cases are classified by treatment and by a binary covariate that may affect treatment efficacy. The first step is to test whether there is a significant interaction between treatment efficacy (treatment vs. control) and the covariate (test negative and positive). The interaction test is often performed at a threshold above the traditional 5% level. If the interaction test is not significant, then the treatment effect is evaluated overall, not within levels of the covariate. If the interaction test is significant, then the treatment effect is evaluated separately within the levels of the covariate (e.g., test positive and test negative classes). This is similar to the test proposed by Sargent (47). Having 90% power for detecting a 50% reduction in the hazard of failure in test positive cases with a two-sided 5% significance level requires approximately 88 events in test positive patients. If the event rates in the positive and negative patients are the same at the time of analysis and 25% of cases are test positive, then when there are 88 events in test positive patients, there will be approximately 264 events in test negative patients. At that point the interaction test will have approximately 93.7% power at a one-sided significance level of 0.10 for detecting an interaction with a 50% reduction in hazard for test positive patients and no treatment effect in test negative patients.
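The event counts used in this discussion can be reproduced with the standard Schoenfeld approximation for a 1:1 randomized log-rank comparison (a sketch; the chapter does not state which formula was used):

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90, two_sided=True):
    """Schoenfeld approximation: total events needed for a 1:1
    randomized log-rank test to detect a given hazard ratio."""
    z_a = NormalDist().inv_cdf(1 - alpha / (2 if two_sided else 1))
    z_b = NormalDist().inv_cdf(power)
    return ceil(4 * (z_a + z_b) ** 2 / log(hazard_ratio) ** 2)

print(required_events(0.5))        # 50% hazard reduction → 88 events
print(required_events(2.0 / 3.0))  # 33% hazard reduction → 256 events
```

These match the 88 events quoted here and the approximately 256 events quoted later for detecting a uniform 33% hazard reduction.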
If the treatment also reduces the hazard by 33% in test negative patients, the interaction test has little power, but that is fine because the overall analysis of treatment effect will be appropriate in that circumstance. Computer simulations indicate that the two-stage analysis plan with αI = 0.10 has the following properties: With 88 events in test positive patients and 264 events in test negative patients, the design detects a significant interaction and a significant treatment effect in test positive patients in 88% of replications when the treatment reduces hazard by 50% in test positive patients and is ineffective in test negative patients. If the treatment reduces hazard by 33% in both test positive and test negative patients, the interaction is nonsignificant and the overall treatment effect is significant in 87% of cases. The overall treatment effect refers to the comparison of treatment to control that includes both test negative and test positive patients. If one were planning a trial to detect a uniform 33% reduction in hazard with 90% power and 5% two-sided
significance level, one would require approximately 256 events. If 25% of the cases were test positive and the control group event rates in test negative and test positive patients are about the same, then at time of analysis there would be approximately 64 events in test positive cases and 192 events in test negative cases. If the treatment reduces hazard by 33% in both test positive and test negative patients, the interaction is nonsignificant and the overall treatment effect is significant in approximately 81% of cases. If the treatment reduces hazard by 50% in test positive cases and is ineffective in test negative cases, then the interaction is significant and the treatment effect in test positive cases is significant in 76% of replications. So even if the trial is sized for detecting a uniform 33% reduction in hazard, the two-stage analysis plan will have approximately 76% power for detecting a substantial treatment effect restricted to the test positive patients. Web-based computer programs for designing trials using this approach and with other analysis plans have been developed by Simon and Zhao (available online at http://brb.nci.nih.gov).
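The two-stage analysis plan described above can be written out as simple decision logic (a sketch with hypothetical function and argument names; the thresholds follow the text, with a more liberal α for the interaction test):

```python
def two_stage_conclusion(p_interaction, p_overall, p_test_pos, p_test_neg,
                         alpha_interaction=0.10, alpha=0.05):
    """Predefined two-stage analysis plan: test the treatment-by-marker
    interaction first. If nonsignificant, evaluate the overall treatment
    effect; if significant, evaluate the treatment effect separately
    within the marker-defined subsets."""
    if p_interaction >= alpha_interaction:
        return "overall effect" if p_overall < alpha else "no claim"
    claims = []
    if p_test_pos < alpha:
        claims.append("effect in test positive patients")
    if p_test_neg < alpha:
        claims.append("effect in test negative patients")
    return "; ".join(claims) if claims else "no claim"

# Marker-restricted benefit: strong interaction, effect only in test positives.
print(two_stage_conclusion(p_interaction=0.02, p_overall=0.50,
                           p_test_pos=0.003, p_test_neg=0.60))
# → effect in test positive patients
```

The key design point is that this branching is fixed in the protocol before unblinding, not chosen after inspecting the data.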
PROGNOSTIC BIOMARKERS

Prognostic markers are baseline measurements that provide information about the likely long-term outcome of patients either untreated or with standard treatment. Prognostic markers can be used to help determine whether a patient requires any systemic treatment, or any treatment beyond the standard treatment used for the patients in the study. The oncology literature is replete with publications on prognostic factors, but very few of these are used in clinical practice. For example, Pusztai et al. (50) identified 939 publications over a 20-year period on prognostic factors in breast cancer, but only four factors (ER, PR, HER2, and urokinase plasminogen activator) were at that time recommended for use by the American Society of Clinical Oncology. Prognostic factors are rarely used unless they help with therapeutic decision making. Most prognostic factor studies are conducted using a convenience sample of patients whose tissues are available (51). Often these patients are too heterogeneous with regard to treatment, stage, and standard prognostic factors to support therapeutically relevant conclusions (52). Many publications attempt to show that new factors are “independently prognostic” or are more prognostic than standard factors, but these analyses often fail to identify a role for the new factors in therapeutic decision making. Multivariate analysis is often used in developmental studies to support the claim that the new classifier is more important than standard prognostic/
predictive factors. Usually this is done because the patients are too heterogeneous and not therapeutically relevant. A multivariate analysis, however, is an inadequate solution to the problem of case selection that is not therapeutically relevant. If the cases selected are too heterogeneous to be therapeutically relevant, it is better to analyze more homogeneous subsets separately than to perform multivariate analysis. Multivariate analysis does not address predictive accuracy, which is the endpoint of concern in developmental studies. Hayes et al. have developed and used a tumor marker utility grading system (TMUGS) to assist in the evaluation of the clinical utility of tumor markers (53, 54). The REMARK guidelines will hopefully promote better study design for the development of prognostic markers (55). Prognostic classifiers based on gene expression data should be developed in a manner that addresses a focused therapeutic decision context. For example, the Oncotype DX recurrence index was developed by studying women with breast cancer whose tumors were estrogen receptor positive and had not spread to the axillary lymph nodes and who had received tamoxifen as the only systemic therapy (56, 57). A score was developed based on tumor expression of 21 genes to identify women whose DFS was sufficiently good that they might elect to forgo cytotoxic therapy. Prognostic factors developed in such a focused manner can be relevant for therapeutic decisions. The score is often used as a classifier by introducing two cut-points to distinguish patients with low, moderate, and high risks of tumor recurrence. Prognostic classifiers can be therapeutically relevant if they identify a set of patients who have such a good prognosis without aggressive systemic therapy that they may choose to be spared the risks and inconvenience of such therapy and forgo the small potential benefit. Such a classifier needs to be validated before it is “ready for prime time” (58). 
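Turning a continuous score into a three-level classifier with two cut-points, as described above, is mechanically simple (the cut-point values here are illustrative defaults only, not an endorsement of any assay's published thresholds):

```python
def risk_group(score, low_cut=18, high_cut=31):
    """Map a continuous recurrence score to a risk class using two
    cut-points (illustrative values; each assay defines its own)."""
    if score < low_cut:
        return "low"
    elif score < high_cut:
        return "intermediate"
    return "high"

print([risk_group(s) for s in (10, 25, 40)])
# → ['low', 'intermediate', 'high']
```

The clinically hard part is not this mapping but validating, as discussed next, that the resulting "low" group truly has a good enough prognosis to forgo cytotoxic therapy.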
Validation can be ideally accomplished by prospectively applying the classifier to eligible patients, identifying a sufficiently large cohort that are deemed good risk by the classifier, and withholding intensive systemic therapy from them. This is the approach being used to validate Oncotype DX in the TAILORx study (59) and the Netherlands Cancer Institute 70-gene signature in the MINDACT study (60) for patients with node negative primary breast cancer. Both studies involve up-front testing of all entered patients with node negative breast cancer after local therapy and endocrine therapy for ER-positive patients. The primary analysis of both studies involves estimation of distant metastasis-free survival in patients predicted to be low risk by the multigene signature for whom cytotoxic therapy is withheld. Patients in the MINDACT study considered low risk by the 70-gene signature are actually
randomized to receive or not receive chemotherapy if they are not low risk by current practice standards. Nevertheless, if the predictive classifier is accurate, the potential benefit of chemotherapy is so small in absolute terms for those patients that the primary analysis of the study involves simply evaluating the long-term DFS of the patients for whom chemotherapy is withheld. By validating these predictive signatures in a fully prospective manner rather than by using archived tissue from a previously conducted series, one assures that an adequate number of patients are studied, that assay results are available on all patients, and that assay results reflect real-world tissue handling and laboratory variation. Such studies are expensive and time consuming, however. In some cases, effective validation of a classifier predictive of low recurrence risk can be accomplished with specimens archived from an appropriate clinical trial that withheld chemotherapy from such patients. Convincing results are only possible, however, if the number of patients is sufficiently large, if the proportion with available specimens adequate for testing is high, and if careful analytical and pre-analytical validation provides assurance that assay results on archived samples are accurate predictors of assay results on fresh tissue. A prognostic biomarker can also be used to identify patients whose outcome is very poor under standard chemotherapy. Such patients may be good candidates for experimental regimens. But unless there is a viable therapeutic option, such prognostic biomarkers may not be widely used in general practice.
CONCLUSION

The genomics revolution is influencing clinical trials in fundamental ways. It has become increasingly clear that many of the entities currently treated in clinical trials are molecularly distinct and unlikely to be responsive to the same treatments. Industry, academic medicine, funding bodies, regulatory agencies, and clinical trial methods have not adequately adapted to this discovery. The genomic revolution is providing powerful tools for improving the health of patients. Randomized clinical trials will continue to play an essential role in therapeutics evaluation, but statisticians and clinical investigators face important new challenges in moving from a posture of retrospective correlation to one that brings about reliable predictive oncology. Predictive oncology based upon patient genetics and disease genomics is an achievable goal and offers many benefits to patients and societies. Development and utilization of prognostic and predictive biomarkers, while offering great advantages, complicates the drug development process and is therefore easily inhibited by
confusion, underfunding, and overregulation. Meeting the challenges and taking advantage of the opportunities for therapeutic progress will require new approaches to the design and analysis of clinical trials, the funding of clinical trials, and the regulation of drugs and companion diagnostics. Rapid progress is possible, but will require leadership and partnership among academia, industry, and government.
References

1. Simon R. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. Eur J Cancer. 2008;44(18):2707–2713. 2. Eisenhauer EA, O’Dwyer PJ, Christian M, Humphrey JS. Phase I clinical trial design in cancer drug development. J Clin Oncol. 2000;18(3):684–692. 3. Rubinstein LV, Simon RM. Phase I clinical trial design. In: Budman DR, Calvert AH, Rowinsky EK, eds. Handbook of Anticancer Drug Development. Philadelphia, PA: Lippincott Williams & Wilkins; 2003:297–308. 4. Parulekar WR, Eisenhauer EA. Phase I trial design for solid tumor studies of targeted, non-cytotoxic agents: theory and practice. J Natl Cancer Inst. 2004;96(13):977–978. 5. Kummar S, Kinders R, Rubinstein L, et al. Compressing drug development timelines in oncology using phase ‘0’ trials. Nat Rev Cancer. 2007:131–139. 6. Simon R. Optimal two-stage designs for phase II clinical trials. Cont Clin Trials. 1989;10:1–10. 7. Ratain MJ. Phase II oncology trials: let’s be positive. Clin Cancer Res. 2005;11(16):5661–5662. 8. Thall PF, Simon R, Estey E. New statistical strategy for monitoring safety and efficacy in single-arm clinical trials. J Clin Oncol. 1996;14:296. 9. Ratain MJ. Phase II studies of modern drugs directed against new targets: if you are fazed, too, then resist RECIST. J Clin Oncol. 2004;22(22):4442–4445. 10. Korn EL, Arbuck SG, Pluda JM, Simon R, Kaplan RS, Christian MC. Clinical trial designs for cytostatic agents: are new approaches needed? J Clin Oncol. 2001;19:265–272. 11. Rubinstein LV, Korn EL, Freidlin B, Hunsberger S, Ivy SP, Smith MA. Design issues of randomized phase 2 trials and a proposal for phase 2 screening trials. J Clin Oncol. 2005;23:7199–7206. 12. Karrison TG, Maitland ML, Stadler WM, Ratain MJ. Design of phase II cancer trials using a continuous endpoint of change in tumor size: application to a study of sorafenib and erlotinib in non-small-cell lung cancer. J Natl Cancer Inst. 2007;99(19):1455–1461. 13.
Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treat Reports. 1985;69:1375–1381. 14. Simon RM, Steinberg SM, Hamilton M, et al. Clinical trial designs for the early clinical development of therapeutic cancer vaccines. J Clin Oncol. 2001;19:1848–1854. 15. Rosner G, Stadler W, Ratain M. Randomized discontinuation design: application to cytostatic antineoplastic agents. J Clin Oncol. 2002;20(22):4478–4484. 16. Torri V, Simon R, Russek-Cohen E, Midthune D, Freidman M. Relationship of response and survival in advanced ovarian cancer patients treated with chemotherapy. J Natl Cancer Inst. 1992;84:407. 17. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125(7): 605–613. 18. Korn EL, Albert PS, McShane LM. Assessing surrogates as trial endpoints using mixed models. Stat Med. 2004;24: 163–182.
19. Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–67. 20. Slamon DJ, Leyland-Jones B, Shak S, et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. New Engl J Med. 2001;344(11):783–792. 21. Dobbin K, Simon R. Sample size planning for developing classifiers using high dimensional DNA expression data. Biostatistics. 2007;8:101–117. 22. Simon R. Interpretation of genomic data: questions and answers. Seminars in Hematology. 2008;45(3):196–204. 23. Jiang W, Freidlin B, Simon R. Biomarker adaptive threshold design: a procedure for evaluating treatment with possible biomarker-defined subset effect. J Natl Cancer Inst. 2007;99:1036–1043. 24. Ghadimi BM, Grade M, Difillippantonio MJ, et al. Effectiveness of gene expression profiling for response prediction of rectal adenocarcinomas in preoperative chemoradiotherapy. J Clin Oncol. 2005;23:1826–1838. 25. Hess KR, Anderson K, Symmans WF, et al. Pharmacogenomic predictor of sensitivity to preoperative paclitaxel and 5-fluorouracil, doxorubicin, cyclophosphamide chemotherapy in breast cancer. J Clin Oncol. 2006;24(26):4236–4244. 26. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97:77–87. 27. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. 28. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comp Biol. 2002;9:505–512. 29. Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences. 2001;98:15149–15154. 30. Ben-Dor A, Bruhn L, Friedman N, et al.
Tissue classification with gene expression profiles. J Comp Biol. 2000;7:536–540. 31. Tibshirani R, Hastie T, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572. 32. Rosenwald A, Wright G, Chan WC, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–1947. 33. Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple validation strategy. Lancet. 2005;365:488–492. 34. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21(15):3301–3307. 35. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003;95:14–18. 36. Dupuy A, Simon R. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 2007;99:147–157. 37. Simon R. A checklist for evaluating reports of expression profiling for treatment selection. Clin Adv Hematol Oncol. 2006;4:219–224. 38. Simon R, Lam A, Li MC, Ngan M, Menenzes S, Zhao Y. Analysis of gene expression data using BRB-ArrayTools. Cancer Informatics. 2007;2:11–17.
39. Dobbin KK, Zhao Y, Simon RM. How large a training set is needed to develop a classifier for microarray data? Clin Cancer Res. 2008;14:108–114. 40. Pusztai L, Anderson K, Hess KR. Pharmacogenomic predictor discovery in phase II clinical trials for breast cancer. Clin Cancer Res. 2007;13:6080–6086. 41. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clin Cancer Res. 2005;11:7872–7878. 42. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clin Cancer Res. 2005; 10:6759–6763. 43. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials: supplement and correction. Clin Cancer Res. 2006;12:3229. 44. Maitournam A, Simon R. On the efficiency of targeted clinical trials. Stat Med. 2005;24:329–339. 45. Simon R. Using genomics in clinical trial design. Clin Cancer Res. 2008;14:5984–5993. 46. Pusztai L, Hess KR. Clinical trial design for microarray predictive marker discovery and assessment. Ann Oncol. 2004;15: 1731–1737. 47. Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. J Clin Oncol. 2005;23(9):2020–2027. 48. Simon R, Wang SJ. Use of genomic signatures in therapeutics development. Pharmacogenomics J. 2006;6:166–173. 49. Simon R. Designs and adaptive analysis plans for pivotal clinical trials of therapeutics and companion diagnostics. Expert Review of Molecular Diagnostics. 2008;2(6):721–729. 50. Pusztai L. Perspectives and challenges of clinical pharmacogenomics in cancer. Pharmacogenomics J. 2004;5(5):451–454. 51. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology. Br J Cancer. 1994;69:979–985. 52. Henry NL, Hayes DF. Uses and abuses of tumor markers in the diagnosis, monitoring and treatment of primary and metastatic breast cancer. 
Oncologist. 2006;11:541–552. 53. Hayes DF, Bast RC, Desch CE, et al. Tumor marker utility grading system: a framework to evaluate clinical utility of tumor markers. J Natl Cancer Inst. 1996;88:1456–1466. 54. Hayes DF, Trock B, Harris AL. Assessing the clinical impact of prognostic factors: when is “statistically significant” clinically useful? Breast Cancer Research and Treatment. 1998;52: 305–319. 55. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). Br J Cancer. 2005;93:387–391. 56. Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351:2817–2826. 57. Paik S. Development and clinical utility of a 21-gene recurrence score prognostic assay in patients with early breast cancer treated with Tamoxifen. Oncologist. 2007;12:631–635. 58. Simon R. When is a genomic classifier ready for prime time? Nature Clinical Practice: Oncology. 2004;1(1):2–3. 59. Sparano JA, Paik S. Development of the 21-gene assay and its application in clinical practice and clinical trials. J Clin Oncol. 2008;26(5):721–728. 60. Bogaerts J, Cardoso F, Buyse M, et al. Gene signature evaluation as a prognostic tool: challenges in the design of the MINDACT trial. Nature Clinical Practice: Oncology. 2006;3(10): 540–551.
25
Imaging in Clinical Trials
Binsheng Zhao and Lawrence H. Schwartz
Radiographic imaging has been increasingly incorporated into all phases of oncology clinical trials, particularly early phase trials, for purposes including (but not limited to) patient selection, proof of biologic activity, and surrogate biomarkers for clinical outcomes (1). Early phase clinical trials are designed to assess the safety and preliminary efficacy of drugs in small numbers of patients. Oncology end points, the measurement precision of those end points, and the response assessment criteria can significantly affect study duration, cost, and success. In addition to clinical and/or physical assessments, medical imaging, which can visibly depict the extent of disease and quantify drug-induced morphological and/or functional changes, plays an important role in drug development.
A REVIEW OF THE IMAGING MODALITIES USED

Conventional computed tomography (CT) and magnetic resonance imaging (MRI) are generally considered the best currently available, accurate, and reproducible imaging modalities for identifying target lesions on baseline scans performed prior to the onset of therapy, measuring and following size changes of the target lesions, and identifying new lesions on subsequent scans for monitoring and assessing response to therapy.
These modalities provide sectional images showing detailed (both normal and abnormal) anatomical structures inside the body, allowing tumors to be detected and tumor size to be measured and followed over time. Over the past years, our understanding of drug mechanisms of action has increased dramatically. As a result, drug design has shifted away from traditional cytotoxic agents (which typically shrink tumors) toward cytostatic agents (which usually inhibit tumor growth, resulting in tumor stabilization rather than shrinkage). Meanwhile, quantitative imaging technologies have evolved from the anatomical to the functional to keep pace with advances in drug discovery. By noninvasively characterizing and measuring cancer biological processes in vivo at the molecular level, functional imaging techniques such as positron emission tomography (PET) with fluorine-18-fluorodeoxyglucose (18F-FDG) and dynamic contrast-enhanced (DCE) MRI have shown great promise for early response assessment and are increasingly used in early phase clinical trials testing molecularly targeted anticancer agents.

Conventional CT and MRI

CT scans are well suited for imaging the neck, chest, abdomen, and pelvis, where anatomical and pathological structures can be differentiated by their attenuation differences. Intravenous, oral, and rectal CT
contrast media are often administered to highlight vascular structures, specific organs, and tissues so that abnormalities can be better depicted and identified. More advanced CT imaging techniques, such as multiphase imaging protocols, are applied in certain studies (e.g., hepatocellular carcinoma treated with hepatic artery embolization) to better differentiate vasculature, viable tumor, necrosis, and surrounding parenchyma (Fig. 25.1). With the remarkable advances in multidetector (MD) CT technology, CT scans can be acquired at ultrafast speed with isotropic, submillimeter spatial resolution. The new generation of MDCT offers more accurate tumor measurements, particularly tumor volume, which may provide more accurate and earlier assessment of therapeutic response than conventional measurements do.

FIGURE 25.1 Triphasic CT imaging of hepatocellular carcinoma (HCC). The upper row shows an HCC before onset of hepatic embolization; the lower row shows the same HCC 6 weeks after treatment. (A) Precontrast. (B) Arterial phase. (C) Portal phase.

MRI can also provide high-quality anatomical information. It is, however, used less frequently than CT in oncology studies because of its cost, complexity, scan duration, and limitations in imaging the chest, a frequent site of metastatic spread. In some regions of the body, however, MRI is the preferred method (e.g., the head, for assessing brain tumors). Moreover, MRI should be used where CT is medically contraindicated. Because multiple sequences can be produced during a single MRI study, lesions must be measured at the same body levels on the same sequence in baseline and subsequent follow-up examinations. In clinical trials, CT and MRI are widely used to identify target lesions at baseline, monitor (target) lesion size changes during the course of therapy, and detect new lesions on follow-up examinations.

PET

A PET scanner captures signals emitted from a radiopharmaceutical (e.g., 18F-FDG) that contains a radionuclide and is injected into the patient. Studies have found that 18F-FDG uptake on PET is associated with cell growth rate and proliferative capacity (2, 3). Because cancer cells are more metabolically active, their FDG uptake on PET images is more intense than that of normal tissues. A decrease in the FDG uptake of a tumor indicates decay of tumor activity and thus possible death of viable tumor cells. These findings allow change in tumor glucose metabolism during therapy to be used to assess tumor response. A number of semiquantitative and quantitative parameters derived from 18F-FDG PET are used to quantify tumor glucose metabolism. Since metabolic alterations precede anatomic changes after the onset of therapy, the ability to measure viable cancer cells by imaging the metabolic activity of tumors has made PET the cutting-edge functional technology for early response assessment in clinical trials of new anticancer drugs. 18F-FDG PET has already added special value to the accuracy of response assessment in lymphoma, in which a residual mass after completion of therapy often contains fibrosis or necrosis that is indistinguishable from viable tumor on CT scans. This scenario particularly limits the use of CT to categorize complete remission and partial response of tumors involving the lymph nodes. Functional PET imaging, however, can differentiate viable tumor from fibrosis and necrosis by identifying focal areas of increased metabolic activity. One shortcoming of PET imaging, the lack of anatomical information, has been resolved with the advent of hybrid PET/CT scanners. To enhance the ability to differentiate tumor cells from healthy tissue, more effective, tumor-specific radiotracers are under intensive investigation.
Among the numerous tracers being studied, the thymidine analogue 3′-deoxy-3′-[18F]fluorothymidine (18F-FLT) shows promise as a potential biomarker for quantifying tumor proliferation in several types of cancer (4–6).
DCE-MRI

Tumor survival requires oxygen, which is delivered by the blood. Angiogenesis, the formation of new capillaries from existing blood vessels, is a process necessary for the growth of malignant tumors and the development of metastases. Angiogenesis can be inhibited by antiangiogenic agents, and existing blood vessels can be disrupted by vascular disrupting
compounds. By tracking the passage of an intravenously injected bolus of a contrast agent (e.g., a low-molecular-weight paramagnetic gadolinium chelate) through the tumor vasculature, the signal intensity changes on repeatedly acquired T1-weighted DCE-MRI images can be converted into contrast agent concentration data, to which kinetic modeling can be applied to produce parameters sensitive to physiological processes including tissue microvessel perfusion, permeability, and the extravascular extracellular space (EES) (7). Changes in these parameters with therapy can thus be used to evaluate the antiangiogenic effects of cancer treatments. Since alterations in tumor vascularity occur earlier than changes in tumor size after initiation of therapy (8), DCE-MRI is increasingly used as a surrogate biomarker in early clinical trials testing new antiangiogenic and vascular disrupting compounds.
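The kinetic modeling mentioned above is most often the standard Tofts compartmental model. The chapter does not give the equations; they are added here for orientation, with the usual notation assumed. The model relates the tissue contrast agent concentration Ct(t) to the plasma concentration Cp(t):

```latex
% Standard Tofts model; notation assumed, not taken from this chapter.
C_t(t) \;=\; K^{\mathrm{trans}} \int_0^t C_p(\tau)\, e^{-k_{ep}\,(t-\tau)}\, \mathrm{d}\tau,
\qquad
k_{ep} \;=\; \frac{K^{\mathrm{trans}}}{v_e},
```

where ve is the fractional EES volume and kep is the rate constant for contrast agent efflux from the EES back to the plasma. Fitting this model to the concentration-time data yields the KTrans estimates whose therapy-induced decrease is discussed later in this section.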
WHAT CAN THESE MODALITIES BE USED FOR?

With more and more anticancer drugs available and under evaluation, clinicians and radiologists may face more than one choice for treating cancer patients and monitoring tumor changes with therapy. Many factors, including (but not limited to) patient clinical characteristics, cancer type and stage, the specific drug, and the available imaging equipment, jointly influence the choice of treatment plan and of the methodology for assessing response to therapy. Personalized medicine will play a central role in future clinical trials testing novel anticancer agents. Imaging modalities can play different roles at different stages of clinical trials. For instance, they can be used to stratify patients to appropriate clinical trials (e.g., PET) (9), serve as early surrogate biomarkers for prediction of survival (e.g., PET and DCE-MRI) (10), and assess time-to-progression (e.g., CT). In the following subsections, the use of imaging modalities to quantify tumor changes (as surrogate biomarkers), identify target lesions in the baseline study, detect new lesions in follow-up studies, and confirm tumor response after completion of therapy is discussed.
Surrogate Imaging Biomarkers

In clinical trials, especially phase II trials, end points are likely to be surrogate end points such as biomarkers of the disease. A biomarker is defined as a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention (11). Biomarkers should be relevant to drug mechanisms of action so that they can be used to evaluate treatment efficacy. Quantitative imaging biomarkers can thus be changes in tumor size measured on CT or MRI, 18F-FDG uptake on PET, or modeled parameters derived from DCE-MRI during the course of therapy. A tumor should eventually shrink if a drug is effective. Change in tumor size between baseline and follow-up scans has been widely used to quantify tumor response and progression in clinical trials and in routine clinical practice. To date, tumor mass has been approximated by the greatest diameter (unidimensional measurement) or the product of the two greatest perpendicular diameters (bidimensional measurement) of the tumor in a transverse image plane. Although contemporary CT scanners can produce submillimeter isotropic image data that allow accurate estimation of tumor volume and volume change, and evidence is beginning to show the potential for earlier detection of tumor change by volume than by one or two diameters (Fig. 25.2) (12), unidimensional and bidimensional response assessment criteria remain widely used in clinical trials, in part for historical reasons. The lack of efficient and robust computer-aided measurement tools also hinders volumetric response assessment from being thoroughly validated and widely accepted in clinical trials.

FIGURE 25.2 Unidimensional, bidimensional, and volumetric measurements of a tumor on CT scans. A non-small-cell lung cancer (NSCLC) patient was treated with gefitinib. (A) Baseline. (B) 28 days after onset of therapy. Between the baseline and follow-up studies, the tumor's unidimensional, bidimensional, and volumetric measurements decreased by 2.6%, 14.2%, and 25.4%, respectively.
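As a concrete illustration of the three size metrics compared in Figure 25.2, the same percent-change formula is applied to each; only the underlying quantity differs. The diameters and volume below are hypothetical values for the sketch, not the chapter's data:

```python
# Illustrative sketch (hypothetical numbers): percent change of the three
# tumor size metrics discussed above.
def pct_change(baseline, followup):
    """Percent change from baseline to follow-up."""
    return 100.0 * (followup - baseline) / baseline

# Unidimensional: greatest in-plane diameter (mm).
uni = pct_change(50.0, 40.0)
# Bidimensional: cross product of the two greatest perpendicular diameters.
bi = pct_change(50.0 * 30.0, 40.0 * 27.0)
# Volumetric: segmented tumor volume in mm^3 (assumed values).
vol = pct_change(19000.0, 11000.0)
print(uni, bi, round(vol, 1))  # the volumetric change is largest in magnitude
```

Because volume scales roughly with the cube of diameter, modest diameter changes translate into much larger volume changes, which is why volumetric measurement may detect response earlier.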
FDG uptake on PET is a measure of glucose metabolism, which has been shown to be associated with growth rate and proliferative capacity (3, 13, 14). Glucose metabolism is higher in tumor cells than in healthy cells. Therefore, a decrease in the uptake of a tumor after the initiation of therapy indicates possible death of tumor cells and efficacy of the treatment. A number of methods have been developed to assess the FDG uptake of tumors on PET, including visual interpretation (e.g., visual comparison of metabolic activity between tumor and background tissues such as the mediastinal blood pool on 18F-FDG PET), semiquantitative measurements (e.g., the standard uptake value [SUV]), and analytical kinetic techniques (e.g., the rate of glucose metabolism) (15, 16). The SUV is defined as the attenuation-corrected 18F-FDG accumulation in a lesion normalized to the injected dose and the patient's body weight, body surface area, or lean body mass. Although the SUV is a single snapshot of a dynamic process, it has been widely used in response assessment because of its simplicity and its ability to quantify the number of viable tumor cells (Fig. 25.3). Full kinetic quantitative analysis can provide an absolute rate of 18F-FDG uptake over the measurement time. However, the complexity of dynamic data acquisition and the need for arterial blood sampling make kinetic analysis difficult to implement in clinical settings.
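A minimal sketch of the body-weight-normalized SUV definition above, with assumed units and hypothetical values (decay correction of the measured activity to a common time point is presumed already applied):

```python
# Sketch: body-weight-normalized SUV as described above.
# SUV = lesion activity concentration / (injected dose / body weight).
def suv_bw(activity_conc_bq_per_ml, injected_dose_bq, body_weight_g):
    return activity_conc_bq_per_ml / (injected_dose_bq / body_weight_g)

# Hypothetical example: 20 kBq/mL lesion uptake, 370 MBq injected dose,
# 70 kg patient.
print(round(suv_bw(20_000.0, 370e6, 70_000.0), 1))  # prints 3.8
```

Substituting body surface area or lean body mass for body weight in the denominator yields the alternative normalizations mentioned in the text.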
FIGURE 25.3 18F-FDG PET/CT in the monitoring of tumor size and metabolic changes with therapy. (A) Baseline CT (upper row) and PET (lower row) (obtained from a hybrid PET/CT scanner) in an NSCLC patient who had discontinued treatment with gefitinib. One target lesion (arrow) in the right lung is indicated on CT and on the corresponding PET. (B) PET/CT after 3 weeks without gefitinib. Both tumor size (i.e., unidimensional maximal tumor diameter) measured on CT and tumor SUVmax measured on 18F-FDG PET increased. The patient then resumed treatment with gefitinib. (C) PET/CT 3 weeks after resumption of gefitinib. Both tumor size and SUVmax decreased. At this time point, daily everolimus was added. (D) PET/CT 3 weeks after the combined treatment with gefitinib and everolimus. Tumor size and SUVmax continued to decrease (17).
A number of semiquantitative and quantitative methods are being applied to DCE-MRI data to assess tissue physiology and contrast agent kinetics (18). Semiquantitative parameters include simple features that characterize the signal intensity-time curve, such as the gradient, the initial area under the time-signal curve (AUC), and the time to maximum enhancement. These parameters are readily computed from the sequentially acquired T1-weighted MRI. However, they do not accurately reflect the contrast agent concentration and show considerable variation across scanner settings and individual examinations. Difficulties in direct comparison among and within studies render the semiquantitative methods less useful in the evaluation of response to therapy. Quantitative parameters, derived from pharmacokinetic modeling of contrast agent concentration data (a nonlinear conversion of the MRI signal intensity changes observed during the dynamic acquisition), allow more robust study of contrast agent kinetics, including vessel permeability, perfusion/flow, blood volume, and EES volume (7). The two parameters of most interest in vascular response assessment are the initial area under the contrast agent concentration-time curve (IAUC) and the volume transfer constant of the contrast agent between the blood plasma and the EES (KTrans). The former is a model-free parameter that is easy to compute and robust; however, its relationship with the underlying physiology is complex and undefined. The latter is derived from relatively simple compartmental kinetic models and reflects a nonlinear composite of several physiologic processes: blood flow, endothelial surface area, and endothelial permeability. A decrease in KTrans indicates that a drug is active. The quantitative parameters are independent of the acquisition protocol but are complicated to derive.
The interpretation of a parameter change during the course of therapy must take into consideration the specific model used for the analysis.

Target Lesion Identification

Anatomical imaging, particularly CT, remains the modality of choice for cancer detection, diagnosis, treatment planning, response assessment, and monitoring of disease recurrence. CT has the ability to accurately and reproducibly depict both normal and abnormal anatomical structures inside the body. Furthermore, CT is widely available, relatively inexpensive, and easy to operate. For these reasons, a CT scan is often the first choice for the baseline study (i.e., a study performed shortly before the onset of therapy). If the two dates (i.e., the date the diagnosis is made and the date
the therapy begins) are close enough, the diagnostic CT scan can serve as the baseline CT study, in which the target lesions (and nontarget lesions as well), which may reside in different sites of the body, are identified and their sizes quantitatively measured for monitoring tumor changes in the follow-up studies.

Identification of Response

Objective response rate (ORR) is the most widely used quantitative measure for categorizing tumor response to a therapeutic agent and assessing the efficacy of a new agent in clinical trials, and it is based on the percentage change in tumor size (or other response parameters) between baseline and follow-up studies. The methodology used to assess tumor response is vitally important, as a trial's outcome may provide support for drug approval by regulatory agencies and thus determine the fate of the experimental drug under investigation. The drug's mechanism of action, along with the cancer type, plays a critical role in the selection of the imaging technique and response assessment methodology for both clinical trials and routine clinical practice. Traditional cytotoxic drugs were designed to disrupt the division of rapidly dividing cells; it is anticipated that tumors will shrink considerably and ultimately be destroyed over a certain period after the onset of therapy. In addition, until the recent emergence of functional imaging technologies, the available imaging modalities were anatomical. Tumor size change measured between serial imaging examinations has therefore been used as a surrogate marker for evaluating clinical benefit in the development and use of therapies. The development of the newer generation of molecularly targeted agents, however, aims at identifying molecular targets that play specific roles in tumor cell growth and differentiation and at inhibiting those targets.
To date, over 1,000 molecular targets have been linked to some aspect of neoplasia, with corresponding agents in preclinical development, in clinical development, or on the market. Two promising targets are the vascular endothelial growth factor (VEGF) pathway, a key mediator of tumor angiogenesis, and the epidermal growth factor receptor (EGFR), the cell-surface receptor for members of the epidermal growth factor family of extracellular protein ligands. Molecularly targeted agents are usually cytostatic rather than cytotoxic: they may slow or stop the growth of cancer cells without causing the tumor shrinkage that can be observed on anatomical images. Therefore, traditional size-based response methodologies may no longer be appropriate for recognizing tumor changes, or may detect tumor response (by size) only well after the
initiation of therapy. If molecularly targeted drugs inhibit the proliferation of cancer cells, the treatment-induced metabolic change can be detected as a reduction in FDG uptake, the metabolic biomarker for measuring the efficacy of molecularly targeted therapies. For the emerging vascular-targeted compounds that block blood flow to and from tumors, DCE-MRI is a more appropriate biomarker for measuring the efficacy of antiangiogenic and vascular disrupting agents because of its high spatial resolution and its ability to quantify vascular changes, such as the rate of tumor blood flow. Compared with tumor size change, the functional biomarkers allow metabolic and/or vascular alterations of a tumor to be detected at a much earlier time point; these functional biomarkers are therefore increasingly used in early clinical trials testing new anticancer agents. Because it provides both metabolic and anatomic information in fused images, hybrid 18F-FDG PET/CT is increasingly requested a certain time after the completion of treatment to confirm the treatment result and to monitor disease recurrence.

Assessment of Time-to-Progression

Survival prolongation is the ultimate goal of cancer treatment. However, such a clinical outcome usually takes a long time to attain and may be confounded by other factors, such as non-cancer-related death or the multiple treatments a patient may receive. In addition to the ORR, another surrogate end point that has proven predictive of mortality in metastatic colorectal cancer and NSCLC is time-to-progression (TTP) (19). TTP is defined as the time from a patient's entry onto a clinical trial to the date when disease progression or death is documented. Disease progression is usually assessed on the basis of radiological changes in tumors and the established response evaluation criteria.
To obtain TTP information, follow-up examinations are typically performed every 6 to 8 weeks during treatment.
RESPONSE ASSESSMENT CRITERIA

Uniform interpretation and valid comparison of outcomes among clinical trials, particularly multicenter trials, demand at least the use of a consistent imaging acquisition technique/protocol and standardized response assessment criteria. Guidelines on appropriate imaging procedures and protocols for optimally visualizing and reliably measuring tumors on serial examinations to evaluate tumor response to
therapy should be recommended based upon our best understanding of drug mechanisms of action and pertinent prior experience. Historically, tumor response to therapy has been assessed by measuring tumor size on serial radiographic examinations using the criteria known as the World Health Organization (WHO) criteria (20, 21) or the Response Evaluation Criteria in Solid Tumors (RECIST) (22); the latter is a simplification and extension of the former. With recent innovations in molecularly targeted agents and functional imaging technologies, the evaluation of tumor response to therapy has opened a new window: measuring tumor metabolism and vasculature at the molecular level. Guidelines for functional/molecular response assessment have been proposed for 18F-FDG PET (23, 24) and DCE-MRI (25–27), and functional imaging biomarkers are being intensively investigated and validated in early phase clinical trials worldwide.

WHO Criteria

The WHO response assessment criteria were proposed in the late 1970s and early 1980s, after recognition of the need to standardize the reporting of clinical outcomes from cancer treatment trials (20, 21). The guideline suggested using the bidimensional measurement (i.e., the cross product of the greatest diameter of the tumor and its greatest perpendicular diameter on a transverse image) to approximate tumor burden and the sum of the cross products of the target lesions to quantify tumor change on serial examinations. Based upon the rate of change of the sum of the bidimensional measurements, the WHO recommended reporting the result of cancer response to therapy using four categories: complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD) (Table 25.1). A size reduction of 50% or more from the baseline study was considered a response (PR), whereas a size increase of 25% or more was deemed progression (PD). The presence of any new disease would be considered progressive disease, as would any substantial enlargement of a tumor that was not easily measured.

RECIST Criteria

Since the inception of the WHO criteria, several variations of the response assessment criteria had been used in clinical trials. In 1994, the European Organisation for Research and Treatment of Cancer (EORTC), the National Cancer Institute (NCI) of the United States, and the NCI of Canada trials group set up a task force to review the existing criteria used in the evaluation of responses to treatments. The committee then published a new guideline for response assessment in solid tumors based on a thorough, retrospective review of collaborative studies (22). Their recommendations included simplifying the tumor size measurement to the greatest diameter only (unidimensional measurement) and using the sum of the greatest diameters, rather than the sum of the cross products, to quantify tumor change on serial examinations. These new criteria became known as RECIST (Table 25.1) and were guided by several important principles: (1) the need to maintain the standard four
TABLE 25.1. WHO and RECIST Response Criteria

Complete Response (CR). WHO: disappearance of all disease, as confirmed at 4-week follow-up. RECIST: disappearance of all disease, as confirmed at 4-week follow-up.

Partial Response (PR). WHO: 50% decrease or more in the sum of the cross products of measurable disease, as confirmed at 4 weeks. RECIST: 30% decrease or more in the sum of the maximal diameters of measurable disease, as confirmed at 4 weeks.

Stable Disease (SD). WHO: neither PR nor PD. RECIST: neither PR nor PD.

Progressive Disease (PD). WHO: 25% increase or more in the sum of the cross products or in a single lesion, or presence of new disease. RECIST: 20% increase or more in the sum of the maximal diameters of measurable disease, or presence of new disease.
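The RECIST thresholds in Table 25.1 can be condensed into a short sketch. This is illustrative only: the full criteria also require confirmation scans for CR/PR and measure progression from the smallest recorded sum rather than from baseline; those details are omitted here, and the function name and inputs are our own:

```python
# Illustrative sketch of the RECIST categories in Table 25.1 (simplified:
# no confirmation scans, progression measured from baseline, not nadir).
def recist_category(baseline_sum, followup_sum, new_lesions=False):
    """Classify response from sums of the greatest diameters (mm)."""
    if new_lesions:
        return "PD"                  # presence of new disease
    if followup_sum == 0:
        return "CR"                  # disappearance of all disease
    change = 100.0 * (followup_sum - baseline_sum) / baseline_sum
    if change <= -30.0:
        return "PR"                  # >= 30% decrease
    if change >= 20.0:
        return "PD"                  # >= 20% increase
    return "SD"

print(recist_category(100.0, 65.0))  # prints PR
```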
categories of response assessment (CR, PR, SD, PD); (2) the goal of maintaining consistency of results, such that no major discrepancy in the meaning of partial response would exist between the older WHO criteria and the new criteria; (3) recognition of both the arbitrary nature of the cutoff for PR and the need to maintain this cutoff until other, potentially more reliable or powerful surrogates could be developed; (4) concern about categorizing patients as PD too easily; and (5) recognition that cytostatic agents may not show the same measurable activity and that other serum markers and specific tumors may present unique challenges (22).

International Working Group (IWG) and International Harmonization Project (IHP) Criteria

Lymphoma, unlike most solid tumors, usually resides within the normal structure of the lymph nodes. Conventional response criteria, when applied to lymphoma clinical trials, have encountered difficulties, particularly in classifying CR and PR, because of the size variation of normal lymph nodes and because posttreatment residual masses often consist of nontumor components, such as fibrosis, necrosis, or inflammation, that can be indistinguishable from tumor. One study showed that changing the definition of normal node size from 1 cm × 1 cm to 1.5 cm × 1.5 cm (or to 2 cm × 2 cm) significantly affected the CR rates in a rituximab trial (28). Recognizing the confusion in lymphoma clinical trials, an IWG of lymphoma experts, including physicians, radiologists, and pathologists, reached a consensus on the standardization of response assessment in adult patients with indolent and aggressive non-Hodgkin's lymphomas (NHL). The resulting guidelines were published in 1999 (29).
The IWG criteria specifically defined the size ranges of normal lymph nodes after treatment: for lymph nodes greater than 1.5 cm at baseline, CR is reached if these nodes regress to 1.5 cm or less after therapy; for baseline nodes between 1.1 cm and 1.5 cm, the nodes must regress to 1.0 cm or less after therapy to qualify as CR. The IWG guidelines have provided clinicians with uniform criteria to interpret and compare the outcomes of lymphoma clinical trials and have facilitated the development and approval of new therapies for lymphoma. However, these 1999 criteria were established on the basis of the then best available imaging technique (i.e., CT) and cannot help differentiate viable tumor from necrosis or fibrosis.
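The nodal size rule just described can be sketched as follows. The final branch, for nodes already at normal size at baseline, is an assumption added for completeness and is not stated in the text:

```python
# Sketch of the IWG post-treatment nodal size rule for CR (sizes in cm).
def node_regressed_to_normal(baseline_cm, post_cm):
    if baseline_cm > 1.5:
        return post_cm <= 1.5   # must regress to <= 1.5 cm
    if baseline_cm >= 1.1:
        return post_cm <= 1.0   # must regress to <= 1.0 cm
    return True                 # assumed: <= 1.0 cm at baseline counts as normal
```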
Over the past 15 years, remarkable advances have been made in molecular imaging technologies. Functional PET imaging with the 18F-FDG radiotracer is particularly suitable for imaging treated lymphoma, as it is capable of distinguishing between viable tumor and necrosis/fibrosis in residual mass(es) (30). The increased availability of FDG PET scanners and the use of immunohistochemistry and flow cytometry led an IHP to considerably revise the IWG criteria for lymphoma clinical trials (31). Of note, the new criteria are valid for all types of lymphoma, and tumor response to therapy is assessed on the basis of combined information from the tumor changes observed and measured on FDG PET and CT (Table 25.2). In addition, the IHP criteria suggest the time points of the follow-up studies at which response should be assessed (i.e., ≥3 weeks after chemotherapy and 6–12 weeks after chemoimmunotherapy and radiation therapy).

18F-FDG PET Response Criteria
To ensure the quality of PET imaging of tumors and to allow comparison of PET findings reported from multicenter trials, standardization of the 18F-FDG PET imaging technique/protocol and response criteria for clinical trials is required. Such standardization is imperative, as a number of factors associated with this technology can influence the FDG uptake measurement and thus the response assessment. In 1999, after reviewing the status of the technique, the EORTC PET study group proposed and published the first recommendations for common measurement standards and criteria for reporting changes in 18F-FDG uptake to assess clinical and subclinical responses to anticancer treatments (23). The guidelines made recommendations on patient preparation, timing of the PET scan, attenuation correction, 18F-FDG dose, methods to measure 18F-FDG uptake, tumor sampling, measurement reproducibility, and the definition of metabolic response. In parallel to the RECIST criteria, the EORTC criteria use four categories (i.e., progressive metabolic disease, stable metabolic disease, partial metabolic response, and complete metabolic response) to define metabolic response and progression based on the percentage change in 18F-FDG uptake (Table 25.3). The EORTC criteria have served as the guidelines for monitoring tumor response in early phase clinical trials using 18F-FDG PET as a surrogate biomarker. With the increasing use of 18F-FDG PET in oncology practice over the past years, more experience has been gained with FDG PET. In 2005, the Cancer Imaging Program of the NCI of the United States convened a workshop and reviewed the latest progress of 18F-FDG PET in both diagnosis and response assessment.
TABLE 25.2. IHP Response Criteria for Lymphoma (31).

CR. Definition: disappearance of all evidence of disease. Nodal masses: (a) FDG-avid or PET positive prior to therapy: mass of any size permitted if PET negative; (b) variably FDG-avid or PET negative: regression to normal size on CT. Spleen, liver: not palpable, nodules disappeared. Bone marrow: infiltrate cleared on repeat biopsy; if indeterminate by morphology, immunohistochemistry should be negative.

PR. Definition: regression of measurable disease and no new sites. Nodal masses: ≥50% decrease in SPD of up to 6 largest dominant masses; no increase in size of the other nodes. (a) FDG-avid or PET positive prior to therapy: one or more PET positive at previously involved site; (b) variably FDG-avid or PET negative: regression on CT. Spleen, liver: ≥50% decrease in SPD of nodules (for a single nodule, in greatest transverse diameter); no increase in size of liver or spleen. Bone marrow: irrelevant if positive prior to therapy; cell type should be specified.

SD. Definition: failure to attain CR/PR or PD. Nodal masses: (a) FDG-avid or PET positive prior to therapy: PET positive at prior sites of disease and no new sites on CT or PET; (b) variably FDG-avid or PET negative: no change in size of previous lesions on CT.

Relapsed Disease or PD. Definition: any new lesion or increase by ≥50% of previously involved sites from nadir. Nodal masses: appearance of a new lesion(s) >1.5 cm in any axis, ≥50% increase in SPD of more than one node, or ≥50% increase in longest diameter of a previously identified node >1 cm in short axis; lesions PET positive if FDG-avid lymphoma or PET positive prior to therapy. Spleen, liver: >50% increase from nadir in the SPD of any previous lesions. Bone marrow: new or recurrent involvement.

Abbreviations: CR, complete remission; FDG, [18F]fluorodeoxyglucose; PET, positron emission tomography; CT, computed tomography; PR, partial remission; SPD, sum of the product of the diameters; SD, stable disease; PD, progressive disease. Reprinted with permission from (31).
25 IMAGING IN CLINICAL TRIALS
TABLE 25.3
EORTC Response Criteria (23).

Progressive Metabolic Disease (PMD): 25% increase or more in tumor 18F-FDG SUV, visible increase in the extent of 18F-FDG tumor uptake (20% or more in the longest diameter), or appearance of new 18F-FDG uptake in metastatic lesions.
Stable Metabolic Disease (SMD): Neither PMR nor PMD.
Partial Metabolic Response (PMR): 15% reduction or more in tumor 18F-FDG SUV after one cycle of chemotherapy, and 25% reduction or more after more than one treatment cycle.
Complete Metabolic Response (CMR): Complete resolution of 18F-FDG uptake within the tumor volume.
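The percentage-change thresholds in Table 25.3 can be turned into a simple classifier. The following is an illustrative sketch only (function and argument names are our own, it assumes a positive baseline SUV, and the published criteria additionally weigh the visible extent of uptake):

```python
def eortc_metabolic_response(baseline_suv, followup_suv,
                             cycles_completed=1, new_lesions=False):
    """Classify metabolic response per the EORTC 1999 thresholds (Table 25.3)."""
    if new_lesions:
        return "PMD"          # new 18F-FDG-avid metastatic lesions
    if followup_suv == 0:
        return "CMR"          # complete resolution of 18F-FDG uptake
    change = (followup_suv - baseline_suv) / baseline_suv * 100.0
    if change >= 25.0:
        return "PMD"          # >= 25% increase in tumor SUV
    # the response threshold tightens after the first treatment cycle
    threshold = -15.0 if cycles_completed == 1 else -25.0
    if change <= threshold:
        return "PMR"
    return "SMD"

print(eortc_metabolic_response(8.0, 5.0, cycles_completed=2))  # 37.5% drop -> PMR
```

In practice such a rule would sit downstream of the SUV measurement itself, which is where most of the variability discussed later in this chapter enters.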
As an outcome of the workshop, revised and refined consensus recommendations were published in 2006 (24). These recommendations will be used to design and guide the ever-growing number of NCI-sponsored clinical trials utilizing 18F-FDG PET as an indicator of therapeutic response (32).

DCE-MRI Response Criteria

In recent years, DCE-MRI has increasingly been used as an end point to assess vascular response in early phase clinical trials evaluating antiangiogenic and antivascular agents. Because of technical complexity, disease heterogeneity, and the nonlinear relationships between the derived response parameters (e.g., Ktrans) and the underlying physiological processes, DCE-MRI readouts can vary considerably among studies and need to be interpreted carefully. Over the past several years, consensus recommendations for the design and analysis of clinical trials that incorporate DCE-MRI investigations have been outlined by specialist panels in the United Kingdom and the United States and are increasingly used for the development and validation of the DCE-MRI technique as a surrogate imaging biomarker in early clinical trials (25–27). The recommendations have mainly focused on the following issues: imaging protocol, type of measurement methods, primary and secondary end
points, trial design, pharmacokinetic models, data analysis, measurement reproducibility, and future developments. The parameter Ktrans is commonly used as the primary end point in early phase clinical trials, and a decrease in Ktrans of more than 40% is considered a response (33, 34). The use of DCE-MRI in the assessment of therapy response is still in its infancy, and intensive investigations are under way.
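As a minimal illustration of the greater-than-40% decrease convention cited above (the function and variable names are our own, and real trials define the threshold and reproducibility limits in the protocol):

```python
def ktrans_decrease_is_response(baseline_ktrans, followup_ktrans,
                                threshold_pct=40.0):
    """Flag a vascular response when Ktrans falls from baseline by more than
    `threshold_pct` percent (sketch of the convention cited in refs 33, 34)."""
    pct_decrease = (baseline_ktrans - followup_ktrans) / baseline_ktrans * 100.0
    return pct_decrease > threshold_pct

print(ktrans_decrease_is_response(0.30, 0.15))  # 50% decrease -> True
```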
SOURCES OF VARIABILITY FOR EACH MODALITY

Quantitative imaging response assessments are based on measured changes in tumor size, metabolism, or vascularity on serial examinations. A measured tumor change, however, consists of the real tumor change (if the tumor does change) plus measurement variation. Although each imaging modality has its own sources of variability, in general, variations arise during the procedures of image acquisition and tumor measurement. In addition, patient or tumor changes (e.g., loss of weight or tumor shrinkage during therapy) can also introduce measurement variation. To date, how these factors affect change measurement and thus influence response assessment is not yet fully understood or systematically explored. To meaningfully interpret and compare trial results, use of a uniform imaging technique and imaging protocol throughout the trial(s), together with central review, plays a vital role.

Variations in tumor size measurement on serial CT scans can be caused by nonuniform imaging techniques (e.g., contrast-enhanced images acquired at different phases), inconsistent imaging parameters (e.g., slice thickness, image reconstruction algorithm), different measurement tools (e.g., manual measurement, computer software), and repeat CT scans (35–40). The biggest variation may come from radiologists' manual measurements (i.e., the so-called intra- and intermeasurement variations) or from computer software developed to assist in the size measurements. Such variations can be profound if lesions possess irregular shapes and have fuzzy boundaries.

Change in 18F-FDG uptake measured on PET represents a complex biological process, though it is linked to alterations in the tumor's proliferative activity.
Sources of variability in the uptake measurement include the chemosensitivity of the tumor to the drug, the patient's blood glucose level, patient body weight and tumor size, fasting time before the PET scan, the dose of 18F-FDG injected, the time to the start of scanning after tracer injection, the image reconstruction algorithm, the data analysis software, the method chosen to measure 18F-FDG uptake, and the selection of the tumor region-of-interest (ROI) (23, 24).
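Several of these factors (injected dose, body weight) enter directly into the standardized uptake value itself. A sketch of the common body-weight-normalized SUV follows (variable names are our own; decay correction of activities to a common time point is assumed but not shown):

```python
def suv_body_weight(tissue_activity_kbq_ml, injected_dose_mbq, body_weight_kg):
    """Body-weight-normalized SUV:
    tissue activity concentration / (injected dose / body weight).
    Assumes decay-corrected activities and that 1 mL of tissue weighs ~1 g."""
    dose_kbq = injected_dose_mbq * 1000.0   # MBq -> kBq
    weight_g = body_weight_kg * 1000.0      # kg -> g
    return tissue_activity_kbq_ml / (dose_kbq / weight_g)

# 25 kBq/mL lesion, 350 MBq injected, 70 kg patient
print(suv_body_weight(25.0, 350.0, 70.0))  # 5.0
```

The formula makes the dependence on injected dose and body weight explicit: an error in either propagates directly into the SUV and hence into the response classification.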
Interpretation of tumor perfusion through pharmacokinetic analyses of DCE-MRI is complicated and subject to variation because of the dynamic acquisition procedure, the different kinetic models, and data analysis based on a human-defined ROI (or VOI). A number of factors can introduce variation into the measurements of the response parameters and affect the DCE-MRI outcomes. These factors include, but are not limited to, the imaging technique/protocol, the injection of the contrast medium bolus, the heterogeneous physiologies of tumors and their reactions to agents, the selection of the kinetic model, the measurement of the arterial input function (AIF), and tumor ROI placement (26, 27).
COMPUTER-AIDED RESPONSE ASSESSMENT

Quantitative imaging biomarkers can be either manually measured by a radiologist (e.g., tumor diameter) or calculated with the help of a computer or computer modeling (e.g., Ktrans). Manual measurement is prone to subjective error and often lacks reproducibility. Moreover, measuring tumor volume, an accurate measure of tumor burden, would be extremely time-consuming and impractical if done manually. Computer algorithms are therefore anticipated to play an important role in providing accurate and reproducible volume measurements. Contemporary image-viewing workstations with powerful manipulation, measurement, and analysis tools, provided by either medical imaging equipment companies or software application vendors, have been assisting radiologists and physicians in efficient and accurate image interpretation for more than a decade. In the field of quantitative imaging response assessment, increasingly diversified and robust image processing methods are expected to be developed for automated/semiautomated segmentation, registration, and change analysis. Such software packages can be used either in a standalone mode or as additional modules integrated into existing diagnostic workstations. Successful development of computer software tools would support the development and qualification of existing and new quantitative imaging parameters as surrogate imaging biomarkers for the assessment of tumor response to therapy, both in an ever-increasing number of future clinical trials and in routine clinical practice.
SUMMARY

Drug mechanism of action and cancer type and stage play a key role in the selection of an appropriate imaging modality, or a combination of multiple imaging modalities, for the assessment of tumor response to therapy in clinical trials. The image acquisition protocol needs to be defined before a trial starts and strictly followed throughout the trial to minimize measurement inconsistency caused by varying imaging techniques and protocols. A good general principle in response-assessment imaging is that subsequent follow-up examinations should be performed with the same modality and the same imaging technique; in this manner, comparison of tumor change is possible and meaningful. The introduction of new modalities, or imaging of new body parts, should be performed if there are questions based upon initial imaging or if new clinical symptoms are present. Increasingly, radiologists are also asked to participate in clinical investigations or clinical trials involving novel imaging or therapeutic agents to assess their efficacy. In the years to come, advanced computer software for automated or semiautomated quantification of tumor changes with therapy is expected to be developed, validated, and applied to clinical trials, assisting in the development and qualification of existing and new imaging biomarkers as surrogate end points, especially for early phase clinical trials evaluating novel anticancer agents. General guidelines regarding good clinical practice are appropriate in these investigations as well; at present the greatest difference lies in the degree of quantification required for clinical investigational studies and in the documentation of the findings. Several consensus recommendations with regard to trial design, response assessment methodology and criteria, quality control, and outcome analysis have been proposed for the different imaging modalities to guide ongoing and future clinical trials.
Ultimately, individual patient care and accurate overall interpretation of the imaging findings are paramount, whether in a routine case or in one involving a clinical investigation.
References

1. Ollivier L, Husband JE, Leclere J. Assessment of response to treatment. In: Husband JH, Reznek RH, eds. Imaging in Oncology. 3rd ed. Taylor and Francis; 2009 (in press).
2. Higashi K, Clavo AC, Wahl RL. Does FDG uptake measure proliferative activity of human cancer cells? In vitro comparison with DNA flow cytometry and tritiated thymidine uptake. J Nucl Med. 1993;34:414–419.
3. Minn H, Joensuu H, Ahonen A, et al. Fluorodeoxyglucose imaging: a method to assess the proliferative activity of human cancer in vivo. Cancer. 1988;61:1776–1781.
4. Vesselle H, Grierson J, Muzi M, et al. In vivo validation of 3′-deoxy-3′-[18F]fluorothymidine (18F-FLT) as a proliferation imaging tracer in humans: correlation of 18F-FLT uptake by positron emission tomography with Ki-67 immunohistochemistry and flow cytometry in human lung tumors. Clin Cancer Res. 2002;8:3315–3323.
5. van Westreenen HL, Cobben DC, Jager PL, et al. Comparison of 18F-FLT PET and 18F-FDG PET in esophageal cancer. J Nucl Med. 2005;46:400–404.
6. Chen W, Cloughesy T, Kamdar N, et al. Imaging proliferation in brain tumors with 18F-FLT PET: comparison with 18F-FDG. J Nucl Med. 2005;46:945–952.
7. Tofts PS, Brix G, Buckley DL, et al. Estimating kinetic parameters from dynamic contrast-enhanced T(1)-weighted MRI of a diffusible tracer: standardized quantities and symbols. J Magn Reson Imaging. 1999;10:223–232.
8. Morgan B, Thomas AL, Drevs J, et al. Dynamic contrast-enhanced magnetic resonance imaging as a biomarker for the pharmacological response of PTK787/ZK 222584, an inhibitor of the vascular endothelial growth factor receptor tyrosine kinases, in patients with advanced colorectal cancer and liver metastases. J Clin Oncol. 2003;21:3955–3964.
9. Lordick F, Ott K, Krause B, et al. PET to assess early metabolic response and to guide treatment of adenocarcinoma of the oesophagogastric junction: the MUNICON phase II trial. Lancet Oncol. 2007;8:797–805.
10. de Geus-Oei LF, van der Heijden HFM, Corstens FHM, et al. Predictive and prognostic value of FDG-PET in NSCLC: a systematic review. Cancer. 2007;110:1654–1664.
11. Biomarkers Definitions Working Group. Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther. 2001;69:90–95.
12. Zhao B, Schwartz LH, Moskowitz C, et al. Computerized quantification of tumor response in lung cancer—initial results. Radiology. 2006;241:892–898.
13. Herholz K, Rudolf J, Heiss WD. FDG transport and phosphorylation in human gliomas measured with dynamic PET. J Neurooncol. 1992;12:159–165.
14. Duhaylongsod FG, Lowe VJ, Patz EF, et al. Lung tumor growth correlates with glucose metabolism measured by fluoride-18 fluorodeoxyglucose positron emission tomography. Ann Thorac Surg. 1995;60:1348–1352.
15. Larson SM, Erdi Y, Akhurst T, et al. Tumor treatment response based on visual and quantitative changes in global tumor glycolysis using PET-FDG imaging: the visual response score and the change in total lesion glycolysis. Clin Positron Imaging. 1999;2:159–171.
16. Hoekstra CJ, Paglianiti I, Hoekstra OS, et al. Monitoring response to therapy in cancer using [18F]-2-fluoro-2-deoxy-D-glucose and positron emission tomography: an overview of different analytical methods. Eur J Nucl Med. 2000;27:731–743.
17. Riely GJ, Kris MG, Zhao B, et al. Prospective assessment of discontinuation and reinitiation of erlotinib or gefitinib in patients with acquired resistance to erlotinib or gefitinib followed by the addition of everolimus. Clin Cancer Res. 2007;13:5150–5155.
18. Collins DJ, Padhani AR. Dynamic magnetic resonance imaging of tumour perfusion. IEEE Eng Med Biol Mag. 2004;23:65–83.
19. Johnson KR, Ringland C, Stokes BJ, et al. Response rate or time to progression as predictors of survival in trials of metastatic colorectal cancer or non-small-cell lung cancer: a meta-analysis. Lancet Oncol. 2006;7:741–746.
20. WHO Handbook for Reporting Results of Cancer Treatment. Geneva, Switzerland: World Health Organization; 1979. WHO Offset Publication No. 48.
21. Miller AB, Hoogstraten B, Staquet M, et al. Reporting results of cancer treatment. Cancer. 1981;47:207–214.
22. Therasse P, Arbuck SG, Eisenhauer EA, et al. New guidelines to evaluate response to treatment in solid tumors. J Natl Cancer Inst. 2000;92:205–216.
23. Young H, Baum R, Cremerius U, et al. Measurement of clinical and subclinical tumour response using [18F]-fluorodeoxyglucose and positron emission tomography: review and 1999 EORTC recommendations. Eur J Cancer. 1999;35:1773–1782.
24. Shankar LK, Hoffman JM, Bacharach S, et al. Guidelines for the use of 18F-FDG PET as an indicator of therapeutic response in patients in National Cancer Institute trials. J Nucl Med. 2006;47:1059–1066.
25. Leach MO, Brindle KM, Evelhoch JL, et al. Assessment of antiangiogenic and antivascular therapeutics using MRI: recommendations for appropriate methodology for clinical trials. Br J Radiol. 2003;76(suppl 1):S87–S91.
26. Leach MO, Brindle KM, Evelhoch J, et al. The assessment of antiangiogenic and antivascular therapies in early-stage clinical trials using magnetic resonance imaging: issues and recommendations. Br J Cancer. 2005;92:1599–1610.
27. Evelhoch J, Garwood M, Vigneron D, et al. Expanding the use of magnetic resonance in the assessment of tumor response to therapy: workshop report. Cancer Res. 2005;65:7041–7044.
28. Grillo-Lopez AJ, Cheson BD, Horning SJ, et al. Response criteria for NHL: importance of "normal" lymph node size and correlations with response rates. Ann Oncol. 2000;11:399–408.
29. Cheson BD, Horning SJ, Coiffier B, et al. Report of an international workshop to standardize response criteria for non-Hodgkin's lymphomas. J Clin Oncol. 1999;17:1244–1253.
30. Juweid ME, Wiseman GA, Vose JM, et al. Response assessment of aggressive non-Hodgkin's lymphoma by integrated international workshop criteria and fluorine-18-fluorodeoxyglucose positron emission tomography. J Clin Oncol. 2005;23:4652–4661.
31. Cheson BD, Pfistner B, Juweid ME, et al. Revised response criteria for malignant lymphoma. J Clin Oncol. 2007;25:579–586.
32. Larson SM, Schwartz LH. 18F-FDG PET as a candidate for "qualified biomarker": functional assessment of treatment response in oncology. J Nucl Med. 2006;47:901–903.
33. Galbraith SM, Lodge MA, Taylor NJ, et al. Reproducibility of dynamic contrast-enhanced MRI in human muscle and tumors: comparison of quantitative and semi-quantitative analysis. NMR Biomed. 2002;15:132–142.
34. Padhani AR, Hayes C, Landau S, et al. Reproducibility of quantitative dynamic MRI of normal human tissues. NMR Biomed. 2002;15:143–153.
35. Schwartz LH, Ginsberg MS, DeCorato D, et al. Evaluation of tumor measurements in oncology: use of film-based and electronic techniques. J Clin Oncol. 2000;18:2179–2184.
36. Zhao B, Schwartz LH, Moskowitz C, et al. Effect of CT slice thickness on measurements of pulmonary metastases—initial experience. Radiology. 2005;234:934–939.
37. Erasmus JJ, Gladish GW, Broemeling L, et al. Interobserver and intraobserver variability in measurement of non-small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol. 2003;21:2574–2582.
38. Wormanns D, Kohl G, Klotz E, et al. Volumetric measurements of pulmonary nodules at multi-row detector CT: in vivo reproducibility. Eur Radiol. 2004;14:86–92.
39. Goodman LR, Gulsun M, Washington L, et al. Inherent variability of CT lung nodule measurements in vivo using semiautomated volumetric measurements. Am J Roentgenol. 2006;186:989–994.
40. Zhao B, James LP, Moskowitz C, et al. Evaluating variability in tumor measurements from same-day repeat CT scans in patients with non-small cell lung cancer. Radiology. 2009;252:263–272.
26
Pharmacokinetic and Pharmacodynamic Monitoring in Clinical Trials: When Is It Needed?

Ticiana Leal
Jill M. Kolesar
Pharmacokinetics (PKs), also known as the dose-concentration relationship, is the mathematical description of the time course of absorption, distribution, metabolism, and excretion (ADME) of drugs and metabolites in the body (1). The biological, physiological, and physicochemical factors that influence the transfer processes of drugs in the body also influence the rate and extent of ADME of those drugs. Typically, PK parameters are determined by measuring drug concentrations in plasma or at other relevant distribution sites. Pharmacodynamics (PDs) is the study of the biochemical and physiological effects of drugs on the body, the mechanisms of drug action, and the relationship between drug concentration and effect (2). Establishing the relationship between blood concentrations and PD responses, and possible differences among population subsets, defines the concentration-response (often called PK-PD) relationship (see Figure 26.1). In many cases pharmacological action, as well as toxicological action, is related to the plasma concentration of a drug; this is one of the primary reasons the Food and Drug Administration (FDA) specifies that PK and PD studies are required as part of drug development (3).

PHASE I TRIALS

PK evaluations are critical to phase I clinical trials and are used to characterize the dose-concentration relationships for new anticancer drugs (4). The goal of
a PK study is an accurate description of the plasma concentration-time course of the drug in individual subjects. To accomplish this, most phase I studies employ an intensive and extensive blood sampling schedule. Samples are obtained frequently and, if possible, over a period of five half-lives. After samples are obtained, drug concentrations in plasma are determined by validated assay methodologies, and the data are typically analyzed by noncompartmental methods or nonlinear regression analysis. The advantage of noncompartmental PK analysis is that its parameters are model-independent (5) (see Table 26.1 for a description of PK parameters). The main assumptions of noncompartmental analysis are PK linearity and that PK parameters do not vary with time or dose. Compartmental analysis requires nonlinear regression and curve fitting and is useful for estimating PK parameters as well as for predicting plasma concentrations and for the simultaneous analysis of PK and PD data (6). PK studies have a variety of uses in phase I clinical trials (7). Some drugs may be administered as a continuous infusion to a target level. An example of this is the antisense molecule G3139 (8). Preclinical evaluations demonstrated activation of complement at levels greater than 5 µM; therefore, PKs were evaluated in clinical trials and dose escalation was discontinued before this threshold was surpassed. In addition, PK parameters may predict toxicity. Carboplatin is dosed in the clinical setting to achieve a target area under the curve
FIGURE 26.1 Pharmacokinetic and pharmacodynamic relationships.
(AUC), as this measure of exposure is associated with toxicity (9). See Table 26.2.
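Carboplatin's AUC-targeted dosing is conventionally done with the Calvert formula, dose (mg) = target AUC × (GFR + 25), and the AUC itself, along with the other quantities in Table 26.1, can be estimated noncompartmentally from a sampled concentration-time profile. A combined sketch follows (the Calvert formula is well established, but the function names and the mono-exponential test profile are our own; a real analysis would use validated PK software):

```python
import math

def carboplatin_dose_mg(target_auc, gfr_ml_min):
    """Calvert formula: dose (mg) = target AUC (mg/mL*min) x (GFR + 25 mL/min)."""
    return target_auc * (gfr_ml_min + 25.0)

def nca(times, concs, dose):
    """Minimal noncompartmental estimates from a plasma profile (IV dose, F = 1)."""
    cmax = max(concs)
    tmax = times[concs.index(cmax)]
    # AUC(0-t) by the linear trapezoidal rule
    auc_t = sum((t2 - t1) * (c1 + c2) / 2.0
                for t1, t2, c1, c2 in zip(times, times[1:], concs, concs[1:]))
    # terminal slope (lambda_z) from log-linear regression of the last 3 samples
    xs, ys = times[-3:], [math.log(c) for c in concs[-3:]]
    mx, my = sum(xs) / 3.0, sum(ys) / 3.0
    lam_z = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
             sum((x - mx) ** 2 for x in xs)
    auc_inf = auc_t + concs[-1] / lam_z      # extrapolate the terminal tail
    return {"Cmax": cmax, "Tmax": tmax, "t_half": 0.693 / lam_z,
            "AUC_inf": auc_inf, "CL": dose / auc_inf}

# Illustrative mono-exponential profile: C(t) = 10*exp(-0.1*t), 100 mg IV dose
times = [0.5, 1, 2, 4, 8, 12, 24]
concs = [10 * math.exp(-0.1 * t) for t in times]
params = nca(times, concs, dose=100.0)
print(round(params["t_half"], 1))            # ~6.9 h (lambda_z = 0.1/h)
print(carboplatin_dose_mg(5.0, 60.0))        # 425.0 mg for AUC 5, GFR 60
```

The trapezoidal AUC and log-linear tail fit mirror the definitions in Table 26.1; validated NCA software additionally handles sampling-schedule design, lambda_z point selection, and bioavailability.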
PHASE II TRIALS

PD

Dose-exposure, or PD, relationships are often explored as part of phase II clinical trials in oncology and can be used to evaluate the relationship of exposure (whether dose or concentration) to a response (e.g., nonclinical biomarkers, potentially valid surrogate end points, or short-term clinical effects). Both the magnitude of an effect and the time course of an effect are important in choosing the dose, dosing interval, and monitoring parameters (20).
Biomarkers as Selection Criteria in Clinical Trials

The characterization of genomic biomarkers is currently a routine part of research and development for new therapeutics. A recent review of 1,200 drug package inserts showed that 121 contained pharmacogenomic information (21). Biomarkers may be physiologic, pathologic, or anatomic measurements that relate to some aspect of normal or pathological biologic processes. In oncology drug development, biomarkers are typically evaluated as a way to select or target therapy to a particular population (e.g., epidermal growth factor receptor [EGFR] inhibitors may be used only in patients who are wild-type for KRAS). Selection factors are typically incorporated in phase II clinical trials as an inclusion criterion, with drug approval limited to the specific patient population (22). See Table 26.3.
In 2005, the FDA issued a guidance for the pharmaceutical industry describing the type of pharmacogenomic information that should be submitted during the drug development process (23). The FDA defines a valid biomarker as a "biomarker that is measured in an analytical test system with well established performance characteristics and for which there is an established scientific framework or body of evidence that elucidates the physiologic, toxicologic, pharmacologic, or clinical significance of the test results." The classification of biomarkers is context specific, meaning that a biomarker is validated for a specific disease and a specific drug, typically in the context of a clinical trial. For example, a number of anticancer agents, including cetuximab, erlotinib, and panitumumab, are categorized as EGFR inhibitors. Despite their shared mechanism of action, each drug has a different discussion in the package label, depending on how it was studied in clinical trials. The anticancer agents with valid biomarkers as defined by the FDA are summarized in Table 26.3.

Biomarkers as Determinants of Efficacy

Biomarkers are not acceptable surrogate end points for a determination of the efficacy of a new drug unless they have been empirically shown to function as valid indicators of clinical benefit (i.e., are valid surrogates). However, changes in biomarkers may occur more rapidly than clinical end points and may be more closely related to the time course of plasma drug concentrations, possibly with a measurable delay. For this reason, exposure-response relationships based on biomarkers can help establish the dose range for clinical trials intended to establish efficacy. An example of this is bortezomib and proteasome inhibition (24). Bortezomib is a reversible inhibitor of the chymotrypsin-like
TABLE 26.1
Pharmacokinetic Parameters.

Concentration maximum (Cmax)
Definition: The highest concentration of the drug.
Equation: Cmax = (F·D/V)·e^(−λz·tmax)

Time of maximum concentration (Tmax)
Definition: The time at which the concentration is highest.
Equation: Tmax = time point at Cmax

Half-life (t1/2)
Definition: The time required for the drug concentration to decrease by 50%.
Equation: t1/2 = 0.693·V/CL

Area under the curve (AUC)
Definition: The drug concentration integrated over a period of time.
Equation: AUC∞ = AUC(0−t) + Cn/λz, where Cn is the last measured concentration

Clearance (CL)
Definition: The volume of plasma cleared of drug per unit of time.
Equation: CL = F·D/AUC(0−∞)

Volume of distribution (V)
Definition: The apparent volume (the "tank") into which the drug distributes.
Equation: V = F·D/(λz·AUC(0−t))

Bioavailability (F)
Definition: For an oral drug, the fraction of the dose absorbed.
Equation: Determined experimentally.

*D = dose administered.
activity of the 26S proteasome in mammalian cells. Clinical trials of bortezomib evaluated proteasome inhibition, demonstrating maximum inhibition of approximately 80% within 5 minutes of dose administration. This information was used as part of dose escalation decisions and as a measure of efficacy. Although response rates and survival were integral to the approval of the drug, proteasome inhibition was also critical information demonstrating the efficacy of the drug.

Biomarkers and PK Relationships

Dose-concentration relationships may contribute to the approval of different doses, dosing regimens, or dosage forms, or to the use of a drug in different populations, when effectiveness is already well established in other settings and the study demonstrates a similar PD relationship (25). In some cases, measurement of systemic exposure levels (e.g., plasma drug concentrations) as part of dose-response studies can provide additional useful information. Systemic exposure data are especially useful when the assigned dose is poorly correlated with plasma concentrations, obscuring an existing concentration-response relationship. This can occur when there is a
large degree of interindividual variability in PKs or a nonlinear relationship between dose and plasma drug concentrations. Blood concentrations can also be helpful when (1) both the parent drug and its metabolites are active, (2) different exposure measures (e.g., Cmax, AUC) provide different relationships between exposure and efficacy or safety, (3) the number of fixed doses in the dose-response studies is limited, and (4) responses are highly variable and it is helpful to explore the underlying causes of that variability.

PHASE III TRIALS

Population PKs

Population PK analysis is used to evaluate the sources and correlates of variability in drug concentrations among individuals in the target patient population receiving clinically relevant doses of a drug of interest, and such analyses are typically conducted as part of phase III clinical trials (26). Patient-specific factors, including ethnicity, renal and hepatic function, interacting medications, body weight, and gender, can change dose-concentration relationships. Population PKs may be able to identify the factors that cause changes in the dose-concentration relationship and whether these changes have a clinical impact.
254
ONCOLOGY CLINICAL TRIALS
TABLE 26.2
Examples of Chemotherapy Agents Demonstrating a Link between Pharmacokinetics and Pharmacodynamics.

CHEMOTHERAPY | MEASURE OF EXPOSURE | DYNAMIC EFFECT | SETTING
Busulfan | AUC* | Hepatotoxicity | Bone marrow transplantation (10)
Carboplatin | AUC | Thrombocytopenia | Solid tumors (11)
Cisplatin | Cmax | Nephrotoxicity | Solid tumors (12)
Docetaxel | AUC | Myelosuppression, efficacy | NSCLC (13)
Etoposide | Duration above target concentration | Myelosuppression | Pediatric solid tumors (14)
5-Fluorouracil | AUC | Survival | Head and neck tumors (15)
Irinotecan | AUC | Diarrhea | Solid tumors (16)
Methotrexate | Css** | Efficacy | Acute lymphocytic leukemia (17)
Paclitaxel | Duration above target concentration | Myelosuppression | Ovarian and breast cancer (18)
Topotecan | Css | Myelosuppression | Pediatric tumors (19)

*AUC, area under the concentration-time curve.
**Css, steady-state concentration.
A key difference between the standard PK approach and population PKs is the treatment of interindividual variability. In the standard approach, variability is a factor to be minimized via study design or strict inclusion criteria. The population PK approach embraces and attempts to explain interindividual variability by collecting PK information in the target population likely to be treated with the drug, identifying and measuring variability, and correlating that variability with patient-specific factors. Population PK data can often be used to modify dosing recommendations. One example of this in oncology is docetaxel, for which Bruno et al. (27) demonstrated that individuals with mild to moderate liver function impairment (SGOT and/or SGPT > 1.5 × ULN concomitant with alkaline phosphatase > 2.5 × ULN) had a decrease in docetaxel clearance by an average of 27%, with an increased risk of toxicity. This information was used to provide recommendations for dose reduction of docetaxel in hepatic impairment.

PHASE IV TRIALS

Special Populations

Renal Impairment

Most drugs are cleared by elimination of unchanged drug by the kidney and/or by metabolism in the liver. For a drug eliminated primarily via renal excretory
mechanisms, impaired renal function may alter its PKs and PDs to an extent that the dosage regimen needs to be changed from that used in patients with normal renal function (28). Although the most obvious change arising from renal impairment is a decrease in the renal excretion, or possibly the renal metabolism, of a drug or its metabolites, renal impairment has also been associated with other changes, such as changes in absorption, hepatic metabolism, plasma protein binding, and drug distribution. These changes may be particularly prominent in patients with severely impaired renal function and have been observed even when the renal route is not the primary route of elimination of a drug. Thus, for most drugs that are likely to be administered to patients with renal impairment, PK characterization should be assessed in such patients to provide rational dosing recommendations.

Hepatic Impairment

The liver is involved in the clearance of many drugs through a variety of oxidative and conjugative metabolic pathways and/or through biliary excretion of unchanged drug or metabolites (29, 30). Alterations of these excretory and metabolic activities by hepatic impairment can lead to drug accumulation or (less often) failure to form an active metabolite.
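As a small illustration, the docetaxel liver-function criteria of Bruno et al. (27) quoted earlier can be encoded as a screening check (a sketch only; the function name is our own, and actual dose modifications belong to the protocol and label):

```python
def docetaxel_hepatic_impairment(sgot_x_uln, sgpt_x_uln, alk_phos_x_uln):
    """True when SGOT and/or SGPT > 1.5 x ULN concomitant with alkaline
    phosphatase > 2.5 x ULN (criteria associated with ~27% lower docetaxel
    clearance in the population PK analysis cited in the text)."""
    transaminase_high = sgot_x_uln > 1.5 or sgpt_x_uln > 1.5
    return transaminase_high and alk_phos_x_uln > 2.5

print(docetaxel_hepatic_impairment(2.0, 1.0, 3.0))  # True -> consider dose reduction
```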
TABLE 26.3
Oncology Drugs with Valid Genomic Biomarkers. BIOMARKER
DRUG
APPROVED LABEL CONTENT
Biomarkers for Selection of Therapy c-kit
Imatinib
Chromosome 5 deletion
Lenalidomide
EGFR expression
Erlotinib, Panitumomab, Gefitinib Cetuximab
Her-2neu
BIOMARKER | DRUG | APPROVED LABEL CONTENT
Kit (CD117) | Imatinib (Gleevec) | Gleevec is also indicated for the treatment of patients with Kit (CD117) positive unresectable and/or metastatic malignant gastrointestinal stromal tumors (GIST).
Deletion 5q | Lenalidomide | Lenalidomide is indicated for the treatment of patients with transfusion-dependent anemia due to low- or intermediate-1-risk myelodysplastic syndromes associated with a deletion 5q cytogenetic abnormality with or without additional cytogenetic abnormalities.
EGFR expression | Erlotinib | EGFR expression was determined using the EGFR pharmDx kit. In contrast to the 1% cutoff specified in the pharmDx kit instructions, a positive EGFR expression status was defined as having at least 10% of cells staining for EGFR. The pharmDx kit has not been validated for use in pancreatic cancer. An apparently larger effect, however, was observed in two subsets: patients with EGFR-positive tumors (HR = 0.68) and patients who never smoked (HR = 0.42). Analysis of the impact of EGFR expression status on the treatment effect on clinical outcome is limited because EGFR status is known for 326 NSCLC study patients (45%).
EGFR expression | Cetuximab (colon cancer) | Patients enrolled in the clinical studies were required to have immunohistochemical evidence of positive EGFR expression using the DakoCytomation EGFR pharmDx test kit.
HER2 overexpression | Trastuzumab, Lapatinib | Detection of HER2 protein overexpression is necessary for selection of patients appropriate for HERCEPTIN therapy.
Philadelphia chromosome | Busulfan, Dasatinib | Busulfan is clearly less effective in patients with chronic myelogenous leukemia who lack the Philadelphia (Ph1) chromosome. Dasatinib is indicated for the treatment of adults with Philadelphia chromosome-positive acute lymphoblastic leukemia (Ph+ ALL) with resistance or intolerance to prior therapy.
PML/RAR fusion gene | Tretinoic acid | Initiation of therapy with tretinoic acid may be based on the morphological diagnosis of acute promyelocytic leukemia (APL). Confirmation of the diagnosis of APL should be sought by detection of the t(15;17) genetic marker by cytogenetic studies. If these are negative, PML/RAR (alpha) fusion should be sought using molecular diagnostic techniques. The response rate of other AML subtypes to tretinoic acid has not been demonstrated; therefore, patients who lack the genetic marker should be considered for alternative treatment.
Biomarkers for Preventing Toxicity
TPMT | Azathioprine, Mercaptopurine | Patients with thiopurine methyltransferase (TPMT) deficiency or lower activity due to mutation are at increased risk of myelotoxicity. TPMT testing is recommended, and consideration should be given to either genotyping or phenotyping patients for TPMT.
UGT1A1 | Irinotecan | Individuals who are homozygous for the UGT1A1*28 allele are at increased risk for neutropenia following initiation of Camptosar treatment. A reduced initial dose should be considered for patients known to be homozygous for the UGT1A1*28 allele. Heterozygous patients may be at increased risk of neutropenia; however, clinical results have been variable and such patients have been shown to tolerate normal starting doses. (Continued)
ONCOLOGY CLINICAL TRIALS
TABLE 26.3 (Continued)
BIOMARKER | DRUG | APPROVED LABEL CONTENT
DPD deficiency | 5-FU, Capecitabine | Rarely, unexpected, severe toxicity (e.g., stomatitis, diarrhea, neutropenia, and neurotoxicity) associated with 5-fluorouracil has been attributed to a deficiency of dihydropyrimidine dehydrogenase (DPD) activity. A link between decreased levels of DPD and increased, potentially fatal toxic effects of 5-fluorouracil therefore cannot be excluded.
Hepatic disease can alter the absorption and disposition of drugs (PKs) as well as their efficacy and safety (PDs). Even though clinically useful measures of hepatic function to predict drug PKs and PDs are not generally available, clinical studies in patients with hepatic impairment, usually performed during drug development, can provide information that may help guide initial dosing in patients (31). PK studies are recommended if hepatic metabolism and/or excretion accounts for a substantial portion (>20%) of the elimination of a drug or active metabolite, or if it accounts for a lesser amount (<20%) but the drug has a narrow therapeutic window.

Pregnancy/Lactation

As women delay childbearing, it is increasingly common for cancer to be diagnosed during pregnancy. The diagnostic and therapeutic management of a pregnant patient with cancer is especially difficult because it involves two individuals: the mother and the fetus (32). Few PK studies have been done in pregnant women receiving chemotherapy, and pregnant women are routinely excluded from clinical trials. When it is necessary to treat a pregnant woman, she receives weight-based doses similar to those given to nonpregnant women, adjusted for her continuing weight gain. The increased blood volume (by almost 50%) and increased renal clearance of pregnancy might decrease active drug concentrations compared with nonpregnant women of the same weight; increased drug clearance from the body can lead to reduced drug exposure and effect. A faster hepatic mixed-function oxidase system might also lower drug concentrations, and changes in gastrointestinal function can affect drug absorption. The volume of distribution, peak drug concentration, and half-life of administered drugs are also sometimes changed during pregnancy. Plasma albumin decreases, increasing the amount of unbound active drug; however, estrogen increases other plasma proteins, which might decrease active drug fractions. Additional considerations include the ability of the drug to pass the placenta.

Pediatrics

In the pediatric population, growth and development may alter the PKs of a drug compared with the adult population. Of note, the PKs in adolescents >16 years of age weighing >50 kg are very similar to those in adults. Developmental changes in the pediatric population that can affect absorption include effects on gastric acidity, rates of gastric and intestinal emptying, surface area of the absorption site, gastrointestinal enzyme systems for drugs that are actively transported across the gastrointestinal mucosa, gastrointestinal permeability, and biliary function. Distribution of a drug may be affected by changes in body composition as well as differences in protein binding. Studies on the metabolism of specific drugs in children are limited. In general, it is extrapolated that children will form the same metabolites as adults via pathways such as oxidation, reduction, hydrolysis, and conjugation, but rates of metabolite formation can differ. Similarly, drug excretion can vary according to developmental stage, especially when a drug is predominantly excreted via renal mechanisms. In addition, there are laws that regulate clinical studies and the use of anticancer drugs in pediatric oncology. The Pediatric Research Equity Act of 2003 provides legislative authority for the FDA to require companies to perform pediatric testing of drugs and biologics.

Geriatrics

Usually, there are no significant differences in PKs for cytotoxic agents based on age alone. A progressive decrease in physiologic reserve that affects each individual at a unique pace accompanies aging (16, 17). Both cancer and its treatment can be considered physiologic stressors, and the age-related decrease in physiologic reserve can affect tolerance to cancer treatment.
26 PHARMACOKINETIC AND PHARMACODYNAMIC MONITORING IN CLINICAL TRIALS
A number of age-related changes in drug ADME can contribute to differences in treatment tolerance between older and younger patients. The absorption of drugs can be affected by decreased gastrointestinal motility, decreased splanchnic blood flow, decreased secretion of digestive enzymes, and mucosal atrophy (33). As a person ages, body composition changes, with an increase in body fat and a decrease in lean body mass and total body water. The increase in body fat leads to a rise in the volume of distribution for lipid-soluble drugs and a decrease in the volume of distribution for hydrophilic drugs. In the cancer population, malnutrition and hypoalbuminemia can result in an increased unbound concentration of drugs that are albumin-bound (34). Over a lifespan, renal mass decreases by approximately 25% to 30%, and renal blood flow decreases by 1% per year after age 50 (35). The decline in glomerular filtration rate (GFR) with age is estimated at 0.75 mL/min per year after age 40; however, approximately one-third of patients have no change in creatinine clearance with age. This reduced renal function does not usually result in increased serum creatinine levels because of the simultaneous loss of muscle mass. Therefore, serum creatinine is not an adequate indicator of renal function in the older patient. The decline in GFR with age translates into PK alterations of drugs, or their active metabolites, that are excreted by the kidneys. Because of the physiologic decline in renal function with age, chemotherapy agents that are primarily excreted renally must be dosed with caution in older patients. Despite increased drug exposure with age, older patients in some studies did not have adverse outcomes with standard doses. Studies with docetaxel likewise did not find clinically relevant changes in PKs in the elderly (36). Even drugs whose PKs do not differ between elderly and younger patients can have different toxicities.
Myelosuppression is an important concern in the elderly; with fluoropyrimidines, mucositis can also be severe in older patients. These effects can be difficult to predict. The increased use of hematopoietic growth factors has led to a shift in the toxicity profile: the dose-limiting toxicity of many regimens has shifted to nonhematologic toxicity, particularly neuropathy and gastrointestinal toxicity, which remain significant problems for older patients (37). In conclusion, there are few reasons to modify dosing based purely on age. If an elderly patient has an adequate performance status as well as adequate renal and hepatic function, the same dose that would be used in a younger patient should be used (38).
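The mismatch between serum creatinine and true renal function described above can be illustrated numerically with the Cockcroft-Gault estimate of creatinine clearance. The sketch below is for illustration only, not dosing guidance, and the two patients are hypothetical:

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female=False):
    """Cockcroft-Gault estimate of creatinine clearance in mL/min."""
    crcl = ((140 - age_years) * weight_kg) / (72.0 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

# Two hypothetical women with the SAME serum creatinine (1.0 mg/dL):
young = cockcroft_gault(age_years=40, weight_kg=70, serum_creatinine_mg_dl=1.0, female=True)
old = cockcroft_gault(age_years=80, weight_kg=60, serum_creatinine_mg_dl=1.0, female=True)

# Identical laboratory values, but the estimated clearance roughly halves
# (about 83 vs. 42 mL/min), because muscle mass also falls with age.
print(f"{young:.0f} mL/min vs {old:.0f} mL/min")
```

An unremarkable serum creatinine can thus conceal a near-halving of estimated renal drug clearance, which is why serum creatinine alone is not an adequate indicator of renal function in the older patient.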
Food Effect

Oral chemotherapy agents are a relatively new dosage form in cancer therapy. Food-effect bioavailability studies are usually conducted for new oral drugs and drug products during the IND period to assess the effects of food on the rate and extent of absorption of a drug when the drug product is administered shortly after a meal (fed conditions), as compared with administration under fasting conditions. Food effects on bioavailability are generally greatest when the drug product is administered shortly after a meal is ingested. The nutrient and caloric contents of the meal, the meal volume, and the meal temperature can cause physiological changes in the GI tract that affect drug product transit time, luminal dissolution, drug permeability, and systemic availability. For example, lapatinib, an oral EGFR/HER2-neu tyrosine kinase inhibitor, was approved based on the results of a phase III trial in metastatic breast cancer. The label recommends that it be taken as five 250-mg tablets in the fasting state. More recent studies demonstrated that the bioavailability of lapatinib is greatly increased by food, especially a high-fat meal. A randomized, crossover, food-effect study demonstrated that both peak concentration and AUC increased markedly when a single 1,500-mg dose of lapatinib was taken with food as opposed to fasting, and increased further with a high-fat meal (39).
Drug-Drug Interaction

In vitro studies can frequently serve as a screening mechanism to rule out the importance of a metabolic pathway, and of the drug-drug interactions that occur through this pathway, so that subsequent in vivo testing is unnecessary. Such screening should be based on appropriately validated experimental methods and rational selection of substrate/interacting drug concentrations. In addition to in vitro metabolism and drug-drug interaction studies, appropriately designed PK studies, usually performed in the early phases of drug development, can provide important information about metabolic routes of elimination, their contribution to overall elimination, and metabolic drug-drug interactions. Not every drug-drug interaction is metabolism based; interactions may also arise from changes in PKs caused by absorption, distribution, and excretion interactions. Drug-drug interactions related to transporters are being documented with increasing frequency and are important to consider in drug development. Although less well studied, drug-drug interactions may also alter PK/PD relationships.
Many metabolic routes of elimination, including most of those occurring through the P450 family of enzymes, can be inhibited or induced by concomitant drug treatment. Observed changes arising from metabolic drug-drug interactions can be substantial (an order of magnitude or more decrease or increase in the blood and tissue concentrations of a drug or metabolite) and can include formation of toxic and/or active metabolites or increased exposure to a toxic parent compound. These large changes in exposure can alter the safety and efficacy profile of a drug and/or its active metabolites in important ways. This is most obvious and expected for a drug with a narrow therapeutic range (NTR), but is also possible for non-NTR drugs. For example, CYP3A4 was determined to be the primary enzyme responsible for the oxidative metabolism of temsirolimus in humans; potential drug-drug interactions can therefore occur with drugs that are metabolized by or are substrates of CYP3A4 (40). It is important that metabolic drug-drug interaction studies explore whether an investigational agent is likely to significantly affect the metabolic elimination of drugs that are routinely taken concomitantly and, conversely, whether these drugs are likely to affect the metabolic elimination of the investigational drug. One example is the drug-drug interaction that occurs in patients taking tamoxifen and selective serotonin reuptake inhibitors (SSRIs), which are often prescribed to alleviate tamoxifen-associated hot flashes. In a prospective clinical trial, coadministration of tamoxifen and the SSRI paroxetine (an inhibitor of CYP2D6) led to decreased plasma concentrations of endoxifen, an active metabolite of tamoxifen. Another interesting finding in this study was that women with differing CYP2D6 genotypes had significant metabolic differences. Endoxifen concentrations decreased by 64% (95% CI = 39% to 89%) in women with a wild-type CYP2D6 genotype, but by only 24% (95% CI = 23% to 71%) in women with a variant CYP2D6 genotype (P = .03) (41). In cases where it is necessary to administer an interacting drug, it is crucial to have an adequate scientific background in order to adjust the dosage regimen and avoid significant resultant toxicity.

Conclusions

The ultimate goal of PK and PD monitoring is to maximize the likelihood of response in the setting of minimal drug-related toxicity. Traditional practice has been based on administering agents at the maximum tolerated dose, and it is less clear whether this is the optimal dose, especially in the era of targeted agents.
References

1. Wagner JG. Pharmacokinetics. Annu Rev Pharmacol. 1968;8:67–94.
2. Kolbye AC Jr. Cancer in humans: exposures and responses in a real world. Oncology. 1976;33(2):90–100.
3. Center for drug evaluations: www.fda.gov. Accessed Oct 22, 2008.
4. Workman P. Pharmacokinetics and cancer: successes, failures and future prospects. Cancer Surv. 1993;17:1–26.
5. Gillespie WR. Noncompartmental versus compartmental modelling in clinical pharmacokinetics. Clin Pharmacokinet. 1991;20(4):253–262.
6. Ranson MR, Scarffe JH. Population and Bayesian pharmacokinetics in oncology. Clin Oncol (R Coll Radiol). 1994;6(4):254–260.
7. Graham MA, Kaye SB. New approaches in preclinical and clinical pharmacokinetics. Cancer Surv. 1993;17:27–49.
8. Liu G, Kolesar J, McNeel DG, et al. A phase I pharmacokinetic and pharmacodynamic correlative study of the antisense Bcl-2 oligonucleotide G3139, in combination with carboplatin and paclitaxel, in patients with advanced solid tumors. Clin Cancer Res. 2008;14(9):2732–2739.
9. Calvert H, Judson I, van der Vijgh WJ. Platinum complexes in cancer medicine: pharmacokinetics and pharmacodynamics in relation to toxicity and therapeutic activity. Cancer Surv. 1993;17:189–217.
10. Grochow LB, Jones RJ, Brundrett RB, et al. Pharmacokinetics of busulfan: correlation with veno-occlusive disease in patients undergoing bone marrow transplantation. Cancer Chemother Pharmacol. 1989;25:55–61.
11. Egorin MJ, Van Echo DA, Tipping SJ, et al. Pharmacokinetics and dosage reduction of cis-diammine(1,1-cyclobutanedicarboxylato)platinum in patients with impaired renal function. Cancer Res. 1984;44:5432–5438.
12. Reece PA, Stafford I, Russel J, et al. Creatinine clearance as a predictor of ultrafilterable platinum disposition in cancer patients treated with cisplatin: relationship between peak ultrafilterable platinum plasma levels and nephrotoxicity. J Clin Oncol. 1987;5:304–309.
13. Bruno R, Hille D, Riva A, et al. Population pharmacokinetics/pharmacodynamics of docetaxel in phase II studies in patients with cancer. J Clin Oncol. 1998;16:187–196.
14. Sonnichsen DS, Ribeiro RC, Luo X, et al. Pharmacokinetics and pharmacodynamics of 21-day continuous oral etoposide in pediatric patients with solid tumors. Clin Pharmacol Ther. 1995;58:99–107.
15. Milano G, Etienne MC, Renee N, et al. Relationship between fluorouracil systemic exposure and tumor response and patient survival. J Clin Oncol. 1994;12:1291–1295.
16. Xie R, Mathijssen RH, Sparreboom A, et al. Clinical pharmacokinetics of irinotecan and its metabolites in relation with diarrhea. Clin Pharmacol Ther. 2002;72:265–275.
17. Evans WE, Crom WR, Abromowitch M, et al. Clinical pharmacodynamics of high-dose methotrexate in acute lymphocytic leukemia: identification of a relation between concentration and effect. N Engl J Med. 1986;314:471–477.
18. Gianni L, Kearns CM, Giani A, et al. Nonlinear pharmacokinetics and metabolism of paclitaxel and its pharmacokinetic/pharmacodynamic relationships in humans. J Clin Oncol. 1995;13:180–190.
19. Stewart CF, Baker SD, Heideman RL, et al. Clinical pharmacodynamics of continuous infusion topotecan in children: systemic exposure predicts hematologic toxicity. J Clin Oncol. 1994;12:1946–1954.
20. Kobayashi K, Jodrell D, Ratain M. Pharmacodynamic-pharmacokinetic relationships and therapeutic drug monitoring. Cancer Surv. 1993;17:1–26.
21. Frueh FW, Amur S, Mummaneni P, et al. Pharmacogenomic biomarker information in drug labels approved by the United
States Food and Drug Administration: prevalence of related drug use. Pharmacotherapy. 2008;28(8):992–998.
22. Table of Valid Biomarkers. http://www.fda.gov/cder/genomics/genomic_biomarkers_table.htm. Accessed Oct 23, 2008.
23. E15 Definitions for Genomic Biomarkers, Pharmacogenomics, Pharmacogenetics, Genomic Data and Sample Coding Categories. http://www.fda.gov/cder/guidance/index.htm. Accessed Oct 25, 2008.
24. Utecht KN, Kolesar J. Bortezomib: a novel chemotherapeutic agent for hematologic malignancies. Am J Health Syst Pharm. 2008;65(13):1221–1231.
25. FDA guidances: exposure-dose relationship, PK, PD, special populations. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM072109.pdf. Accessed October 8, 2009.
26. Vozeh S, Steimer JL, Rowland M, et al. The use of population pharmacokinetics in drug development. Clin Pharmacokinet. 1996;30(2):81–93.
27. Bruno R, Vivier N, Veyrat-Follet C, Montay G, Rhodes GR. Population pharmacokinetics and pharmacokinetic-pharmacodynamic relationships for docetaxel. Invest New Drugs. 2001;19(2):163–169.
28. Fabre J, Balant L. Renal failure, drug pharmacokinetics and drug action. Clin Pharmacokinet. 1976;1(2):99–120.
29. Rowland M, Benet LZ, Graham GG. Clearance concepts in pharmacokinetics. J Pharmacokinet Biopharm. 1973;1(2):123–136.
30. Levy RH, Bauer LA. Basic pharmacokinetics. Ther Drug Monit. 1986;8(1):47–58.
31. Cohen A. Pharmacokinetic and pharmacodynamic data to be derived from early-phase drug development: designing informative human pharmacology studies. Clin Pharmacokinet. 2008;47(6):373–381.
32. Pentheroudakis G. Cancer and pregnancy. Ann Oncol. 2008;19(suppl 5):v38–v39.
33. Hurria A. Clinical trials in older adults with cancer: past and future. Oncology (Williston Park). 2007;21(3):351–358; discussion 363–364, 367.
34. Green JM, Hacker ED. Chemotherapy in the geriatric population. Clin J Oncol Nurs. 2004;8(6):591–597.
35. Vincenzi B, Santini D, Spoto S, et al. The antineoplastic treatment in the elderly. Clin Ter. 2002;153(3):207–215.
36. Kloft C, Wallin J, Henningsson A, Chatelut E, Karlsson MO. Population pharmacokinetic-pharmacodynamic model for neutropenia with patient subgroup identification: comparison across anticancer drugs. Clin Cancer Res. 2006;12(18):5481–5490.
37. Lichtman SM. Pharmacokinetics and pharmacodynamics in the elderly. Clin Adv Hematol Oncol. 2007;5(3):181–182.
38. Hurria A, Lichtman SM. Pharmacokinetics of chemotherapy in the older patient. Cancer Control. 2007;14:32–43.
39. Ratain MJ, Cohen EE. The value meal: how to save $1,700 per month or more on lapatinib. J Clin Oncol. 2007;25:3397–3398.
40. Malizzia LJ, Hsu A. Temsirolimus, an mTOR inhibitor for treatment of patients with advanced renal cell carcinoma. Clin J Oncol Nurs. 2008;12(4):639–646.
41. Stearns V, Johnson MD, Rae JM, et al. Active tamoxifen metabolite plasma concentrations after coadministration of tamoxifen and the selective serotonin reuptake inhibitor paroxetine. J Natl Cancer Inst. 2003;95(23):1758–1764.
27
Practical Design and Analysis Issues of Health Related Quality of Life Studies in International Randomized Controlled Cancer Clinical Trials Andrew Bottomley Corneel Coens Murielle Mauer
While many clinicians believe that saving lives is the priority, there is increasing recognition that in certain cases it is not the length of survival gained but the quality of the survivorship that is critical to the patient and those close to the patient. New treatments often offer only small incremental survival gains, sometimes as little as weeks or perhaps a month or two, so the question remains: how does the patient value the extra survival gained? Frequently survival comes at the cost of toxic drugs/therapy and/or surgery, which can leave the patient with feelings of sadness and anxiety, possibly disfigured after some surgeries (e.g., mastectomy), and sometimes inconsolable. For decades, clinicians aware of health-related quality of life (HRQOL) have asked patients informally, "How do you feel?" This technique, which may help guide treatment selection, is now formalized and reframed as health-related quality of life assessment. This chapter explains how to design an HRQOL randomized clinical trial (RCT) and how to analyze and interpret the results, with a focus on the most commonly used tool in oncology, the European Organization for Research and Treatment of Cancer (EORTC) QLQ-C30 (1), along with other measurement tools. Some examples of effective trials, together with inappropriate studies, are presented, outlining lessons learned from the early RCTs of HRQOL research. Hopefully, this chapter will encourage researchers to include HRQOL end points in future trials.
DESIGNING TRIALS WITH HRQOL

HRQOL application ranges from individual cancer care planning and patient monitoring to population-wide surveillance programs. The primary research area remains within RCTs, where different interventions are directly compared in terms of patient benefit. Implementing HRQOL into clinical trials is very similar to the methodology used for classical clinical end points such as survival and disease control. However, some specific issues arise, mainly due to the inherent nature of patient involvement. Any RCT must have a well-defined purpose, in whole and in part. The HRQOL element must be based on a sound and detailed research objective. All too often, HRQOL is added to a clinical trial without a clear hypothesis to justify its inclusion. Vague statements such as "The objective is to measure HRQOL over time" or "We are interested in evaluating changes in HRQOL" are totally inadequate. A well-formulated objective includes specification of the research question to be answered; selection of the relevant HRQOL domains; the time period of interest; and the direction, magnitude, and duration of the expected difference. The objective of the RCT will affect all stages of the further design, analysis, and reporting (see Table 27.1). If no clear objective can be specified (e.g., due to lack of data or poor design of the protocol), all resulting
TABLE 27.1
Key Issues in Designing a Trial
Objectives: research question; HRQOL domains; timing; expected differences
Instrument selection: method of administration; data scoring; interpretation
Trial schedule: assessment plan; time windows
Analysis: compliance and missingness mechanism; main analysis; sensitivity analyses
Reporting: compliance; main results; reliability (missing data and multiple testing); validity (clinical significance); context
analyses will be exploratory in nature. In this case, an overall description of HRQOL in the study population can be established for future hypothesis-generating initiatives. Any exploratory analysis is, however, intrinsically limited and needs to be confirmed in an independent setting to be conclusive. Once the objective has been established, an appropriate and validated instrument is selected (2). Most HRQOL instruments consist of a simple questionnaire. While it is often tempting for the investigator to construct a questionnaire specifically tailored to the study in hand, this is by far the worst choice possible. A good HRQOL instrument needs to be tested for reliability and validity in order to be useful for clinical application. Many tools exist, designed for generic, cancer-specific, cancer-site-specific, or even cancer-symptom-specific application. When choosing, keep in mind the method of administration and the intended RCT population. In elderly patients, shorter questionnaires (e.g., the EORTC measure for the elderly) may be preferred in order to reduce noncompliance. When setting up an international study, verify whether validated translations exist. Make sure that the information needed can be derived from the questionnaire. For example, if pain is going to be a major issue, the instrument should have questions related to pain and its impact, such as the Brief Pain Inventory.
The method of data gathering and scoring should be clear for the instrument (i.e., it should be clear how patient answers translate into the numbers used in the analysis). Finally, there is interpretation of the results. Ideally, a minimal clinically important difference (MCID) should be defined for the tool. This is the smallest difference in outcome that is considered to be clinically relevant. Such a difference is instrument specific and cannot simply be carried over from one instrument to another. One might in this respect prefer to choose a questionnaire that is already widely used in the research area in order to allow easier comparison with other RCTs. When using the EORTC QLQ-C30, for example, a change of at least 10 points in an HRQOL parameter is considered to be of minimal clinical importance to patients (3). The next step in the design is the choice of time schedule for the RCT. Individual patient HRQOL usually changes over time; the interest lies in evaluating whether such evolution differs between treatment arms. Assessment times should be chosen to coincide with the objective as well as to minimize the total patient burden. It is unrealistic to expect all patients to complete their questionnaire on a specific day. Therefore, time windows (i.e., extended periods of time) are often used to construct an assessment schedule rather than fixed time points. Upper and lower limits around a target date should be defined so that HRQOL data obtained within the corresponding time window are considered valid assessments. Preferably this should be done a priori. Attention must be paid to minimizing bias when choosing the time windows by avoiding differences between the treatment arms, overlapping time windows, or time windows that include treatment interventions (e.g., surgery, start of a chemotherapy cycle). Finally, the instrument recall time should be taken into account.
Instruments can gauge patient HRQOL at a specific time point or during a determined period (e.g., the QLQ-C30 has a recall time of two weeks). Always exercise care to avoid overlapping recall windows or the occurrence of major interventions at the edge of the recall window. A final design step is checking whether the RCT sample size is sufficient to answer the objective. If HRQOL is only a secondary end point, the study sample size still needs to justify the HRQOL research. Sample size evaluation should take into account rates of attrition, rates of noncompliance, and the ability to detect clinically relevant differences.
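The inflation for attrition and noncompliance mentioned above can be sketched numerically. The following is a minimal illustration for a two-arm comparison of mean scores; the 10-point difference, 20-point standard deviation, and attrition/noncompliance rates are illustrative assumptions, not recommendations:

```python
import math

def n_per_arm(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Approximate per-arm sample size for a two-sided, two-sample comparison
    of means (defaults correspond to alpha = 0.05 and 80% power)."""
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def inflate(n, attrition=0.0, noncompliance=0.0):
    """Inflate n so that enough evaluable patients remain after
    attrition and noncompliance."""
    return math.ceil(n / ((1 - attrition) * (1 - noncompliance)))

# Illustration: a 10-point difference on a 0-100 scale, assumed SD of 20.
base = n_per_arm(delta=10, sd=20)                       # evaluable patients per arm
total = inflate(base, attrition=0.15, noncompliance=0.10)  # patients to randomize per arm
print(base, total)
```

The required number of evaluable patients depends strongly on the assumed standard deviation; the point of the sketch is only that the randomized total must exceed the evaluable total by the product of the expected retention and compliance rates.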
ANALYSIS OF TRIALS WITH HRQOL

Before starting the HRQOL analysis, its feasibility needs to be assessed by checking the amount of available data, the compliance rates, and the overall design.
Should too few data be available (fewer than the roughly 50 valid patients with HRQOL data per treatment arm needed to detect a 10-point difference), should the compliance rate be too low (<50% [4]), or should the trial design not permit an unbiased comparison, then no analyses should be performed. HRQOL compliance for a group of patients at any given time point is defined as the number of valid forms received at that time point divided by the number of forms expected at that time point. HRQOL forms are expected for each patient within the HRQOL assessment schedule (e.g., patients on follow-up who are alive or progression-free, when the protocol specifies collecting HRQOL up to death or progression). Comparative tests can be used to determine whether there are significant differences in compliance between treatment arms that could bias treatment comparison results. HRQOL instruments make use of different scoring systems. For the EORTC QLQ-C30, scoring generally consists of grouping and transforming the individual items of a questionnaire into single 0–100 ranged outcomes (standardized scores) via linear transformation (e.g., weighted averaging). The scoring for the EORTC QLQ-C30 (1) questionnaire and any of its supplementary EORTC QLQ modules is detailed in the EORTC QLQ-C30 scoring manual. In general, for functional scales, higher scores represent a higher level of functioning, which is better for the patient; for symptom scales, higher scores represent a higher degree of symptom burden, which is worse for the patient. Even if a completed HRQOL form is returned, some questions might have been left blank or without a valid answer. Therefore, the scale compliance for each specific scale to be analyzed (defined as the ratio of the number of valid HRQOL forms for which the scale could successfully be scored to the total number of valid forms) is presented. Based on the compliance results, only data from valid HRQOL forms should be included in the analysis.
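The linear transformation and the two compliance ratios described above can be sketched as follows. The sketch follows the general scheme of the EORTC QLQ-C30 scoring manual (raw score as the item mean, then rescaling to 0–100, with a scale scorable when at least half of its items are answered); the item values and forms below are hypothetical.

```python
def raw_score(items):
    """Mean of the answered items; a scale is scorable only if at least
    half of its items were answered (None marks a missing answer)."""
    answered = [i for i in items if i is not None]
    if len(answered) * 2 < len(items):
        return None  # scale cannot be scored for this form
    return sum(answered) / len(answered)

def functional_score(items, item_range=3):
    """0-100 functional scale: higher = better functioning."""
    rs = raw_score(items)
    return None if rs is None else (1 - (rs - 1) / item_range) * 100

def symptom_score(items, item_range=3):
    """0-100 symptom scale: higher = more symptom burden (worse)."""
    rs = raw_score(items)
    return None if rs is None else ((rs - 1) / item_range) * 100

# Hypothetical 2-item symptom scale with 1-4 response categories:
print(symptom_score([2, 4]))  # raw score 3.0 -> about 66.7 on the 0-100 scale

# Scale compliance: scorable forms over valid forms received.
forms = [[2, 4], [1, None], [None, None], [3, 3]]  # [1, None] is still scorable
scored = [f for f in forms if symptom_score(f) is not None]
print(len(scored) / len(forms))  # 0.75
```

The same ratio logic gives overall compliance when applied to forms received versus forms expected at a time point.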
Analyses should preferably be done on all patients with at least one valid HRQOL form according to the intention-to-treat (ITT) principle, unless this is not feasible. If compliance at a specific time point is too problematic, dropping certain assessments or merging them together (a protocol deviation) can be considered to salvage the analysis. Missing data are a common problem, occurring when patients do not complete all or some questionnaire items at the time of a scheduled HRQOL assessment. Missing data also occur when patients drop out from the HRQOL assessment due to death or progression, when HRQOL data are collected up to progression, or at the end of the period of clinical follow-up. Missing data may seriously bias the results: patients who very rapidly deteriorate are less likely to
fill out the questionnaires than the fitter ones who still complete them. Also, if patients progress or die more rapidly in one arm than in the other, the collection of HRQOL data will be stopped earlier in that group, and differences in patient HRQOL at later time points may simply reflect a more severe selection of patients in one arm compared with the other. Therefore, missing HRQOL data must be handled very carefully. The baseline characteristics and efficacy outcomes should be compared between patients included in and excluded from the HRQOL analyses. Reasons for missing HRQOL data should be presented if they were collected. If the proportion of missing data is nontrivial (exceeds 10% overall), the missingness mechanism should be further examined. Different techniques may be used to analyze HRQOL data collected over time: longitudinal modeling using linear mixed models, cross-sectional analysis (analysis of the patient HRQOL scores at a specific time point), or the use of summary statistics, which reduce the total HRQOL data to a single result per patient. Longitudinal modeling has the advantage of making use of the longitudinal relationship between the different time points and providing estimates of the evolution of HRQOL over the entire assessment period in the different study arms. This methodology is also the best suited to adjusting for missing data. Cross-sectional analysis and the use of summary statistics allow the application of simpler analysis techniques. The reduction in complexity is achieved by either selecting a specific time point for the treatment comparison or reducing the HRQOL data to one single statistic (e.g., maximum increase or decrease from baseline, or the proportion of patients experiencing an increase or decrease of a preset number of points from baseline over time). These methods may provide biased results in the presence of informative missing data.
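As an illustration of the summary-statistic approach, the sketch below computes each patient's worst change from baseline on a functional scale and the proportion of patients deteriorating by at least the MCID. The trajectories are hypothetical, and the 10-point threshold is the MCID discussed earlier in the chapter.

```python
MCID = 10  # minimal clinically important difference, in scale points

def worst_change_from_baseline(scores):
    """Most negative change from baseline on a functional scale.
    scores[0] is the baseline assessment; None marks a missing assessment."""
    follow_up = [s for s in scores[1:] if s is not None]
    if scores[0] is None or not follow_up:
        return None  # no evaluable change for this patient
    return min(follow_up) - scores[0]

# Hypothetical per-patient score trajectories (baseline first):
patients = {
    "A": [70, 65, 50, None],  # worst change: -20 points
    "B": [60, 70, 75, 80],    # never drops below baseline: +10 at worst
    "C": [55, 55, None, 50],  # worst change: -5 points
}

worst = {p: worst_change_from_baseline(s) for p, s in patients.items()}
deteriorated = sum(1 for c in worst.values() if c is not None and c <= -MCID)
print(worst)                                       # {'A': -20, 'B': 10, 'C': -5}
print(deteriorated, "of", len(patients), "patients worsened by at least the MCID")
```

Reducing each trajectory to one number makes the comparison simple, but as noted in the text, such statistics can be biased when dropout is informative, since the worst observed score understates deterioration in patients who stop completing forms.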
However, these methods can improve the statistical power to detect treatment differences (5). In addition, group changes are not always representative of individual patient changes: mean scores will appear stable over time if half of the patients improve and half deteriorate. Therefore, it is important to supplement a longitudinal analysis modeling the evolution of mean scores over time with, at a minimum, the proportion of patients experiencing a severe worsening or a considerable improvement from baseline during follow-up. Because missing HRQOL data may seriously bias results, it is strongly advised to produce supportive analyses to check the robustness of the main results (2, 5). These additional sensitivity analyses may consist of analyzing
HRQOL scales other than the primary selected scales, selecting patient subgroups, using different analysis techniques, or using imputation techniques for missing data for confirmation. When interpreting results, both statistical significance and the clinical relevance of the HRQOL scores should be considered; a clinically irrelevant difference may be statistically significant in a large sample. Here, the instrument-specific, predefined MCID should be used as the benchmark for interpreting the magnitude and relevance of the treatment comparison. When interpreting scores over time, researchers must be careful not to interpret global trends as absolute patient changes. In addition to the selective reporting bias caused by missing data, it is very probable that patients adjust their expectations over time, resulting in different levels of self-reporting. Such subjective change is known as response shift and can result in either increased or decreased scores (6). Therefore, it is advisable to compare the results with evidence from similar studies of the same cancer site(s) and comparable sample size. In addition to the analysis of HRQOL data over time for treatment comparison, a prognostic factor analysis of baseline HRQOL data can be undertaken to investigate the utility of HRQOL data in predicting patient outcomes such as overall survival (OS) and progression-free survival (PFS). Such analyses of complex multidimensional HRQOL data must be performed carefully (7, 8). See Table 27.1.
CHALLENGING TRIALS—A VIEW FROM THE PAST

One may ask why HRQOL researchers have to be so prescriptive in their guidance on design, analysis, and interpretation. We have learned over many years (particularly in the early 1980s, when HRQOL began to take a more significant role in RCTs) that many RCTs with an HRQOL component failed for a number of reasons. Although researchers had undertaken HRQOL measurements for years, few internationally validated and translated measures, such as the EORTC tool, were available; many studies used ad hoc or unvalidated measures. Such RCTs therefore could not be robust or allow for comparison. Furthermore, as noted above, the analysis of HRQOL data is complex, and in the 1980s limited attention was given to which methodology was most effective. Even basic issues, such as clinical significance (the ability to understand what trial effects mean for patients), were not addressed (3). EORTC had several aborted RCTs in genitourinary cancers and leukemia
simply because policy at the time did not make HRQOL a mandatory part of the protocols and the necessary, resource-intensive steps to monitor HRQOL data were not undertaken, resulting in low compliance; one trial had less than 40% baseline compliance. This changed in the 1990s. Since then, EORTC and other groups have made significant gains in methodology, practice, and policy (9). In EORTC RCTs, a baseline HRQOL assessment is now an eligibility criterion, showing the patient and investigator the value of HRQOL from day one of the study. The fact that the American Society of Clinical Oncology (ASCO), the Food and Drug Administration (FDA), and the European Medicines Agency (EMEA) now support HRQOL as a valid clinical endpoint in RCTs illustrates considerable progress.
SOME EXAMPLES OF SUCCESSFUL TRIALS

EORTC (10) examined the HRQOL of patients with advanced breast cancer treated with a standard anthracycline-based chemotherapy regimen versus a dose-intensified anthracycline regimen (EORTC trial 10921). HRQOL assessments were carried out before random assignment, then once a month for the first three months, and then every six months through month 54. Four hundred forty-eight patients were entered onto this trial, of whom 384 (86%) completed the baseline HRQOL questionnaire. The clinical results demonstrated no difference in survival between the two groups. The intensified treatment resulted in significantly poorer HRQOL outcomes during the first three months of follow-up. Thereafter, however, the HRQOL scores returned to baseline levels, and at 12 months no significant differences were observed between the treatment arms. During the remainder of follow-up, HRQOL levels were comparable between the study arms. The combined clinical and HRQOL results from this trial suggest that dose-intensive therapy achieves similar survival outcomes without sacrificing patient HRQOL. This trial gave patients a choice. If, for example, a patient has a busy working life with children and a lifestyle demanding rapid treatment regardless of toxicity, she can choose the more aggressive option. If the patient is elderly and has no economic needs, the choice may be the less aggressive therapy. Clearly, when the survival opportunity is essentially equal, HRQOL can help a practicing clinician give the patient choices. Taphoorn et al. (11) reported on the HRQOL component of a phase III RCT for glioblastoma patients conducted by the EORTC brain cancer and radiotherapy groups in collaboration with the National Cancer Institute of Canada (NCIC). In this study, 573 newly diagnosed glioblastoma multiforme (GBM)
patients were randomly assigned to radiotherapy alone or to radiotherapy with concomitant and adjuvant temozolomide. OS was the primary end point, with HRQOL examined at baseline and during treatment at 3-month intervals until progression. Changes from baseline in seven selected HRQOL domains were measured: fatigue, overall health, social functioning, emotional functioning, future uncertainty, insomnia, and communication deficit. The clinical results showed a significant 3-month median OS benefit for the temozolomide arm. Baseline HRQOL questionnaires were available for 89.7% of patients (n = 514). A significant difference (P = .005) was observed at the end of radiotherapy for only one of the seven HRQOL domains, social functioning, in favor of standard treatment with radiotherapy. During follow-up, however, significant, clinically meaningful differences were no longer observed between the standard arm and the temozolomide arm. The addition of temozolomide during and after external-beam radiotherapy for newly diagnosed GBM significantly improved survival and, importantly, did so without any negative long-term impact on HRQOL. These data contributed to setting a new standard of care. The FDA recently approved temozolomide for the treatment of GBM based on the results of this trial.
REPORTING AND EVALUATING PUBLISHED STUDIES OF HRQOL IN CANCER CLINICAL TRIALS

EORTC has been one of the few organizations to systematically evaluate the quality of reporting of HRQOL in RCTs across many disease sites, with the aim of providing guidance and improving the quality of reporting of HRQOL in cancer clinical trials. Initial findings indicated that RCTs with an HRQOL element were diverse in the quality of their reporting, some very poor, and often not of the standard needed to allow researchers to make comprehensible interpretations of the data. A series of systematic reviews in NSCLC (12), breast cancer (13), prostate cancer (14), and other disease sites clearly showed this. We have issued guidance on how to draft a robust RCT HRQOL paper, published in the Journal of Clinical Oncology (JCO), thereby, with others, trying to set standards for publishing high-quality trials (15). These guidelines specify much of what we have addressed above: the importance of having a hypothesis, the specification and use of reliable tools, undertaking and reporting sound analyses of the data, and presenting the data with clinical significance noted and in a way that is easy for clinicians
to interpret. However, recent work by Joly et al. (16) has indicated that RCTs with HRQOL components still need to improve their reporting of the HRQOL elements further to be of greater value to clinicians and to help guide treatment decisions.
CONCLUSIONS

As noted in this chapter, clinicians have informally assessed HRQOL for decades. More recently, this assessment has been formalized in RCTs using evidence-based guidance and robust tools. These robust, scientifically valid HRQOL tools can be added to almost any RCT. We hope that the practical, applied design and analysis approaches mentioned in this chapter will assist investigators who design or plan RCTs with an HRQOL component.
References
1. Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst. 1993;85:365–376.
2. Fayers P, Machin D. Quality of Life: Assessment, Analysis and Interpretation. Chichester, UK: Wiley & Sons; 2000.
3. Osoba D, Rodrigues G, Myles J, et al. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol. 1998;16:139–144.
4. Fairclough DL. Design and Analysis of Quality of Life Studies in Clinical Trials. New York: Chapman & Hall; 2002.
5. Fairclough DL. Patient reported outcomes as endpoints in medical research. Stat Meth Med Res. 2004;13(2):115–138.
6. Schwartz CE, Sprangers MA. Methodological approaches for assessing response shift in longitudinal health-related quality-of-life research. Soc Sci Med. 1999;48:1531–1548.
7. Mauer ME, Bottomley A, Coens C, Gotay CC. Prognostic factor analysis of health-related quality of life data in cancer: a statistical methodological evaluation. Expert Rev Pharmacoecon Outcomes Res. 2008;8(2):179–196.
8. Gotay CC, Kawamoto CT, Bottomley A, Efficace F. The prognostic significance of patient-reported outcomes in cancer clinical trials. J Clin Oncol. 2008;26(8):1355–1363.
9. Bottomley A, Aaronson NK. International perspective on health related quality of life research in cancer clinical trials: the European Organisation for Research and Treatment of Cancer experience. J Clin Oncol. 2007;25(32):5082–5086.
10. Bottomley A, Therasse P, Piccart MJ, et al. Health-related quality of life in survivors of locally advanced breast cancer: an international randomized controlled phase III trial. Lancet Oncol. 2005;6:287–294.
11. Taphoorn MJ, Stupp R, Coens C, et al. Health related quality of life in patients with glioblastoma: a randomized controlled trial. Lancet Oncol. 2005;6(12):937–944.
12. Bottomley A, Thomas R, Vanvoorden V, Ahmedzai S. Health related quality of life in non-small cell lung cancer: methodological issues in randomised controlled trials. J Clin Oncol. 2003;21(15):2989–2992.
13. Bottomley A, Vanvoorden V, Flechtner H, Therasse P, EORTC Quality of Life Group, EORTC Data Center. The challenges and achievements involved in implementing quality of life research in cancer clinical trials. Eur J Cancer. 2003;39:275–285.
14. Efficace F, Bottomley A, van Andel G. Health related quality of life in prostate carcinoma patients. Cancer. 2003;97(2):377–388.
15. Efficace F, Bottomley A, Osoba D, et al. Beyond the development of health-related quality of life (HRQOL) measures: a checklist for evaluating HRQOL outcomes in cancer clinical trials—does HRQOL evaluation in prostate cancer research inform clinical decision making? J Clin Oncol. 2003;21(18):3502–3511.
16. Joly F, Vardy J, Pintilie M, Tannock IF. Quality of life and/or symptom control in randomized clinical trials for patients with advanced cancer. Ann Oncol. 2007;18:1935–1942.
28
Clinical Trials Considerations in Special Populations
Susan Burdette-Radoux Hyman Muss
WHO ARE THE ELDERLY?

Traditionally, age 65 years and older has been used to define the U.S. elderly population, as this age has historically been used for entitlement to Social Security and Medicare benefits. However, the major improvement in life span during the second half of the twentieth century has resulted in a U.S. population that includes many healthy elders. Cancer in the United States is a disease of aging, with a median age at diagnosis of 67 years and a median age at death of 73 years. Currently, about 25% of new cancer diagnoses are in patients 65 to 74 years of age, about 22% in patients 75 to 84 years, and about 7.5% in those 85 years and older (1). Moreover, a healthy 65-year-old has an average remaining life expectancy of about 20 years, a healthy 75-year-old about 12 years, and a healthy 85-year-old about 6 years (2). The number of older patients with cancer will certainly increase as the U.S. population continues to age. In the past, many clinical trials excluded older patients, and even now, with more flexible eligibility criteria, older patients continue to be underrepresented (3, 4). Even outside the clinical trial setting, older patients have consistently received less screening, less aggressive surgery, and fewer systemic therapies for cancer (5–7). This undertreatment of older patients (resulting mainly from age bias) and the lack of geriatric education among cancer-focused health care professionals have resulted in a scarcity of data about the
management of cancer in the elderly and in poorer outcomes for some cancer sites (8, 9).

Factoring Comorbidity into Clinical Trials

The major difference between older and younger patients with cancer is that older patients commonly have coexisting illness, or comorbidity. Coexisting illness can compete with cancer in shortening life expectancy and can diminish the benefits of cancer treatment. To date, no highly accurate means of factoring comorbidity into clinical trials has been adopted for widespread use, although some Web-based programs such as Adjuvant! Online (www.adjuvantonline.com) allow comorbidity to be factored into treatment decisions for the adjuvant therapy of breast, colorectal, and lung cancer. For cancers with poor prognosis but for which curative therapy exists, using state-of-the-art cancer treatment is justified, as treatment benefits may outweigh risks. For palliative cancer therapy, factoring in comorbidity is more important, as the goals of treatment in this setting are control of the cancer while preserving the highest possible quality of life. Another major factor in developing trials for older patients is the effect of the proposed treatment on the patient's overall function as well as on other important domains affecting older patients (10). Functional evaluation includes assessment of activities of daily living (ADL), which include eating, bathing, dressing,
toileting, and getting in and out of a bed or a chair, as well as instrumental ADL (IADL), which include managing personal finances, meal preparation, shopping, traveling, doing housework, using the telephone, and taking medications (11, 12). A comprehensive geriatric assessment (CGA) includes a multidisciplinary evaluation of functional status, cognition, social support, psychological state, nutritional status, and medication use (polypharmacy). It is best performed by geriatric health care professionals, who can predict morbidity and mortality from cancer as well as from other illnesses (12). CGA results can be used to direct appropriate interventions designed to maintain function and improve quality of life for older cancer patients (13). The older patients most likely to benefit from CGA are those with substantial comorbidity, loss of function, or frailty. At present, the shortage of trained geriatricians, coupled with the increasing number of older cancer patients, has led to the development of shorter instruments that can accurately predict functional decline and risk of dying (14–16). This has become a major interest of cancer trialists and has led to the inclusion of these instruments in major phase II and phase III trials. For example, the Cancer and Leukemia Group B (CALGB) is testing a short, mostly self-administered CGA instrument in women 65 and older as a companion trial for patients entered into other National Cancer Institute (NCI) sponsored cooperative group trials (CALGB 360401; see www.cancer.gov). Increased use of such instruments holds great promise and may help predict treatment-related toxicity and the effect of cancer as an additional comorbidity for patients with other serious illnesses. More importantly, such instruments might eventually be accurate enough to help in treatment selection. In addition, developing models that can predict survival (e.g., as developed for prostate cancer) may be helpful in selection (17).
Although increasing age and increasing comorbidity are closely correlated, greater efforts are needed to train health care providers to focus on life expectancy, not age, when considering older patients for trials.

Designing Clinical Trials for Older Patients

Eligibility Criteria

The major issues to consider when designing a trial for older patients are listed in Table 28.1. Healthy older patients with life expectancies exceeding 5 to 10 years should be offered participation in state-of-the-art trials. In designing trials to include elders, specific attention must be paid to eligibility criteria, which should be as broad as possible (18, 19). First, there should be no upper age restrictions; using life expectancy as an eligibility criterion, such as “Estimated life expectancy
TABLE 28.1
Major Issues to Consider in Designing Trials for Elders.
1. What is the intent of the trial?
   a. Curative
   b. Palliative
2. Eligibility criteria: flexibility is key
3. Trial/statistical design
   a. Will there be enough older patients to generalize the result to an older group if this is a phase II or phase III trial for all age groups?
   b. Should the trial be designed for older patients only?
4. Potential barriers
   a. Logistics
   b. Personnel
   c. Funding
greater than 5 years” is a fairer and less restrictive means of accruing older patients. Second, one should not exclude patients on the basis of organ dysfunction unless the specific treatment being studied is metabolized by, or has an effect on, the organ in question. For instance, there is no reason to use specific renal function exclusion criteria in a trial of a drug that is completely metabolized in the liver. Similarly, hematologic, cardiac, hepatic, and other organ function thresholds should be omitted when not related to the treatment being evaluated. In one study, it was estimated that appropriately relaxing eligibility criteria would have increased elderly participation by 60% (19).

Treatment Intent

The treatment intent of the trial is important in factoring in elderly participation. For prevention and screening trials, it is imperative that only older patients who might derive survival benefits from the specific interventions being tested be offered participation. Older patients with short life expectancy or frailty should be excluded. For trials of potentially curative therapies, eligibility criteria should be as flexible as possible to include a larger percentage of older patients. Ironically, trials in early-stage cancer are less likely to accrue older patients than trials for advanced disease (19). Older patients should also be encouraged to participate in clinical trials of therapy for advanced cancer in which the goal is not cure. Having older patients in these trials is especially important since many are focused on testing new agents, and underrepresentation of elders limits the generalizability of the results to the population at risk. This may slow the adoption of effective new agents in the older population because of concerns about toxicity.
Should Trials be Designed Specifically for Older Patients?

A common debate among geriatric oncologists is whether to design clinical trials specifically for older patients. As in most debates, there is no right or wrong answer. Since so many of the newest and most beneficial cancer treatments have not been studied extensively in older patients, designing trials to determine the benefits of such treatments in elders seems an obvious strategy. However, retrospective reviews of adjuvant trials for colorectal and breast cancers have shown that healthy older patients who met the eligibility criteria of these trials derived proportional benefits in relapse and mortality reductions similar to those of younger patients, with moderate (but not overwhelming) increases in toxicity (20–23). Another issue relates to differences in the natural history of cancer among age groups. For example, older patients with acute myelogenous leukemia (AML) are more likely than younger patients to have tumor cytogenetics indicative of greater resistance to standard therapies. For this reason, trials specifically addressing elders with AML and using investigational therapy have been developed (24), frequently with excellent accrual (25). Likewise, trials focusing on frail elders with cancer are needed. Such patients are usually excluded from major trials, yet many are treated. Trials of less intensive but potentially beneficial therapy for this group are needed and should incorporate CGA to maximize the effect of treatment on function as well as on outcome. Another strategy for trial development in older patients is to use robust existing databases for hypothesis generation. The Surveillance, Epidemiology, and End Results (SEER) database (seer.cancer.gov) is a large, searchable resource containing a wealth of information concerning cancer treatment.
Combining the results of this database with the Centers for Medicare and Medicaid Services (CMS) database (www.cms.hhs.gov) is a means of exploring cancer issues in older patients that can be quite rewarding (26, 27). Another strategy is combining an NCI-sponsored cooperative group database with the CMS database (28). Analyses like these can provide exciting insights that can then be explored in more focused trials.

Statistical Considerations

Accruing elders to major trials remains a challenge. One way to improve accrual would be to require that a specific number of older patients be included in all phase II and phase III trials. The number of elders would be based on the trial end points. For example, in a phase III trial of two different treatments one might include enough older patients for an adequate test for
heterogeneity, that is, ruling in or out an age-related treatment interaction. However, such a requirement could substantially delay trial closure if elders are accrued slowly. Other strategies are obviously needed. Another issue relates to the drop-out rates of older patients on trials, which may be higher than those of younger patients. Enlarging the trial size to account for this increased drop-out rate should be considered.

Other Barriers to Trial Participation

Barriers to clinical trial participation in the United States have been well described and include: (1) aggressive treatments that are potentially too toxic for older patients; (2) the presence of comorbidity; (3) fewer trials aimed at older patients; (4) limited expectations of long-term benefit among patients, health care professionals, and families; and (5) a lack of financial, logistic, and social support for participation (29). Federally funded Medicare insurance currently covers the costs of standard cancer care for patients 65 years and older, including patients entered onto clinical trials. However, other financial and logistic issues remain. Although not offering a trial may represent good medical judgment on the part of the physician, it is frequently due to age bias alone. In one study of breast cancer patients conducted by the CALGB, 34% of patients older than 65 years were offered participation in a clinical trial available at their institution, compared with 68% of women younger than 65 years (P = .0004) (30). Among those offered trials, similar proportions of older (50%) and younger (56%) patients consented to participate. In this trial, physicians reported that the most important barriers to accrual of older patients were the presence of significant comorbid conditions not excluded by the protocol but that might affect treatment response, the potential for poor compliance, treatment-related toxicity, and not meeting the protocol eligibility criteria.
Oncologists suggested that the most effective interventions for improving accrual included making clinic personnel available to explain the trials to older patients and their families, and providing physicians with educational materials concerning potential treatment toxicity in this group (31). Financial barriers to trial participation still exist, especially barriers related to the logistics of care. NCI funding of all cancer clinical trials is currently inadequate, leading in some instances to caps on patient accrual for NCI-sponsored clinical trials. Industry-sponsored trials have occasionally focused on older patients and represent another source of funding. However, a review of the accrual of older patients on U.S. Food and Drug Administration (FDA) cancer drug registration trials
showed a significant underrepresentation of older patients on trials of new drugs or of new indications for older drugs (32). The proportions of patients ≥65 years, ≥70 years, and ≥75 years entered on these industry-sponsored registration trials, compared with the corresponding percentages with similar illness in the U.S. cancer population, were 36% versus 60%, 20% versus 46%, and 9% versus 31%, respectively. To overcome logistic barriers to accrual of older patients, additional funds are necessary to help provide transportation and support personnel.

Conclusions

Accruing older patients to clinical trials remains challenging but is essential, as cancer is a disease of aging and there will be more, and older, cancer patients to care for as the population ages. Flexible eligibility criteria, adequate staff to help explain the purpose of trials and to help with logistics, and education of health care personnel concerning the issues of elders and cancer are essential if we are to improve the accrual of elders to trials. Trial designs focused on elders are also appropriate to answer specific questions about treatment, especially for frail patients and those with substantial comorbidity. Improving clinical trial opportunities for elders is not an option but a necessity if we are to adequately explore the risks and benefits of new and emerging treatments in the largest population of Americans with cancer.
CLINICAL TRIALS IN POPULATIONS WITH ORGAN IMPAIRMENT

Cancer chemotherapy for patients with organ impairment is a difficult problem for clinicians. As the population ages and treatment for other diseases improves, more and more patients with significant comorbid conditions will become candidates for chemotherapy. There are few data on the appropriate management of patients with impaired organ function who need chemotherapy, and clinical trials in this setting are badly needed. Renal and hepatic impairment present particular challenges, since one or both of these organs are responsible for the metabolism of most chemotherapy drugs, and impairment of their function may substantially increase toxicity.

Renal Impairment

The incidence of renal impairment in the general population increases with age and is often underrecognized. With improvements in the treatment of end-stage renal disease, life expectancy has increased in this patient population. As a result, more patients may develop cancer after a diagnosis of renal disease, with as many as 6% of patients having cancer as a comorbidity at the time of initiation of hemodialysis (33). The incidence of renal insufficiency in patients diagnosed with cancer is high. A recent study conducted in France, the Renal Insufficiency and Anticancer Medications (IRMA) study (34), showed that more than 50% of 4,684 patients receiving chemotherapy had some degree of impaired renal function according to calculated creatinine clearance, whereas only 7% of these patients had an abnormal serum creatinine. In patients over 75 years of age, almost 75% had impaired renal function. In this study, 80% of the patients received an anticancer drug that was either nephrotoxic or required dose adjustment for renal impairment. Identification of patients with impaired renal function is the first step in obtaining data on the safety and efficacy of drugs in this patient population. Conversely, underestimating the number of patients with impaired renal function in a clinical trial may alter the results when testing drugs with renal metabolism, leading to an overestimate of toxicity. Serum creatinine measurement is an inadequate estimate of creatinine clearance or glomerular filtration rate; therefore, a calculation that takes additional variables into account should be used. Many formulae have been proposed for estimating creatinine clearance, the two most widely accepted being the Cockcroft-Gault formula (35) and the four-variable abbreviated Modification of Diet in Renal Disease (aMDRD) formula (36) (Table 28.2).
TABLE 28.2
Prediction Equations for Glomerular Filtration Rate Based on Serum Creatinine.

Cockcroft-Gault formula (35):
  CCr = ((140 − Age) × Weight) / (72 × SCr) × 0.85 (if female)

aMDRD formula (4-variable) (36):
  GFR = 186 × SCr^−1.154 × Age^−0.203 × 0.742 (if female) × 1.21 (if black)

Quadratic GFR equation (38):
  GFR = exp(1.911 + 5.249/SCr − 2.114/SCr^2 − 0.00686 × Age − 0.205 (if female))
  If SCr < 0.8 mg/dL, use 0.8 for SCr

CCr = creatinine clearance (mL/min); SCr = serum creatinine (mg/dL); GFR = glomerular filtration rate (mL/min per 1.73 m2); aMDRD = abbreviated Modification of Diet in Renal Disease. Age is in years.
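The equations in Table 28.2 are straightforward to encode directly; the following sketch (function names are ours) implements all three:

```python
import math

def cockcroft_gault(age, weight_kg, scr, female):
    """Cockcroft-Gault estimated creatinine clearance, mL/min (ref 35).

    age in years, weight_kg in kg, scr = serum creatinine in mg/dL.
    """
    ccr = ((140 - age) * weight_kg) / (72.0 * scr)
    return ccr * 0.85 if female else ccr

def amdrd(age, scr, female, black):
    """4-variable aMDRD estimated GFR, mL/min per 1.73 m2 (ref 36)."""
    gfr = 186.0 * scr ** -1.154 * age ** -0.203
    if female:
        gfr *= 0.742
    if black:
        gfr *= 1.21
    return gfr

def quadratic_gfr(age, scr, female):
    """Quadratic GFR equation of Rule et al. (ref 38); SCr floored at 0.8."""
    scr = max(scr, 0.8)
    return math.exp(
        1.911 + 5.249 / scr - 2.114 / scr ** 2
        - 0.00686 * age - (0.205 if female else 0.0)
    )

# Example: a 70-year-old, 72-kg man with SCr 1.0 mg/dL has a normal serum
# creatinine but an estimated creatinine clearance of only 70 mL/min.
print(round(cockcroft_gault(70, 72, 1.0, female=False)))  # 70
```

The worked example illustrates the IRMA finding quoted above: a serum creatinine within the normal range can coexist with a calculated clearance indicating impaired renal function.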
Choosing the most appropriate formula for estimating glomerular filtration rate (GFR) depends on the patient population and investigational agents under study. The Cockcroft-Gault formula has wide acceptance (and is easiest to use) but does not take into account as many variables as other formulae. The aMDRD is a more accurate estimate than the Cockcroft-Gault formula for patients over the age of 65 years (37). This formula, however, is normalized for an average surface area of 1.73, and therefore will be less accurate for patients at the extremes of body height or weight. The aMDRD formula was validated in a patient population with chronic kidney disease, and therefore may overestimate the number of patients with renal insufficiency in the general population. Rule et al. (38) have recently devised a quadratic GFR equation that correlates with other measures of GFR more accurately in a population with a lower incidence of renal insufficiency (see Table 28-2). Once renal function has been calculated, patients may be grouped into categories according to their degree of renal impairment. The Working Group of the National Kidney Foundation has developed clinical practice guidelines identifying five stages of renal function impairment according to GFR (Table 28.3) (39). These stage groupings provide a basis for studying pharmacokinetics (PKs) in renal impairment and making recommendations for dose adjustments based on alterations in PKs. Established guidelines for the use of chemotherapeutic agents in patients with renal impairment are listed in the prescribing information for each drug and in several reviews (40–42). TABLE 28.3
Stages of Chronic Kidney Disease (39).

STAGE   GFR (mL/min/1.73 m2)    DESCRIPTION
1       ≥90                     Kidney damage with normal or ↑ GFR
2       60–89                   Kidney damage with mild ↓ GFR
3       30–59                   Moderate ↓ GFR
4       15–29                   Severe ↓ GFR
5       <15 (or dialysis)       Kidney failure

GFR = glomerular filtration rate. Chronic kidney disease is defined as either kidney damage or GFR <60 mL/min/1.73 m2 for ≥3 months. Kidney damage is defined as pathologic abnormalities or markers of damage, including abnormalities in blood or urine tests or imaging studies (39).
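Assigning patients to these stages for eligibility screening or cohort assignment is easy to automate. The sketch below maps an estimated GFR to a K/DOQI stage; note that, per the definition above, stages 1 and 2 also require evidence of kidney damage, which this simple function does not check:

```python
def ckd_stage(gfr, on_dialysis=False):
    """Map estimated GFR (mL/min/1.73 m2) to the five K/DOQI stages (39)."""
    if on_dialysis or gfr < 15:
        return 5  # kidney failure
    if gfr < 30:
        return 4  # severe decrease in GFR
    if gfr < 60:
        return 3  # moderate decrease in GFR
    if gfr < 90:
        return 2  # mild decrease in GFR (with kidney damage)
    return 1      # normal or increased GFR (with kidney damage)
```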
TABLE 28.4
Common Chemotherapeutic Agents Requiring Dose Modification and Monitoring for Renal Insufficiency (42).

Antimetabolites
  Antifolates: Methotrexate
  Pyrimidine analogs: Capecitabine
  Purine analogs: Mercaptopurine, Fludarabine
  Hydroxyurea
Alkylating Agents
  Nitrogen mustard derivatives: Cyclophosphamide, Melphalan, Ifosfamide
  Nitrosoureas: Carmustine, Lomustine
  Metal salts: Carboplatin, Cisplatin, Oxaliplatin
  Aziridines: Thiotepa, Mitomycin
Natural Products
  Antibiotics: Bleomycin
  Topoisomerase inhibitors: Topotecan, Etoposide
Biologics
  Interleukin-2
Supportive Care Agents
  Bisphosphonates
  Antirejection medications: Cyclosporin, Tacrolimus
  Nonsteroidal anti-inflammatory agents
Table 28.4 lists chemotherapeutic agents known to require dose adjustment in renal impairment. Most drugs do not require adjustment when creatinine clearance is above 60 mL/min (stages 1 and 2); these patients can therefore safely be included in clinical trials of new drugs and drug combinations. Special attention must be paid to interactions with concomitant medications possessing potentially nephrotoxic properties, such as nonsteroidal anti-inflammatory agents. For many drugs, the PK profile in renal impairment is poorly characterized and dosing guidelines do
ONCOLOGY CLINICAL TRIALS
not exist, presenting a dilemma for clinicians and their patients with renal impairment who wish to avail themselves of these therapies. Because phase III trials typically exclude patients with renal impairment (CCr < 60 mL/min), newer drugs are less available to these patients owing to the lack of dosing and toxicity data. Clinical trials are clearly needed to expand the repertoire of drugs that can safely be used in patients with renal impairment. The FDA provides guidance to investigators who wish to perform such trials (43). A summary of its recommendations is listed in Table 28.5.
TABLE 28.5
Summary of FDA Guidelines for Clinical Trials in Patients with Organ Impairment (43).
1. Pharmacokinetic characterization should be assessed for all drugs likely to be affected by alterations in organ function.
2. For a full study design, approximately equal numbers of patients should be assessed in each functional level (Tables 28.3, 28.6, and 28.7), and groups should be similar to each other with respect to age, gender, and weight.
3. The number of patients enrolled in the study should be sufficient to detect pharmacokinetic differences large enough to warrant dosage adjustments.
4. Pharmacokinetic studies may be single-dose or multiple-dose studies. In single-dose studies of renal impairment, the same dose may be administered to all patients, since peak levels are usually not affected by renal excretion. In multiple-dose studies, enough doses should be given to achieve a steady state, and a phase I design should be employed to establish a maximum tolerated dose (MTD).
5. For agents that are judged unlikely to require dose adjustment, a reduced/staged study design may be employed, using only groups of patients at the extremes of organ function (i.e., those with normal and those with severely impaired function). If organ impairment does not alter pharmacokinetics sufficiently to warrant dose adjustment, then no further study is warranted.
6. Dose recommendations should be based on mathematical modeling of the relationship between organ function and relevant pharmacokinetic parameters.
More detailed guidance on study design, data analysis, and recommendations for dose adjustments is available on the FDA Center for Drug Evaluation and Research Web site at: http://www.fda.gov/cder/guidance/.
Hepatic Impairment

Hepatic impairment is commonly encountered in patients with cancer; it may be related to the disease itself or result from treatment toxicity. Because hepatic impairment is multifactorial, its incidence may vary greatly depending on the patient population. In primary cancers with a propensity to metastasize to the liver, hepatic impairment due to metastases may preclude the very treatment needed to control the disease. In other disease sites, patients may be susceptible to additional causes of liver impairment, such as hepatitis in patients requiring multiple blood transfusions after leukemia treatment. Many chemotherapy drugs are themselves potentially hepatotoxic, which may compromise continued treatment with active agents. Unlike patients with renal failure, who have the option of dialysis, patients with hepatic failure have only the option of orthotopic liver transplant, which may be contraindicated by their cancer diagnosis. Life expectancy of patients with more advanced liver dysfunction may be limited, requiring careful weighing of the risks and benefits of cancer treatment. Given all of these issues, it is crucially important to have accurate information on the safety and dosing of chemotherapy agents in the setting of hepatic impairment. In contrast to renal impairment, the degree of hepatic impairment is difficult to quantify because of the complexity of liver functions, and there is no general agreement on a system of classification. The classification system most commonly used is the Child-Pugh system (43) (Table 28.6). This system was designed for surgical evaluation of patients with alcoholic cirrhosis and has not been validated as a basis for clinical trials of chemotherapeutic agents. Some studies have simply used the bilirubin level as an indicator of liver function, but this approach fails to account for hepatic metabolism of drugs, which is distinct from biliary excretion.
The CALGB Pharmacology and Experimental Therapeutics Committee takes a tailored approach to the evaluation of organ function. For example, the committee divided patients into cohorts according to albumin and bilirubin for a study of erlotinib, but used bilirubin alone for a study of sorafenib (Table 28.7). Each of these trials included both patients with renal insufficiency and those with hepatic impairment, allowing data to be gathered on both situations for the study drug without the expense of two separate trials. Many of the recommendations for dose adjustments according to hepatic function are based on empiric observation rather than clinical trials. Specific dose adjustments can be found in the prescribing information for individual drugs. Drugs that can be given at full dose in the setting of mild to moderate hepatic impairment are listed in Table 28.8, and those that are
TABLE 28.6
Child-Pugh Classification of Hepatic Impairment (43).

                                   POINTS SCORED FOR OBSERVED FINDINGS
                                   1           2            3
Encephalopathy grade*              None        1 or 2       3 or 4
Ascites                            Absent      Slight       Moderate
Serum bilirubin, mg/dL             <2          2 to 3       >3
Serum albumin, g/dL                >3.5        2.8 to 3.5   <2.8
Prothrombin time, sec prolonged    <4          4 to 6       >6

CLASSIFICATION   POINTS SCORED   OPERATIVE RISK
A or mild        5 or 6          Good
B or moderate    7 to 9          Moderate
C or severe      10 to 15        Poor

* Grade 0: normal consciousness, personality, neurological examination, electroencephalogram. Grade 1: restless, sleep disturbed, irritable/agitated, tremor, impaired handwriting, 5 cps waves. Grade 2: lethargic, time-disoriented, inappropriate, asterixis, ataxia, slow triphasic waves. Grade 3: somnolent, stuporous, place-disoriented, hyperactive reflexes, rigidity, slower waves. Grade 4: unrousable coma, no personality/behavior, decerebrate, slow 2–3 cps delta activity.
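The Child-Pugh score in Table 28.6 is simple to compute programmatically. The sketch below (our own function, not part of any study protocol) totals the points and maps them to class A, B, or C:

```python
def child_pugh(enceph_grade, ascites, bilirubin, albumin, pt_prolonged):
    """Child-Pugh points and class per Table 28.6.

    enceph_grade: encephalopathy grade, 0-4
    ascites: 'absent', 'slight', or 'moderate'
    bilirubin (mg/dL), albumin (g/dL), pt_prolonged (seconds prolonged)
    """
    points = 0
    points += 1 if enceph_grade == 0 else (2 if enceph_grade <= 2 else 3)
    points += {"absent": 1, "slight": 2, "moderate": 3}[ascites]
    points += 1 if bilirubin < 2 else (2 if bilirubin <= 3 else 3)
    points += 1 if albumin > 3.5 else (2 if albumin >= 2.8 else 3)
    points += 1 if pt_prolonged < 4 else (2 if pt_prolonged <= 6 else 3)
    cls = "A" if points <= 6 else ("B" if points <= 9 else "C")
    return points, cls
```

For example, a patient with no encephalopathy or ascites, bilirubin 1.0 mg/dL, albumin 4.0 g/dL, and a prothrombin time prolonged by 2 seconds scores 5 points, class A (mild).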
TABLE 28.7
Examples of CALGB Classification of Hepatic and Renal Impairment.

CALGB 60301: Pharmacokinetic and Phase I Study of Sorafenib for Solid Tumors and Hematologic Malignancies in Patients with Hepatic or Renal Dysfunction (45)

COHORT* DESCRIPTION       HEPATIC                                              RENAL
Normal organ function     1) Bil ≤ ULN and AST ≤ ULN and CreatCl ≥ 60 mL/min
Mild dysfunction          2) Bil > ULN but ≤1.5 × ULN and/or AST > ULN         3) CreatCl 40–59 mL/min
Moderate dysfunction      4) Bil > 1.5 × ULN to ≤3 × ULN and any AST           5) CreatCl 20–39 mL/min
Severe dysfunction        6) Bil > 3 × ULN to 10 × ULN and any AST             7) CreatCl < 20 mL/min
Very severe dysfunction   8) Albumin < 2.5 g/dL, any Bil and any AST           9) Hemodialysis, any CreatCl

CALGB 60101: Phase I Study of OSI-774 for Solid Tumors in Patients with Hepatic or Renal Dysfunction (46)

COHORT*   ALBUMIN   BILIRUBIN   AST        CREATININE
1         <2.5      <1.0        Any        Normal
2         Any       1.0–7.0     Any        Normal
3         >2.5      <1.0        <3 × ULN   2.5–5.0 mg/dL

OSI-774 = erlotinib.
* Patients must meet criteria for only one cohort (i.e., patients with both elevated liver function tests and reduced creatinine clearance are not eligible).
TABLE 28.8
Drugs That Do Not Require Dose Adjustment for Mild to Moderately Impaired Hepatic Function (47, 48).

Antimetabolites
  Pyrimidine analogs: Capecitabine, Fluorouracil, Gemcitabine
Alkylating Agents
  Nitrogen mustard derivatives: Cyclophosphamide, Mechlorethamine, Melphalan
  Metal salts: Carboplatin, Cisplatin, Oxaliplatin
Natural Products
  Antibiotics: Bleomycin, Mitomycin
  Topoisomerase inhibitors: Topotecan

TABLE 28.9
Drugs Known to Require Dose Adjustment for Moderately to Severely Impaired Hepatic Function (47, 48).

Antimetabolites
  Antifolates
  Pyrimidine analogs: 5-Fluorouracil, Cytarabine, Gemcitabine
Alkylating Agents
  Hydrazine derivatives: Procarbazine
Natural Products
  Antibiotics:
    Anthracyclines: Daunorubicin, Doxorubicin, Epirubicin
    Dactinomycin
  Taxanes: Docetaxel, Paclitaxel
  Topoisomerase inhibitors: Irinotecan, Etoposide
  Vinca alkaloids: Vinblastine, Vincristine, Vinorelbine

known to require dose adjustment are listed in Table 28.9. Note that some drugs are listed in both tables; for these drugs, full doses may be given in mild to moderate, but not in moderate to severe, hepatic impairment. Because rigorous trials have not been performed for many agents, recommendations made by different investigators do not always agree. Some drugs cannot be given safely at any dose in patients with severely abnormal liver function, and in this situation alternative treatments or supportive care must be considered. PK studies according to FDA guidelines (see Table 28.5) are essential for clarifying safe dosing parameters for existing drugs and for establishing safety and dosing guidelines for new therapeutic agents.

Other Organ Impairment

Impairment of organs other than the kidney or liver can limit the type of therapeutic agent available for certain patient populations. For example, impairment of cardiac function requires limiting exposure to anthracyclines, trastuzumab, and high-dose cyclophosphamide. Abnormal pulmonary function may preclude the use of bleomycin and nitrosoureas. Methotrexate should not be given to patients with pleural effusions because of a reservoir effect that alters the PKs of the drug and increases toxicity. When such drugs are employed in the clinical trial setting, patients must be screened for the appropriate
organ function prior to enrollment, and those patients with normal organ function must be monitored for the development of organ impairment during treatment. Appropriate screening parameters and an appropriate schedule of evaluation for the specific organ function in question should be incorporated into the clinical trial design. The development of drugs designed to counteract specific toxicities (e.g., cardioprotectants) is an important area of ongoing research aimed at offering treatment to patients who would otherwise not be able to receive potentially beneficial therapy. A thorough knowledge of the toxicities of each drug used in a clinical trial is important to appropriate trial design, eligibility criteria, and monitoring of patients during their course of treatment.

Challenges of Clinical Trials in Special Populations

Conducting clinical trials in the elderly population and in patients with organ impairment remains a challenge for the investigator. As the general population ages, and treatment of cancer and other diseases improves, oncologists will be facing increasing numbers of
patients with comorbidities who are candidates for cancer treatment. Because many clinical trials have excluded these patients in the past, data on their appropriate management have lagged behind advances in treatment for the general population. Well-designed clinical trials addressing these issues will provide the basis for offering a full spectrum of cancer treatment to these special populations in the future.
References

1. Surveillance Epidemiology and End Results Program (SEER). http://seer.cancer.gov/statfacts/html/all. 2008.
2. U.S. Life Expectancy. United States Life Tables, 2003. http://www.cdc.gov/nchs/data/statab/lewk3_2003.pdf. 2008.
3. Hutchins LF, Unger JM, Crowley JJ, Coltman CA Jr., Albain KS. Underrepresentation of patients 65 years of age or older in cancer-treatment trials. N Engl J Med. 1999;341:2061–2067.
4. Sateren WB, Trimble EL, Abrams J, et al. How sociodemographics, presence of oncology specialists, and hospital cancer programs affect accrual to cancer treatment trials. J Clin Oncol. 2002;20:2109–2117.
5. Mor V, Masterson-Allen S, Goldberg RJ, Cummings FJ, Glicksman AS, Fretwell MD. Relationship between age at diagnosis and treatments received by cancer patients. J Am Geriatr Soc. 1985;33:585–589.
6. Asch SM, Sloss EM, Hogan C, Brook RH, Kravitz RL. Measuring underuse of necessary care among elderly Medicare beneficiaries using inpatient and outpatient claims. JAMA. 2000;284:2325–2333.
7. Townsley C, Pond GR, Peloza B, et al. Analysis of treatment practices for elderly cancer patients in Ontario, Canada. J Clin Oncol. 2005;23:3802–3810.
8. Eaker S, Dickman PW, Bergkvist L, Holmberg L. Differences in management of older women influence breast cancer survival: results from a population-based database in Sweden. PLoS Med. 2006;3:e25.
9. Hebert-Croteau N, Brisson J, Latreille J, Rivard M, Abdelaziz N, Martin G. Compliance with consensus recommendations for systemic therapy is associated with improved survival of women with node-negative breast cancer. J Clin Oncol. 2004;22:3685–3693.
10. Extermann M, Hurria A. Comprehensive geriatric assessment for older patients with cancer. J Clin Oncol. 2007;25:1824–1831.
11. Katz S, Ford AB, Jackson BA, Jaffe MW. Studies of illness in the aged. The index of ADL: a standardized measure of biological and psychosocial function. JAMA. 1963;185:914–919.
12. Lawton MP, Brody EM. Assessment of older people: self-maintaining and instrumental activities of daily living. Gerontologist. 1969;9:179–186.
13. Maas HA, Janssen-Heijnen ML, Olde Rikkert MG, Machteld Wymenga AN. Comprehensive geriatric assessment and its clinical impact in oncology. Eur J Cancer. 2007;43:2161–2169.
14. Saliba D, Elliott M, Rubenstein LZ, et al. The Vulnerable Elders Survey: a tool for identifying vulnerable older people in the community. J Am Geriatr Soc. 2001;49:1691–1699.
15. Hurria A, Gupta S, Zauderer M, et al. Developing a cancer-specific geriatric assessment: a feasibility study. Cancer. 2005;104:1998–2005.
16. Rodin MB, Mohile SG. A practical approach to geriatric assessment in oncology. J Clin Oncol. 2007;25:1936–1944.
17. Halabi S, Small EJ, Kantoff PW, et al. Prognostic model for predicting survival in men with hormone-refractory metastatic prostate cancer. J Clin Oncol. 2003;21:1232–1237.
18. George SL. Reducing patient eligibility criteria in cancer clinical trials. J Clin Oncol. 1996;14:1364–1370.
19. Lewis JH, Kilgore ML, Goldman DP, et al. Participation of patients 65 years of age or older in cancer clinical trials. J Clin Oncol. 2003;21:1383–1389.
20. Sargent DJ, Goldberg RM, Jacobsen SD, et al. Adjuvant chemotherapy for colon cancer in elderly patients: a pooled analysis of 3,351 patients. N Engl J Med. 2001; in press.
21. Muss HB, Woolf S, Berry D, et al. Adjuvant chemotherapy in older and younger women with lymph node-positive breast cancer. JAMA. 2005;293:1073–1081.
22. Muss HB, Biganzoli L, Sargent DJ, Aapro M. Adjuvant therapy in the elderly: making the right decision. J Clin Oncol. 2007;25:1870–1875.
23. Kumar A, Soares HP, Balducci L, Djulbegovic B. Treatment tolerance and efficacy in geriatric oncology: a systematic review of phase III randomized trials conducted by five National Cancer Institute-sponsored cooperative groups. J Clin Oncol. 2007;25:1272–1276.
24. Estey E. Acute myeloid leukemia and myelodysplastic syndromes in older patients. J Clin Oncol. 2007;25:1908–1915.
25. Stone RM, Berg DT, George SL, et al. Postremission therapy in older patients with de novo acute myeloid leukemia: a randomized trial comparing mitoxantrone and intermediate-dose cytarabine with standard-dose cytarabine. Blood. 2001;98:548–553.
26. Patt DA, Goodwin JS, Kuo YF, et al. Cardiac morbidity of adjuvant radiotherapy for breast cancer. J Clin Oncol. 2005;23:7475–7482.
27. Patt DA, Duan Z, Fang S, Hortobagyi GN, Giordano SH. Acute myeloid leukemia after adjuvant breast cancer therapy in older women: understanding risk. J Clin Oncol. 2007;25:3871–3876.
28. Lamont EB, Herndon JE, Weeks JC, et al. Measuring disease-free survival and cancer relapse using Medicare claims from CALGB breast cancer trial participants (companion to 9344). J Natl Cancer Inst. 2006;98:1335–1338.
29. Trimble EL, Carter CL, Cain D, Freidlin B, Ungerleider RS, Friedman MA. Representation of older patients in cancer treatment trials. Cancer. 1994;74:2208–2214.
30. Kemeny MM, Peterson BL, Kornblith AB, et al. Barriers to clinical trial participation by older women with breast cancer. J Clin Oncol. 2003;21:2268–2275.
31. Kornblith AB, Kemeny M, Peterson BL, et al. Survey of oncologists' perceptions of barriers to accrual of older patients with breast carcinoma to clinical trials. Cancer. 2002;95:989–996.
32. Talarico L, Chen G, Pazdur R. Enrollment of elderly patients in clinical trials for cancer drug registration: a 7-year experience by the US Food and Drug Administration. J Clin Oncol. 2004;22:4626–4631.
33. US Renal Data System. USRDS 2003 Annual Data Report: Atlas of End-Stage Renal Disease in the United States. Bethesda, MD: National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases; 2003.
34. Launay-Vacher V, Oudard S, Janus N, et al. Prevalence of renal insufficiency in cancer patients and implications for anticancer drug management. Cancer. 2007;110:1376–1384.
35. Cockcroft DW, Gault MH. Prediction of creatinine clearance from serum creatinine. Nephron. 1976;16:31–41.
36. Levey AS, Bosch JP, Lewis JB, Greene T, Rogers N, Roth D. A more accurate method to estimate glomerular filtration rate from serum creatinine: a new prediction equation. Modification of Diet in Renal Disease Study Group. Ann Intern Med. 1999;130:461–470.
37. Verhave JC, Fesler P, Ribstein J, du Cailar G, Mimran A. Estimation of renal function in subjects with normal serum creatinine levels: influence of age and body mass index. Am J Kidney Dis. 2005;46:233–241.
38. Rule AD, Larson TS, Bergstralh EJ, Slezak JM, Jacobsen SJ, Cosio FG. Using serum creatinine to estimate glomerular filtration rate: accuracy in good health and in chronic kidney disease. Ann Intern Med. 2004;141:929–937.
39. National Kidney Foundation. K/DOQI clinical practice guidelines for chronic kidney disease: evaluation, classification and stratification. Am J Kidney Dis. 2002;39:S1–266.
40. Aronoff FR, ed. Drug Prescribing in Renal Failure: Dosing Guidelines for Adults. 4th ed. Philadelphia, PA: American College of Physicians; 1999.
41. Tomita M, Aoki Y, Tanaka K. Effect of haemodialysis on the pharmacokinetics of antineoplastic drugs. Clin Pharmacokinet. 2004;43:515–527.
42. Eneman JD, Philips GK. Cancer management in patients with end-stage renal disease. Oncology. 2005;19:1199–1212.
43. U.S. Food and Drug Administration, Center for Drug Evaluation and Research. Guidance Documents. http://www.fda.gov/cder/guidance/. 2008.
44. Levey AS, Greene T, Kusek JW, Beck GJ; MDRD Study Group. Simplified equation to predict glomerular filtration rate from serum creatinine. J Am Soc Nephrol. 2000;11:828A (abstract).
45. Miller AA, Murry DJ, Owzar K, et al. Phase I and pharmacokinetic study of sorafenib in patients with hepatic or renal dysfunction: CALGB 60301. 2008 (manuscript submitted for publication).
46. Miller AA, Murry DJ, Owzar K, et al. Phase I and pharmacokinetic study of erlotinib for solid tumors in patients with hepatic or renal dysfunction: CALGB 60101. J Clin Oncol. 2007;25:3055–3060.
47. Eklund JW, Trifilio S, Mulcahy MF. Chemotherapy dosing in the setting of liver dysfunction. Oncology. 2005;19:1057–1063; discussion 1063–1064, 1069.
48. King PD, Perry MC. Hepatotoxicity of chemotherapy. Oncologist. 2001;6:162–176.
29
A Critical Reader’s Guide to Cost-Effectiveness Analysis
Greg Samsa
Most cancer therapies involve trade-offs. For example, aggressive therapy of a uniformly fatal condition such as stage 4 cancer of the pancreas might gain patients an average of one extra month of life, in exchange for the inconvenience and possible complications of the therapy, their negative impact on quality of life, and the economic cost. Cost-effectiveness analysis (CEA) is a systematic way of exploring these trade-offs. CEA is a specialized branch of applied health economics. Indeed, CEA is sufficiently specialized that the typical trial investigator will not perform the CEA him- or herself, but instead will engage an expert collaborator. This collaboration should begin during study design, in order to guarantee that all of the data elements needed for the CEA will be collected within the trial, and at a sufficient level of accuracy. Although CEA is too substantial to ever become entirely routine, various organizations are making progress in helping to standardize many of its elements. The interested reader is referred to, among others, the set of standards proposed by ISPOR (1). Drummond's book (2) provides a readable introduction on how to perform a CEA, and journals such as Pharmacoeconomics and Value in Health provide model manuscripts that the new analyst can emulate. The following references are some examples of cost-effectiveness analyses within the discipline of cancer published in Value in Health during 2007–2008 (3–6).
In view of the many excellent resources alluded to above, the purpose of this chapter is not to recapitulate precisely how a CEA is performed. Instead, it is to provide the critical reader with a few crucial questions to ask in order to determine whether a CEA that he or she encounters is sufficiently sound to be persuasive. In practice, these critical questions might be used in conjunction with a more comprehensive checklist such as that provided by Drummond. Whereas the latter checklist will reassure the reader that nothing has been overlooked, the purpose of our recommendations is different: namely, to point the reader toward those elements of a CEA that are often the most crucial. Our examples are stylized versions of CEA for various types of cancer-related interventions. For clarity of exposition, these examples simplify the clinical issues and cost-effectiveness analyses being studied.
EXAMPLE INTERVENTIONS

We consider two examples of cancer-related interventions, both of which we assume have been demonstrated in randomized trials to be efficacious. What remains to be seen is whether they are worth it. This is the purpose of CEA. Intervention 1 is a new drug for stage 4 cancer of the pancreas. The drug is given over 1 month, and
during this month all patients suffer extreme fatigue and nausea, but none of these symptoms are life threatening. Almost every patient gains a survival benefit, which on average is 1 extra month of life. The intervention is expensive. The trade-off is the extra survival versus the side effects and cost of the treatment. Put another way, patients can choose to pay for the new drug, in which case they will gain an extra month of life, but a month during which they will suffer extreme fatigue and nausea.

Intervention 2 is a program of palliative care planning for patients with terminal cancer being considered by a managed care organization. The intervention consists of various interdisciplinary meetings, with a focus on pain management, patient education, family education, and hospice planning. The intervention does not increase survival, but should improve quality of life, primarily through better pain management. The trade-off is the improved quality of life for patients versus the net cost of the program to the managed care organization.

CRITICAL QUESTIONS

In our experience, the six critical questions to ask are as follows:
1. Does the CEA model reflect the most important trade-offs?
2. Does the CEA include all the outcomes of interest to the relevant stakeholders?
3. Is the time frame appropriate?
4. Are all critical inputs estimated sufficiently well?
5. How should outcomes be summarized?
6. How sensitive are the conclusions to changes in the critical assumptions?
PANCREATIC CANCER EXAMPLE

The CEA is to be attached to a randomized trial. In this trial, patients were randomized to either the new treatment or usual care, and the results (reported on
a per-patient basis) were as reported in Table 29.1. Thus, in quantitative terms, patients receiving the new treatment obtain 1 extra month of life in exchange for $5,000 more in treatment costs and a certainty of extreme fatigue or nausea during the month of treatment. All medical costs are included.

Does the CEA Model Reflect the Most Important Trade-Offs?

Here, the trade-off is the extra survival versus the side effects and cost of the treatment. The model explicitly includes all of these elements.

Does the CEA Model Reflect All the Outcomes of Interest to the Relevant Stakeholders?

Answering this question first requires identifying the relevant stakeholders. The usual reference case for a CEA assumes that the stakeholder is society (i.e., the societal perspective), the implication of which is that all medically related costs are included regardless of their source. That is, there is no accounting of who pays for what. The statement that all medical costs are included implicitly takes the societal perspective. The other obvious perspective in this CEA is that of the patient. Indeed, Table 29.1 contains essentially all the information needed to counsel an individual patient regarding the decision. If the costs are paid by an insurer, the decision amounts to quantity of life versus quality of life and the patient's individual preference regarding the certainty of having extreme fatigue and nausea for the extra month of life that the treatment offers. If the costs of the drug are not paid by an insurer, the trade-off would also include the extra $5,000 associated with the therapy. (In practice, of course, outcomes are not known to this level of certainty, and this uncertainty would also be part of the patient's decision.) When presented with this trade-off, different patients would make different decisions based on their individual values. The difference between counseling an individual patient and a CEA is that the latter is
TABLE 29.1
Randomized Trial Results (Per Patient).

                                                    NEW TREATMENT   USUAL CARE   DIFFERENCE
Survival (months)                                         7              6            1
Extreme fatigue and nausea during treatment month       100%            0%          100%
Cost of treatment                                     $35,000        $30,000       $5,000
intended to determine what the treatment strategy should be for populations of patients or (equivalently) for a typical or average patient.

Is the Time Frame Appropriate?

The time frame of the above model is time until death, which occurs in the short term (e.g., within a year) for all patients. In general, CEA models with a short time frame are simpler than those with a longer time frame. One reason is that there is no need to consider the question of discounting, which (for example) recognizes that a dollar saved today is more valuable than a dollar saved some time in the future. Another circumstance where the time frame of a CEA must be extended is when doing so is required in order to fully evaluate the benefits of the various strategies being studied. For example, suppose that a new device for the diagnosis of melanoma is to be assessed. To model the immediate impact of the device (in comparison to usual unaided physician diagnosis), the analyst must know the prevalence of melanoma, the sensitivity and specificity of unaided physician diagnosis, the sensitivity and specificity of the new device, the cost of using the device, and the cost of the biopsies associated with making a definitive diagnosis. The trade-off might be described as follows: per 10,000 patients, at $100 per patient, the use of the device would cost an extra $1 million. Because the device is more sensitive and more specific, it would result in 1,000 fewer biopsies; at $500 per biopsy this would save $500,000. Moreover, the device would identify an additional 50 melanomas: 20 in situ, 15 stage 1, 10 stage 2, and 5 stage 3–4. Thus, the immediate trade-off per 10,000 patients is spending an additional $500,000 (i.e., the cost of using the device minus the offset associated with fewer biopsies) in exchange for identifying an additional 50 melanomas. The trade-off is thus $10,000 per additional melanoma identified.
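The immediate trade-off arithmetic can be laid out explicitly; a sketch in Python (illustrative variable names only, using the hypothetical inputs above):

```python
patients = 10_000
device_cost = 100 * patients              # $100 per patient across the cohort
biopsies_avoided = 1_000
biopsy_savings = biopsies_avoided * 500   # $500 per biopsy avoided
extra_melanomas = 50                      # additional cancers identified

net_outlay = device_cost - biopsy_savings          # $500,000 net immediate cost
cost_per_melanoma = net_outlay / extra_melanomas   # $10,000 per extra melanoma found
```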
A more comprehensive evaluation, however, must take into account the value of early detection—in other words, by quantifying what is gained by identifying 50 melanomas early. This would require extending the time frame of the model into the future. Extending the model into the future requires making various assumptions about how soon the cases that are missed by unaided physician diagnosis will progress before eventually being diagnosed (i.e., thus quantifying the harm associated with a missed diagnosis). For an actual melanoma model, any such assumptions would be speculative, but for purposes of illustration we assume that all undetected melanomas will progress
TABLE 29.2
Melanoma Life Expectancy and Cost.

STAGE       LIFE EXPECTANCY (YEARS)   COST ($)
In situ              40                 2,000
Stage 1              35                 5,000
Stage 2              20                25,000
Stage 3–4             2               100,000
a single stage before being subsequently diagnosed. Also suppose that the patient in question is 40 years old, with a 40-year life expectancy without cancer. Further assume that the life expectancy and associated melanoma-related medical costs (by stage of melanoma) are as provided in Table 29.2. For the 50 additional cases diagnosed by the new device (i.e., whose distribution is 20 in situ, 15 stage 1, 10 stage 2, and 5 stage 3–4), the distribution of stage under unaided physician diagnosis is 0 in situ, 20 stage 1, 15 stage 2, and 15 stage 3–4. The life expectancy for the physician-diagnosed cases is (0)(40) + (20)(35) + (15)(20) + (15)(2) = 1,030 years. The life expectancy for the device-diagnosed cases is (20)(40) + (15)(35) + (10)(20) + (5)(2) = 1,535 years. The difference in cost is obtained by similar weighted-average calculations. That is, the expected cost of the physician-diagnosed cases is (0)(2,000) + (20)(5,000) + (15)(25,000) + (15)(100,000) = $1,975,000. The cost for the device-diagnosed cases is (20)(2,000) + (15)(5,000) + (10)(25,000) + (5)(100,000) = $865,000. After subtraction, per 10,000 patients the device is associated with a gain of 505 life-years and a long-term savings of $1,110,000, which would more than offset the extra $500,000 in immediate outlays. The point of this example is not any specific insight about devices for diagnosing melanoma, as the inputs are hypothetical, but to illustrate how long-term follow-up can be included in a CEA.

Are All the Critical Inputs Estimated Sufficiently Well?

Answering this question first requires identifying the critical inputs to the CEA. These inputs are usually survival, quality of life (sometimes), and costs. For a short-term CEA such as the pancreatic cancer example, survival will be measured directly in the randomized trial used to assess the efficacy of the intervention under study. Longer-term follow-up might also require additional assumptions, as in the melanoma example above.
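These weighted averages can be reproduced mechanically, which is a useful check on hand arithmetic. The sketch below (Python, with names of our own choosing) recomputes the life-year and cost totals from the stage-specific inputs of Table 29.2:

```python
# Stage-specific inputs from Table 29.2 (hypothetical values, per the text)
life_exp = {"in situ": 40, "stage 1": 35, "stage 2": 20, "stage 3-4": 2}
cost = {"in situ": 2_000, "stage 1": 5_000, "stage 2": 25_000, "stage 3-4": 100_000}

# Stage distribution of the 50 extra cases under each diagnostic strategy
device = {"in situ": 20, "stage 1": 15, "stage 2": 10, "stage 3-4": 5}
physician = {"in situ": 0, "stage 1": 20, "stage 2": 15, "stage 3-4": 15}

def totals(counts):
    """Weighted totals of life-years and dollars over a stage distribution."""
    life_years = sum(n * life_exp[s] for s, n in counts.items())
    dollars = sum(n * cost[s] for s, n in counts.items())
    return life_years, dollars

dev_ly, dev_cost = totals(device)          # 1,535 life-years, $865,000
phys_ly, phys_cost = totals(physician)     # 1,030 life-years, $1,975,000
life_years_gained = dev_ly - phys_ly       # 505 life-years per 10,000 patients
long_term_savings = phys_cost - dev_cost   # $1,110,000 per 10,000 patients
```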
ONCOLOGY CLINICAL TRIALS
In CEA, analysts often represent quality of life using utilities, which are scaled so that 0 denotes death and 1 denotes perfect health, with all other health states falling in between. It should be noted that utilities are, in general, difficult to elicit precisely (and, indeed, the closer one attempts to measure the theoretical construct of interest to economists, the harder utilities become to elicit). However, it is often the case that rough estimates are adequate. In any case, if each day of life with extreme fatigue and nausea has a utility of 0.5, then 2 days in this health state are equivalent to 1 day in perfect health. Thus, the intervention adds 1 month of life expectancy and 0.5 months of quality-adjusted life expectancy.

Costs are sometimes measured directly (e.g., as out-of-pocket costs for a patient or an insurer), but are more typically obtained by recording utilization and then multiplying this utilization by a price per unit. For example, the unit price of a day in the hospital might be $1,000, and the unit price of a 1-month supply of a drug might be $4,000. In a CEA, total costs are the sum of the costs of implementing each management strategy, plus any additional costs that are induced by the differences in patterns of care or outcomes that the strategies produce. In the pancreatic cancer example, the additional costs associated with the new drug are $4,000 (i.e., the cost of a monthly course of the drug plus the costs of any additional staff time and supplies required for dosing). The additional costs associated with differences in outcome are $1,000 in drug-related costs required to manage extreme fatigue and nausea. Often, the estimation of the cost inputs is simplified by the fact that the trade-off between the two treatment strategies only utilizes the right-most column of Table 29.1—that is, it is only the incremental difference in costs between the two treatment strategies that matters.
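The utility and cost arithmetic for the pancreatic cancer example can be made concrete in a short sketch; all inputs are the chapter's hypothetical values.

```python
# Pancreatic cancer example (hypothetical values from the text).
# Utilities are scaled so that 0 = death and 1 = perfect health.
utility_fatigue_nausea = 0.5   # each day in this state = half a perfect-health day

added_life_months = 1          # life expectancy gained, spent with fatigue/nausea
added_qal_months = added_life_months * utility_fatigue_nausea

drug_cost = 4_000              # monthly course of the drug, incl. dosing costs
toxicity_cost = 1_000          # drugs to manage extreme fatigue and nausea
incremental_cost = drug_cost + toxicity_cost

print(added_qal_months)   # quality-adjusted months gained
print(incremental_cost)   # incremental cost in dollars
```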
This cancellation implies that costs that are identical between the two groups need not necessarily be measured precisely (or at all). Moreover, only those costs that are likely to be greatly different between the groups require careful accounting. In practice, a balance is usually struck between the benefits of a complete accounting of costs (e.g., to exclude the possibility of unexpected differences between the groups) and the burden of data collection. Translation of utilization into costs can become highly detailed and, in general, can proceed from the bottom up (e.g., by recording each component of a hospital day such as nursing time, supplies, etc.) or the top down (e.g., by assigning Medicare’s daily rate for the condition in question). Moreover, unit costs can be either based on those of the facilities that were studied in the underlying randomized trial, or made more
general by applying national averages or another similarly normative set of unit prices. As a critical reader, first use the information in the previous paragraph to identify all the elements of utilization that require data collection. Second, make sure that the unit prices for these elements of utilization are presented, and that they are intuitively reasonable. If the reader can identify and defend how the main elements of cost—particularly the main elements of any differences in cost between the strategies being compared—are derived, then things are probably all right. However efficacy is defined (e.g., in the pancreatic cancer example, efficacy is the improvement in survival), it is critical that it be estimated accurately. If the intervention in question is strongly efficacious, it will almost always turn out to be cost-effective unless its cost is exorbitant, and the CEA exercise focuses on estimating the magnitude of cost-effectiveness. If the intervention in question is not demonstrably efficacious, the CEA will not be convincing to a reader in any event. CEA is therefore most valuable when the efficacy of an intervention is demonstrable but relatively modest. CEA is also helpful when the intervention involves a trade-off of two distinct elements—for example, between quantity and quality of life.

How Should Outcomes be Summarized?

Summarization of outcomes is not important to individual patients (indeed, for an individual patient, the more detail about the trade-off the better), but it is important to those who make decisions at a higher level of aggregation. One possible summary measure for the pancreatic cancer CEA is dollars per life-year: multiplying the $5,000 per extra month of life by 12, the new drug requires spending $60,000 per year of life saved.
Dollars per life-year is a standard metric when interventions do not affect quality of life, and thus facilitates comparison across studies. In the melanoma example, dollars per life-year is a reasonable summary measure, although using this metric requires the reader to accept the validity of the longitudinal component of the CEA model. Dollars per additional cancer detected is another possible summary measure—it cannot necessarily be generalized to comparisons with noncancer interventions but is specifically relevant to the diagnosis of melanoma, and also can be estimated without recourse to a longitudinal model. Especially when interventions involve a trade-off between quantity and quality of life, dollars per QALY is another standard summary measure. In general,
29 A CRITICAL READER’S GUIDE TO COST-EFFECTIVENESS ANALYSIS
interventions costing $50,000 to $100,000 per QALY are considered a reasonable value for the money, so our hypothetical pancreatic cancer drug is near the high end of what is generally considered acceptable. Dollars per QALY is not always an appropriate summary measure. In some cases, quality of life is a relevant construct, but cannot be reliably assessed using utilities. In other cases (e.g., when all that matters is the trade-off of cost versus survival), quality of life does not come into consideration at all. If QALYs cannot or should not be estimated, the analyst is under no requirement to do so. Both dollars per life-year and dollars per QALY are examples of incremental cost-effectiveness ratios. The ratio comes from the presence of a numerator (always dollars—thus yielding bucks for the bang) and a denominator. The incremental comes from the fact that both the numerator and denominator are derived from differences between the management strategies being compared.

How Sensitive are the Conclusions to Changes in the Critical Assumptions?

Although more complex approaches can easily be envisioned, the simplest form of sensitivity analysis involves replacing each of the critical inputs (in turn) with the high and low values from its plausible range. Table 29.3 illustrates how this process operates. A more complete sensitivity analysis might include the likelihood of extreme fatigue and nausea as well.
In any event, a sensitivity analysis (a) determines whether the main conclusions are robust to changes in model inputs, and (b) identifies which of the model inputs has the largest impact on the conclusions, thus facilitating interpretation of the results.
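A one-way sensitivity analysis of this kind is easy to automate. The sketch below uses the pancreatic cancer example's hypothetical base-case inputs and recomputes the cost-effectiveness ratios as each input is varied in turn, reproducing the scenarios of Table 29.3:

```python
# One-way sensitivity analysis for the pancreatic cancer example.
# Base-case inputs are the chapter's hypothetical values.
def ce_ratios(months, cost, utility):
    """Return ($ per life-year, $ per QALY) for a given scenario."""
    per_life_year = cost / (months / 12)
    per_qaly = cost / (months * utility / 12)
    return round(per_life_year), round(per_qaly)

base = {"months": 1, "cost": 5_000, "utility": 0.5}
print("base case:", ce_ratios(**base))

# Replace each critical input, in turn, with a low and a high value.
ranges = {"months": (0.5, 1.5), "cost": (2_000, 8_000), "utility": (0.4, 0.6)}
for name, (low, high) in ranges.items():
    for value in (low, high):
        print(f"{name} = {value}:", ce_ratios(**{**base, name: value}))
```

Because each scenario changes only one input, the output makes it immediately visible which input drives the conclusions.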
PALLIATIVE CARE EXAMPLE

A managed care organization is considering an integrated palliative care program for patients with terminal cancer. A randomized trial suggested that, for a typical cancer patient with 12 months to live, the various additional meetings involved a total of 2 days of staff time. The result of the intervention was that patients had significantly better pain management before they entered a hospice, and that they entered a hospice a month earlier than patients receiving usual care. Before entering hospice, intervention patients had an extra $2,000 of regular pain medications prescribed, but $6,000 less in rescue medications. The cost of hospice was $1,000 per day, but the savings from less intensive medical care was $900 per day. Intervention patients spent 10% of their time in moderate-to-high pain, in comparison with 40% of the time for usual care patients. Being in a moderate-to-high level of pain was assigned a utility of 0.4, whereas being in low pain was assigned a utility of 0.8. The utility for the intervention group was therefore (.1)(.4) + (.9)(.8) = .76, whereas the utility for the control group was (.4)(.4) + (.6)(.8) = .64.
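The time-weighted utility calculation above can be sketched as follows (hypothetical values from the example):

```python
# Time-weighted utilities from the palliative care example.
U_MOD_HIGH_PAIN = 0.4   # utility while in moderate-to-high pain
U_LOW_PAIN = 0.8        # utility while in low pain

def mean_utility(frac_in_pain):
    """Average utility given the fraction of time in moderate-to-high pain."""
    return frac_in_pain * U_MOD_HIGH_PAIN + (1 - frac_in_pain) * U_LOW_PAIN

u_intervention = mean_utility(0.10)  # intervention patients: 10% of time in pain
u_control = mean_utility(0.40)       # usual-care patients: 40% of time in pain
print(round(u_intervention, 2), round(u_control, 2))
print(round(u_intervention - u_control, 2))  # QALY gain over the final year
```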
TABLE 29.3
Base Case and Sensitivity Analysis.
                          SURVIVAL      COST          UTILITY OF A MONTH      $ PER        $ PER QUALITY-
                          DIFFERENCE    DIFFERENCE    WITH EXTREME FATIGUE    LIFE-YEAR    ADJUSTED
                          (MONTHS)      ($)           AND NAUSEA                           LIFE-YEAR (QALY)
Base case                 1             5,000         0.5                     60,000       120,000
Increase survival         1.5           5,000         0.5                     40,000       80,000
Decrease survival         0.5           5,000         0.5                     120,000      240,000
Increase treatment cost   1             8,000         0.5                     96,000       192,000
Decrease treatment cost   1             2,000         0.5                     24,000       48,000
Increase utility          1             5,000         0.6                     60,000       100,000
Decrease utility          1             5,000         0.4                     60,000       150,000
The results are summarized in Table 29.4. The managed care organization is not insisting that the palliative care program be financially profitable, only that it be a good value for the money.

Does the CEA Model Reflect the Most Important Trade-Offs?

The managed care organization wants to explore the trade-off between the economic costs of the program and the improvement in patient-related quality of life. The economic costs of the program are divided into extra program-related expenditures plus any extra utilization induced by the program. The economic benefits of the program are any utilization-related savings. The quality-of-life benefit is assumed to be derived from improved pain management. This seems to cover the most important trade-offs.

Does the CEA Include All the Outcomes of Interest to the Relevant Stakeholders?

Yes. The managed care organization is primarily interested in the overall cost of implementing the program, including those costs that result from changes in utilization patterns that the program induces. Patients are primarily interested in quality of life. The intervention has no impact on survival.

Is the Time Frame Appropriate?

This CEA appropriately uses a 1-year time frame.

Are All the Critical Inputs Estimated Sufficiently Well?

For simplicity, we assumed that the cost to roll out the program is minimal, and that the primary cost of the program is based upon the time required for various interdisciplinary conferences, prehospice planning,
education of the patient and family members, and associated follow-up. All of these elements could be estimated in the associated randomized trial. Assuming that there is no problem in staffing and that management is satisfied with an approximate accounting of costs, the macro-costing approach used here seems satisfactory. Regarding quality of life, what seems most salient is the level of pain, which was assessed during the trial. The assignment of utilities based on level of pain is a bit more speculative, although a literature on this topic does exist and could be referenced if desired.

How Should Outcomes be Summarized?

Based on the information in Table 29.4, the trade-off is that an extra $1,000 per patient buys an additional 3 months in low-to-moderate pain during the last year of life. In QALY terms, this represents an extra 0.12 quality-adjusted life-years per patient, or $8,333 per QALY. Indeed, this figure falls so far below the cutoff of $50,000 per QALY as to provide reassurance that any imprecision in assigning utilities is unlikely to affect the conclusions of the analysis. Table 29.4 illustrates that the economic component of the trade-off is that although the costs directly associated with the intervention are moderate ($2,000), the intervention also induces additional costs associated with pain management and end-of-life planning ($2,000 in regularly prescribed pain medications and $30,000 in hospice costs). However, most of these costs are offset by a reduction in emergency-related costs (i.e., rescue medications, emergency department visits) and the lower intensity of care associated with hospice. If the managed care organization (i.e., the stakeholder) were not responsible for the entire set of costs, the implications of this CEA might be very different. For example, if hospice costs, medication costs, usual staff costs, and emergency department costs were
TABLE 29.4
Palliative Care Trial Results (Per Patient).

DIMENSION                                                   INTERVENTION GROUP (VS. CONTROL)
Prescribed pain medications                                 $2,000 more
Rescue medications and emergency visits                     $6,000 less
Hospice costs (30 days at $1,000/day)                       $30,000 more
Hospice savings (30 days at $900/day)                       $27,000 less
Intervention-related staff time (2 days at $1,000/day)      $2,000 more
Percent time patients in moderate-to-high pain              30% less
Utilities (in quality-adjusted life-years)                  .12 more
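The per-patient tallies in Table 29.4 can be combined into an incremental cost-effectiveness ratio as follows (a sketch; the 0.12 QALY gain comes from the utility calculation in the text):

```python
# Net per-patient cost and the resulting cost-effectiveness ratio,
# using the values reported in Table 29.4 (intervention vs. control).
cost_deltas = {
    "prescribed pain medications": +2_000,
    "rescue medications and emergency visits": -6_000,
    "hospice costs (30 days at $1,000/day)": +30_000,
    "hospice savings (30 days at $900/day)": -27_000,
    "intervention-related staff time": +2_000,
}
net_cost = sum(cost_deltas.values())  # extra dollars per patient

qaly_gain = 0.12                      # utility difference over the final year
icer = net_cost / qaly_gain           # dollars per QALY
print(net_cost, round(icer))
```

Listing each cost component separately, as here, also makes it easy to rerun the calculation from the perspective of a stakeholder responsible for only some of the budgets.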
on different budgets, this intervention would benefit some components of the organization at the expense of others.
How Sensitive are the Conclusions to Changes in the Critical Assumptions?

The conclusions are not sensitive to assumptions about quality of life, but are more sensitive to the cost offsets associated with the intervention. Depending on how costs flow within the managed care organization, the components of cost might or might not be relevant to the analysis. Applying the societal perspective and assuming that it doesn’t matter who pays the various components of cost could potentially be misleading; indeed, this problem is one of the main criticisms of this default approach to CEA.

DISCUSSION

Our main purpose in writing this chapter has been to demystify CEA by providing a structured method of deconstruction and critical review. In our experience, this structured approach works well for the vast majority of well-written analyses, and serves to highlight the gaps in reporting for those analyses that are not as well presented. This structured approach is not limited to analyses of cancer-related interventions.

References
1. Ramsey S, Willke R, Briggs A, et al. Good research practices for cost-effectiveness analyses alongside clinical trials: the ISPOR RCT-CEA Task Force report. Value Health. 2005;8(5):521–533.
2. Drummond MF, Sculpher MJ, Torrance GW, O’Brien BJ, Stoddart GL. Methods for the Economic Evaluation of Health Care Programmes. 3rd ed. New York: Oxford University Press USA; 2005.
3. Bonastre J, Chevalier J, Elias D, et al. Cost-effectiveness of intraperitoneal chemohyperthermia in the treatment of peritoneal carcinomatosis from colorectal cancer. Value Health. 2008;11(3):347–353.
4. Cordony A, LeReun C, Smale A, Symanowski JT, Watkins J. Cost-effectiveness of pemetrexed plus cisplatin: malignant pleural mesothelioma treatment in UK clinical practice. Value Health. 2008;11(1):4–12.
5. Eldar-Lissai A, Cosler LE, Culakova E, Lyman GH. Economic analysis of prophylactic pegfilgrastim in adult cancer patients receiving chemotherapy. Value Health. 2008;11(2):172–179.
6. Thompson D, Taylor DCA, Montoya EL, Winer EP, Jones SE, Weinstein MC. Cost-effectiveness of switching to exemestane after 2 to 3 years of therapy with tamoxifen in postmenopausal women with early-stage breast cancer. Value Health. 2007;10(5):367–376.
30
Systematic Review and Meta-Analysis
Steven M. Brunelli Angela DeMichele James Guevara Jesse A. Berlin
When evaluating the efficacy of an intervention, the highest level of evidence comes from large, rigorous, multicenter randomized controlled trials (RCTs). In oncology, as with all medical specialties, new therapeutic or preventive approaches may be adopted without RCT evidence, or more likely, without confirmatory RCTs, given issues of cost and feasibility. If RCTs are conducted, they are often too small to provide definitive results. When larger definitive studies are conducted (e.g., by cooperative groups), results may take many years to mature, delaying dissemination of information that could be important to making individual clinical decisions. Oncologists must therefore often rely on the cumulative evidence of smaller trials or observational studies to guide practice. Meta-analyses of RCTs can provide more definitive evidence of effects by taking advantage of the large numbers of patients obtained through combining numerous trials. The demands of clinical practice often preclude the oncologist from comprehensively surveying the literature, and synthesizing findings across trials or observational studies. Instead, the oncologist may seek counsel from summary papers; these papers tend to take the form of either classical (aka narrative) or systematic reviews. Narrative reviews are commonplace in the literature, but have several key limitations. Foremost among these is that narrative reviews need not (and generally do not) consider all of the available evidence on a topic. Often, authors of narrative reviews
will start with an opinion, and selectively cite literature to support this position. In addition, narrative reviews do not yield a quantitative estimate of effect (i.e., a hazard or odds ratio), but instead offer general conclusions about whether or not the intervention is helpful. Systematic review and meta-analysis are tools that have been developed to overcome many of the limitations of narrative reviews. Formally speaking, systematic review is the process by which all evidence on a particular topic is identified and extracted, and meta-analysis is the mathematical process by which the results of these studies are pooled into a single estimate (1, 2). For simplicity, we will refer to the combined process of systematic review and meta-analysis as meta-analysis, as is commonplace in the literature. This should not be inferred to mean that systematic review is somehow less important than meta-analysis; indeed it is critical. The primary benefits of meta-analysis are that it: (1) explores differences in methods and patient populations among studies that may contribute to different findings from those studies; (2) provides a more precise effect estimate (i.e., a narrower confidence interval) than the individual component studies; (3) offers an ability to assess whether bias in the publication of results exists; and (4) identifies gaps in knowledge and suggests directions for future research (2). Successful meta-analysis requires the same methodological rigor as an RCT or observational
study, if not more. Any laxity can lead to erroneous conclusions, which can be particularly misleading given the increased precision of estimates and the high credence given to meta-analyses by readers. In this chapter, we review the procedures by which meta-analysis should be conducted and by which data should be analyzed. Our intent is to provide a conceptual framework for the reader as to how best to perform and/or critically analyze meta-analyses.
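As a preview of the pooling step discussed under Methods, the simplest fixed-effect (inverse-variance) pooled estimate is a weighted average of study-level log effect estimates. The sketch below uses invented hazard ratios and standard errors purely for illustration:

```python
import math

# Fixed-effect (inverse-variance) pooling of study-level hazard ratios.
# The (HR, standard error of log HR) pairs are invented for illustration.
studies = [(0.80, 0.20), (0.70, 0.15), (0.90, 0.25)]

weights = [1 / se**2 for _, se in studies]
pooled_log_hr = sum(
    w * math.log(hr) for (hr, _), w in zip(studies, weights)
) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

pooled_hr = math.exp(pooled_log_hr)
ci_low = math.exp(pooled_log_hr - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_hr + 1.96 * pooled_se)
print(f"pooled HR {pooled_hr:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
# The pooled standard error is smaller than that of any single study,
# which is the "narrower confidence interval" benefit described above.
```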
METHODS

Key to performing a successful (meaningful and accurate) meta-analysis is formulation of, and adherence to, a well-designed and detailed protocol. This protocol should be in place a priori, and serves to ensure a consistent approach throughout the meta-analytic process. The remainder of this section examines considerations important in each step of this process. In order to make these concepts tangible, we will intermittently refer to published examples of meta-analyses in breast cancer. Breast cancer is a potentially curable disease when limited to the breast and ipsilateral axilla. Following surgical resection and lymph node sampling, adjuvant therapy with chemotherapy and/or endocrine therapy is often administered to improve outcome (recurrence, survival) in subsets of patients. Many clinical trials have addressed the efficacy of adjuvant chemo/endocrine therapeutic strategies. Two meta-analyses of these treatments will be considered in this chapter to illustrate the concepts presented. The first, performed by the Early Breast Cancer Trialists’ Cooperative Group (EBCTCG), examined the effects of tamoxifen as adjuvant endocrine therapy (hereafter referred to as the tamoxifen meta-analysis) (3). Patients with estrogen receptor (ER) positive tumors are eligible for endocrine therapy with one of several agents, including tamoxifen, an aromatase inhibitor, or a GnRH agonist. Tamoxifen has been studied extensively, and the EBCTCG reported their most recent meta-analysis of these trials in 2005, showing that tamoxifen significantly decreased mortality from breast cancer in patients with ER-positive tumors. The second example we will consider is one performed by investigators at the National Cancer Research Institute in Genoa, Italy, examining the effects of anthracycline-based adjuvant chemotherapy. Her2/neu is another important breast tumor receptor, and its overexpression leads to increased growth rates and more aggressive tumors (4).
In their meta-analysis (hereafter referred to as the Her2/anthracycline meta-analysis), the authors explored whether Her2/neu
tumor status influenced the therapeutic efficacy of adjuvant anthracycline therapy.

Step 1: Posing the Research Question

As with all good science, a good meta-analysis begins with a good research question. This question must be of appropriate scope, and must be answerable based on existing studies. Questions that are too broad will not yield meaningful answers. For example, the question “Does tamoxifen cure breast cancer?” is inappropriately broad, given that the answer likely varies according to the patient population and clinical scenario studied. Conversely, questions that are too narrow may not bear clinical relevance (e.g., “Does tamoxifen prevent recurrence of breast cancer in patients under age 40 with 2 cm breast tumors and 3 positive axillary lymph nodes treated with dose-dense doxorubicin, cyclophosphamide, and paclitaxel?”). Moreover, the question must be answerable on the basis of existing studies; clinical questions for which no data exist will lead to futile meta-analyses, and will require redefinition of the question. In our breast cancer examples, investigators sought to answer specific questions about adjuvant therapy that would guide clinical practice. The tamoxifen meta-analysis was designed to answer three specific questions about the effects of tamoxifen on recurrence and survival: (1) How does 5 years of tamoxifen therapy compare to none? (2) How does 1 to 2 years of tamoxifen compare to none? (3) How does 5 years of tamoxifen compare to 1 to 2 years (3)? The Her2/anthracycline meta-analysis was designed to answer somewhat more specific questions: (1) Do anthracycline-based adjuvant chemotherapy regimens provide greater benefit than non–anthracycline-based regimens? (2) Does the benefit of anthracycline-based adjuvant chemotherapy differ between those whose tumors are Her2 positive compared to those whose tumors are Her2 negative (4)?
These specific questions reflect both the clinical questions relevant at the time and the availability of RCT data sufficient to answer them.

Step 2: Identifying Studies

The meta-analytic protocol should document the strategy by which available studies will be identified. The goal at this stage is to be as comprehensive as possible (examination of each study’s appropriateness for inclusion will come later). Missed studies can potentially threaten the validity of meta-analytic findings, resulting in bias. Thus an exhaustive approach needs to be taken. Careful consideration is needed, though, so as not to define a search that is too broad and contains
too many clearly irrelevant citations (e.g., it’s usually helpful to exclude animal studies). Often, the strategy begins with a comprehensive search of one or more electronic databases (e.g., MEDLINE or EMBASE). The decision as to which database(s) and which years to search should be determined by clinical relevance (e.g., the year the first tamoxifen trial was published). In addition to determining which databases to search, the oncologist must also determine the search terms that will be used (e.g., breast cancer, invasive carcinoma, tamoxifen, etc.). On both accounts, the assistance of a medical librarian can be valuable in planning and conducting this search. Because this search is critically important in evaluating the ultimate success of the meta-analysis, authors should include a detailed description of the search strategy used. In the Her2/anthracycline meta-analysis, the authors very clearly laid out their search strategy in the text of the article: We searched for relevant articles (1) by performing computer aided literature searches of the MEDLINE and CANCERLIT databases. . . . For database searches, the following strategy was used: “Breast Neoplasms” . . . AND “Chemotherapy, Adjuvant” . . . AND (“Anthracyclines” . . . OR “Anthracyclines/therapeutic use” . . . ) AND (“Receptor, erbB-2” . . . OR “Genes, erbB-2”) (4). Searches based entirely on databases will be insufficient in many instances. Each database includes only a subset of available journals, and thus studies may be missed. In addition, these databases often do not contain information on unpublished studies that may be important in the overall analysis (e.g., recently completed trials). Therefore, database searches are often supplemented with interrogation of abstracts from relevant scientific meetings, bibliographies of identified studies or narrative reviews on the topic, and clinical trial registries (e.g., clinicaltrials.gov).
In addition, it can be of value to contact prominent researchers in the field directly to identify additional studies. Again, it cannot be overemphasized that the goal at this stage is to identify all available evidence. For example, in the tamoxifen meta-analysis, the EBCTCG took exhaustive steps to find every study of tamoxifen in nonmetastatic breast cancer worldwide, and was in direct contact with trial investigators, including them in the process in a collaborative way, so as to optimize the collection and accuracy of data included in the meta-analysis (3). Provision of this detail enables the reader to determine the thoroughness with which the search was
performed, and anticipate bias that may result from missing data sources.

Step 3: Determining which Studies to Include

After all studies have been identified, the oncologist must determine which studies to include. This should be done according to a set of prespecified inclusion and exclusion criteria (5). The first topic that bears consideration is which types of studies should be considered (e.g., RCTs, phase II studies, and observational studies). Meta-analysis was initially used in medical research for application to RCTs, and is generally considered most valid in this setting. Recently, methods have been developed to enable inclusion of phase II studies (which may be uncontrolled, or may use a crossover design) and observational investigations (6, 7). Inclusion of either crossover or observational studies adds methodological complexity, and it is advised that one enlist the aid of a seasoned meta-analyst if inclusion of these types of studies is anticipated. Second, studies should be included only if they enroll the population of interest. These considerations should be guided by the research question at hand. In our example, “Does tamoxifen improve survival in patients with nonmetastatic breast cancer?” inclusion of trials that considered patients with carcinoma in situ or metastatic disease would be inappropriate. Third, studies should be included only if the outcome(s) of interest are reported by the study. In oncologic meta-analysis, outcomes considered often include tumor response and survival (relapse-free, progression-free, or overall), among others. However, mixing outcomes is not advisable. Thus, for our tamoxifen example, it would be inappropriate to include studies measuring only disease-free survival (DFS) or relapse-free survival (RFS) when overall survival (OS) is the endpoint we seek to assess. In addition, the oncologist must determine how each study will be reviewed for inclusion.
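Prespecified inclusion and exclusion criteria of this kind amount to a uniform filter applied to every candidate study. A toy sketch, in which the study records and field names are invented:

```python
# Toy sketch: prespecified inclusion criteria applied uniformly to candidate
# studies. The study records and field names below are invented.
candidates = [
    {"id": "A", "design": "RCT", "population": "nonmetastatic", "reports_os": True},
    {"id": "B", "design": "RCT", "population": "metastatic", "reports_os": True},
    {"id": "C", "design": "observational", "population": "nonmetastatic", "reports_os": True},
    {"id": "D", "design": "RCT", "population": "nonmetastatic", "reports_os": False},
]

def meets_criteria(study):
    """RCT design, the population of interest, and the outcome of interest reported."""
    return (study["design"] == "RCT"
            and study["population"] == "nonmetastatic"
            and study["reports_os"])

included = [s["id"] for s in candidates if meets_criteria(s)]
print(included)  # only study A satisfies all three criteria
```

Encoding the criteria once and applying them to every record mirrors the consistency that a written protocol is meant to guarantee.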
Ideally, this would take place by a full text review of each identified study. However, such an approach is inefficient, and will result in disadvantageous delays in completing the meta-analysis. Often authors take a three-step approach. The first step is examination of the title. Titles including terms such as “induces murine mammary tumorigenesis” will not bear direct relevance to clinical questions, and can therefore be excluded. The second step is examination of the abstract. Abstracts often contain useful information on the study design, patient population under consideration, and primary results of the study, and can therefore be useful in excluding studies that are not appropriate for the
ultimate meta-analysis. However, for each of these two steps, meta-analysts should be liberal, and allow continued consideration of any study not obviously violating inclusion criteria so as not to exclude key data. Finally, for all studies advancing past title and abstract review, full-text articles should be examined. Application of strict inclusion and exclusion criteria should occur via a standardized process. Often, a data extraction instrument is useful in this regard. The specifics of any data extraction instrument may need to be tailored to the individual meta-analysis, but generic templates are available for use or adaptation (8). In addition, duplicate screening (e.g., parallel consideration by two or more persons who are initially blinded to each other’s decisions) is useful to ensure that criteria are applied consistently, and is reassuring to the reader in this regard. When duplicate screening is used, a prespecified process for adjudication of discordant findings should be in place. Meta-analyses should contain a flow diagram for study inclusion, which demonstrates the number of papers and their aggregate fates. Neither the tamoxifen meta-analysis nor the Her2/anthracycline meta-analysis provides such a flow diagram. Instead, we present as an example a flow diagram taken from the QUOROM guidelines for reporting meta-analyses of RCTs (Figure 30.1) (9).

Step 4: Extracting Relevant Data

Once all constituent studies have been assembled, the oncologist must extract relevant data from each study. Creation and use of data extraction tools is encouraged, as is duplicate data abstraction. The extracted data should include descriptors of the study (e.g., study design, sample size, use of blinding), the population under study (e.g., demographics, markers of comorbid disease), and the outcomes of interest. The oncologist should determine which data types (e.g., hazard ratios, relative risks), time points (e.g., 1-year vs.
5-year survival rates), and sources (e.g., those from tables vs. figures) will be used preferentially when multiple types of results exist, as occasionally results from tables and text are not the same. Extracted data are often presented in one or more tables. Examination of data in this manner enables the meta-analyst and reader to determine differences that exist among studies that may result in heterogeneity of findings. In addition, this process often demonstrates gaps in the existing literature, and is valuable to suggest directions for future research. In the Her2/anthracycline meta-analysis, extracted data were presented in two tables (4). The first table (Table 30.1) contains information on the study design and methods used in the constituent studies. The second table (Table 30.2)
Literature search
    Databases: PubMed, EMBASE, and the Cochrane Library
    Meeting abstracts: UEGW, DDW, and International Workshop of the European Helicobacter Study Group
    Limits: English-language articles only
Search results combined (n = 115)
Articles screened on basis of title and abstract
    Excluded (n = 88)
        Not first-line eradication therapy: 44
        Different regimen: 25
        Helicobacter pylori status incorrectly evaluated: 10
        Not recommended dose: 6
        Multiple publications: 3
    Included (n = 27)
Manuscript review and application of inclusion criteria
    Excluded (n = 6)
        Not first-line eradication therapy: 2
        Helicobacter pylori status incorrectly evaluated: 2
        Results provided only on per-protocol basis: 1
        Not recommended dose: 1
    Included (n = 21)
Final comparisons: 7-d vs. 14-d therapy (n = 10); 7-d vs. 10-d vs. 14-d therapy (n = 10); 7-d vs. 10-d therapy (n = 8)

FIGURE 30.1

Example of flow diagram for meta-analysis (9).
reports a synopsis of the findings from each study, as well as further detail on the methods used.

Step 5: Pooling the Data

One of the principal benefits of meta-analysis is its ability to create a pooled effect estimate, a weighted mean, that is typically more precise than that provided by any single constituent study. There are two general approaches to obtaining this pooled estimate: (1) pooling the overall (aggregate) effect estimates from the constituent studies, and (2) pooling individual subject data from the constituent studies. (A more detailed discussion of each is provided later in the chapter.) While individual subject data are preferable, they are often not available, and so pooling of effect estimates is the more common approach. When pooling aggregate estimates, it is important that the estimates be on the same scale. For example, it would be inappropriate to pool hazard ratios (HRs)
30 SYSTEMATIC REVIEW AND META-ANALYSIS
TABLE 30.1
Example of Tabular Summary of Study Factors (4). Table 1. Characteristics of the Eight Studies Included in the Pooled Analysis of Anthracycline-Based Adjuvant Chemotherapy vs Non–Anthracycline-Based Adjuvant Chemotherapy for Early Breast Cancer*

Study Name or Location (Reference) | No. of Randomly Assigned Patients | Study Arms | Patient Inclusion Criteria
NSABP B11 (10) | 682 | PF vs PAF | Node positive, ER negative, and/or PgR negative
NSABP B15 (11) | 2295 | CMF vs AC | Node positive and age <49 y if PgR positive, or node positive and age ≤59 y if PgR negative
GUN 3 (12) | 220 | CMF vs CMF/EV | Node positive, stage II breast cancer, and premenopausal; or node positive, postmenopausal, and ER negative; or stage III, either pre- or postmenopausal
Belgium (8) | 777 | CMF vs HEC/EC | Node positive, pre- or postmenopausal
Milan, Italy (13) | 552 | CMF vs CMF→A | Node positive (1–3 nodes)
DBCG 89D (9) | 980 | CMF vs FEC | Node negative, premenopausal, and grade II–III; or node positive, premenopausal, and hormone receptor negative/unknown; or node positive or tumor size >5 cm, postmenopausal, and hormone receptor negative
GOIRC (14) | 348 | CMF vs E | Node negative and hormone receptor negative, or node positive and pre- or postmenopausal
NCIC MA5 (15) | 710 | CMF vs CEF | Node positive and premenopausal

*NSABP = National Surgical Adjuvant Breast and Bowel Project; PF = L-phenylalanine mustard and 5-fluorouracil; PAF = L-phenylalanine mustard, 5-fluorouracil, and doxorubicin; ER = estrogen receptor; PgR = progesterone receptor; CMF = cyclophosphamide, methotrexate, and 5-fluorouracil; AC = doxorubicin, cyclophosphamide; GUN = Gruppo Universitario Napoletano; EV = epirubicin, vinblastine; HEC = high-dose epirubicin, cyclophosphamide; EC = epirubicin, cyclophosphamide; CMF→A = CMF followed by doxorubicin; DBCG = Danish Breast Cancer Cooperative Group; FEC = 5-fluorouracil, epirubicin, and cyclophosphamide; GOIRC = Gruppo Oncologico Italiano di Ricerca Clinica; E = epirubicin; NCIC = National Cancer Institute of Canada; CEF = cyclophosphamide, epirubicin, and 5-fluorouracil.
from some studies with risk differences from other studies. In addition, it may be inadvisable to pool unadjusted estimates from some studies with adjusted estimates from others. The choice of which measures to pool is often dictated by the data presented in the individual studies. In some (but not all) instances, an opportunity exists to transform one effect estimate into another (such as transforming count data into odds ratios) (10, 11). Once a scale has been selected, pooling may begin. Essentially, pooling is the calculation of a weighted average of the effect estimates from the individual studies. Weighting is applied such that more precise results (i.e., those with narrower confidence intervals, typically from larger studies) contribute more weight than results that are less precise. Most statistical packages offer software that is either explicitly designed for meta-analytic pooling or can be adapted to meta-analysis.
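The weighted-average pooling described above can be made concrete with a short sketch. The following Python fragment is illustrative only (it is not the software used in the published analyses): it performs inverse-variance fixed-effect pooling of hazard ratios on the log scale, with each study's standard error back-calculated from its 95% confidence interval. The inputs are the six HER2-positive HRs plotted in Figure 30.2.

```python
import math

def pool_fixed_effect(estimates, ci_lows, ci_highs):
    """Inverse-variance (fixed-effect) pooling of ratio estimates (e.g., HRs).

    Estimates and 95% CI bounds are moved to the log scale, where each
    study's weight is 1/variance; the pooled log estimate is the weighted
    mean, which is then exponentiated back to the ratio scale.
    """
    log_est = [math.log(e) for e in estimates]
    # Recover each study's standard error from its 95% CI width on the log scale.
    se = [(math.log(hi) - math.log(lo)) / (2 * 1.96)
          for lo, hi in zip(ci_lows, ci_highs)]
    w = [1 / s ** 2 for s in se]  # precision weights
    pooled_log = sum(wi * yi for wi, yi in zip(w, log_est)) / sum(w)
    pooled_se = math.sqrt(1 / sum(w))
    return (math.exp(pooled_log),
            math.exp(pooled_log - 1.96 * pooled_se),
            math.exp(pooled_log + 1.96 * pooled_se))

# HER2-positive arms of the six trials in Figure 30.2 (HR, 95% CI bounds)
hrs = [0.60, 0.84, 0.65, 0.83, 0.75, 0.52]
lows = [0.44, 0.65, 0.34, 0.46, 0.53, 0.34]
highs = [0.82, 1.08, 1.26, 1.49, 1.06, 0.80]
hr, lo, hi = pool_fixed_effect(hrs, lows, highs)
print(f"Pooled HR {hr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Run as written, this closely reproduces the reported HER2-positive pooled estimate of 0.71 (95% CI, 0.61 to 0.83), small rounding differences aside.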
There are two common approaches used for estimating these weights: fixed effects or random effects. The exact mathematical difference between these two methods is beyond the scope of this chapter, and readers are referred to excellent reviews on this topic (12, 13). In general (but not always), random effects estimates are more conservative (i.e., require greater evidence to infer a true difference and produce wider confidence intervals), and are therefore always appropriate to use, at least for efficacy estimation. Fixed effects pooling is less conservative, and is appropriate when significant heterogeneity of results is not present (see below). In these instances, fixed effects estimates are more precise, and may be favored. This increased precision might be advisable when studying safety questions. There is no consensus among meta-analysts as to which model is preferable, so an analytic
TABLE 30.2
Example of Tabular Summary of Results of Constituent Studies (4). Table 2. Assessment of HER2 Status in the Eight Studies Included in the Meta-Analysis*

Study Name or Location, Reference | No. of Patients Assessed/No. of Patients Randomly Assigned (%) | No. of Patients HER2 Positive/No. of Patients Assessed (%) | HER2 Assay Method, Type of Assay (Antibody) | Scoring Criteria for HER2 Positivity | Central Testing
NSABP B11 (10) | 638/682 (94%) | 239/638 (37%) | IHC (cocktail) | Fishnet appearance† | Yes
NSABP B15 (11) | 2,034/2,295 (89%) | 599/2,034 (29%) | IHC (cocktail) | Fishnet appearance† | Yes
GUN 3 (12) | 123/220 (56%) | 30/123 (24%) | IHC (MAB1) | Not reported | Not reported
Belgium (8) | 354/777 (46%) | 73/354 (21%) | FISH | Staining intensity ratio >2 | Yes
Milan, Italy (13) | 506/552 (92%) | 95/506 (19%) | IHC (CB11) | Strong membrane labeling | Yes
DBCG 89D (9) | 805/980 (82%) | 246/805 (31%) | HercepTest and FISH if 2+‡ | 3+ staining, intensity ratio >2§ | Yes
GOIRC (14) | 266/348 (76%) | 91/266 (34%) | IHC (CB11) | Strong membrane labeling in >50% stained cells | Yes
NCIC MA5 (15) | 628/710 (88%) | 163/628 (26%) | FISH | Staining intensity ratio >2 | Yes

* NSABP = National Surgical Adjuvant Breast and Bowel Project; IHC = immunohistochemistry; GUN = Gruppo Universitario Napoletano; MAB 1 = monoclonal antibody 1; FISH = fluorescence in situ hybridization; CB11 = monoclonal antibody CB11; DBCG = Danish Breast Cancer Cooperative Group; GOIRC = Gruppo Oncologico Italiano di Ricerca Clinica; NCIC = National Cancer Institute of Canada.
† Positive if any tumor cell showed definite membrane staining.
‡ All HER2 (2+) positive tumors underwent complementary FISH to investigate HER2 gene amplification.
§ All HER2 (3+) positive tumors, and HER2 (2+) positive tumors with HER2 amplification investigated by FISH (ratio ≥2), were considered to be HER2 positive.
approach that employs both models and arrives at similar conclusions may provide more robust results. If the two approaches disagree, further exploration and interpretation of the data may be needed. Results of meta-analysis are usually depicted by a forest plot. The forest plot is a graphical representation of the effect estimates of each constituent study and the overall pooled estimate. Each constituent study is represented by a line along the y-axis. The effect estimates are then plotted as either dots or squares along the x-axis, with whiskers extending to the low and high bounds of the confidence intervals (studies that yield more precise estimates will have narrower whiskers than those that yield less precise estimates). Often, the area of the square depicting the effect estimate is proportional to the study size. Typically, a vertical line is drawn to indicate the null value (1 for HR, odds ratio, relative risk; 0 for risk difference), and some indication is given to guide the reader which direction indicates favorable and unfavorable performance of the intervention. Finally, the bottom line of the forest plot usually contains a diamond that is centered at the pooled estimate and extends horizontally to encompass the lower and upper confidence bounds. Figure 30.2 shows a typical forest plot, taken from the Her2/anthracycline meta-analysis (4). This figure lists the results from the six trials by Her2/neu receptor status for which data were available to assess DFS. The
overall pooled estimate indicates a statistically significant 10% reduction in risk, which was accounted for exclusively by a 29% reduction in risk in the Her2/neu receptor positive tumor group. There was no effect on survival in the Her2/neu receptor negative tumor group. Earlier in the chapter, we noted that there are two types of data that can be pooled: aggregate data and individual patient-level data. Much of the preceding text has considered aggregate data, because meta-analyses based on published studies will generally be limited to this type of data. In contrast, analyses based on individual patient-level data can be performed in situations in which those data are available. Below, we contrast these two methods and discuss the relative merits and drawbacks of each.

Aggregate Data

The choice to use aggregate-level as opposed to patient-level data is often one of convenience. Published reports will rarely (if ever) contain individual patient-level data, and thus using aggregate-level data is the only method applicable to meta-analysis of published reports. The principal benefit of using aggregate-level data is increased efficiency (i.e., it is much less time intensive and costly than acquiring and using individual patient-level data). However, as we will see, use of aggregate-level data has several significant drawbacks,
FIGURE 30.2 Example of forest plot from the Her2/anthracycline meta-analysis (4). [The plot reports, for each study, one HR for Her2-positive (+) and one for Her2-negative (–) patients, on an x-axis running from 0.00 to 2.00; HRs <1 favor anthracyclines and HRs >1 favor non-anthracyclines. The plotted values are:]

Study | HER2+ HR (95% CI) | HER2– HR (95% CI)
NSABP B11 | 0.60 (0.44 to 0.82) | 0.96 (0.75 to 1.23)
NSABP B15 | 0.84 (0.65 to 1.08) | 1.02 (0.86 to 1.20)
Belgium | 0.65 (0.34 to 1.26) | 1.35 (0.93 to 1.97)
Milan | 0.83 (0.46 to 1.49) | 1.22 (0.91 to 1.64)
DBCG 89D | 0.75 (0.53 to 1.06) | 0.79 (0.60 to 1.05)
NCIC MA5 | 0.52 (0.34 to 0.80) | 0.91 (0.71 to 1.17)
Pooled (HER2 specific) | 0.71 (0.61 to 0.83) | 1.00 (0.90 to 1.11)
Pooled (overall) | 0.90 (0.82 to 0.98) |

HER2 positive heterogeneity: χ2 = 5.2, df = 5, P = .38. HER2 negative heterogeneity: χ2 = 7.6, df = 5, P = .18. Test for interaction: χ2 = 13.7, P < .001. Note: Because the authors were interested in effect modification on the basis of Her2 status, two HRs (one for Her2+ and one for Her2– patients) are reported for each study, and three pooled estimates are presented.
ONCOLOGY CLINICAL TRIALS
and thus oncologists should at least consider obtaining and analyzing individual patient-level data when available. Aggregate-level data may be reported in several forms, including odds ratios, relative risks, HRs, or risk differences. Regardless of the measure used, it is necessary to capture both the point estimate and some measure of precision (typically either a confidence interval or a standard error is reported). It should be noted that for all relative measures (e.g., odds ratios, relative risks, HRs), pooling is often done on the log scale (i.e., pooling the log odds ratios), and pooled estimates are then converted back to the initial scale (e.g., a pooled odds ratio). Most statistical packages are able to transform data automatically, but care must be exercised to ensure that such transformations are used. In the oncology literature, odds ratios and relative risks can be particularly problematic as compared with HRs based on time-to-event (survival) analyses. The reason is that individual studies may consider odds ratios according to different time horizons. For example, one study may report the odds ratio for mortality at 2 years, and another study at 5 years. In general, it can be inappropriate to pool these discordant measures. In such instances, four choices are possible. The first is to selectively use HRs. To some degree (particularly when time horizons are reasonably similar), HRs take into account differences in length of follow-up. However, not all studies report HRs, which can make this option unworkable. The second choice is to obtain and analyze individual patient-level data, which enables a more formal time-to-event analysis (see below). A third, and often the default, approach is to assume (usually without basis) that the odds ratio is independent of the choice of time horizon. A fourth option is to stratify results according to time horizon, and subsequently pool all studies only if similar estimates are seen in each stratum.
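When only event counts are available, they can first be moved onto the log scale. A minimal Python sketch, with purely hypothetical counts, of computing a log odds ratio and its standard error via Woolf's formula:

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its Woolf standard error from a 2x2 table:
    a/b = events/non-events (treatment), c/d = events/non-events (control).
    SE = sqrt(1/a + 1/b + 1/c + 1/d)."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Hypothetical trial: 40 of 160 treated patients and 60 of 200 controls had events.
log_or, se = log_odds_ratio(40, 120, 60, 140)
ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
print(f"OR {math.exp(log_or):.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The resulting log odds ratios and standard errors can then be combined by inverse-variance weighting and exponentiated back to an odds ratio, exactly as with log hazard ratios.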
Another limitation of aggregate-level analysis is the inability to adjust for potential confounders. This is particularly the case when unadjusted estimates are available and used (particularly from nonrandomized studies). Even in instances when adjusted estimates are available for all studies, and therefore used, different studies will often adjust for varying combinations of potential confounders, thus compromising validity. As we will show, the ability to adjust for potential confounders is an advantage when individual patient-level data are analyzed.

Patient-Level Data

Analysis of individual patient-level data circumvents some (but not all) of the limitations of aggregate-level
data meta-analysis, although it may introduce different challenges. Obtaining such data is often difficult and time consuming, and is subject to the willingness of the investigators of the individual trials to contribute data. However, when such obstacles can be surmounted, it can be a particularly useful tool. It will often be extremely helpful to engage the owners of the data in the performance of the patient-level analysis. As the people who know the data and the clinical issues best, the original investigators can make important contributions. After individual patient-level data have been obtained, these data must be examined for quality and completeness. In addition, great care must be taken to ensure consistency of data across studies (e.g., definitions of variables). This is particularly important with categorical variables, such as tumor stage. For example, some studies may categorize stage as disseminated (stage IIIb through IV) versus localized, whereas other studies may categorize in terms of individual stages. Similar consideration must be given to outcome measures. Analysis of patient-level data allows for examination of important clinical or demographic subgroups. For example, access to individual patient-level data would enable subgroup analysis of ER positive versus ER negative breast carcinomas, whereas having only aggregate-level data would not (unless each study reported separate results for ER positive and negative tumors). Because patient-level data permit disaggregation of subgroups, analysis of patient-level data is less subject to differences in effects that may stem from differences in the patient populations under study, when those populations cannot be separated within all studies. To some degree, individual patient-level data will enable the oncologist to overcome differences in definitions among studies.
For example, if one study reported the odds ratio for mortality at 5 years and the other at 2 years, access to individual patient-level data would allow determination of the odds ratio at 2 years for all patients, and at 5 years (if these data were available, but not reported, in the latter study). In addition, individual patient-level data enable the performance of time-to-event analyses, such as the application of Kaplan-Meier methods, whereas aggregate analysis does not easily accommodate time-to-event data. Individual patient-level analysis may be less subject to publication bias, given that data from ongoing studies (which have not been reported) or completed studies (which have not been written up or accepted for publication) can be included. However, publication bias remains somewhat of a concern, depending on the ability to identify all unpublished studies. Bias could also arise in patient-level analyses if the willingness of
investigators to contribute patient-level data depends on the results of the studies. Loss of precision may also result if not all studies contribute patient-level data and the analysis is restricted to those studies that do. The tamoxifen meta-analysis is an excellent example of the use of patient-level data, stratifying on relevant factors including nodal status, age, and year of follow-up (3). These data were used to generate a log-rank statistic (the observed minus the expected [O−E] number in the treatment group who had the relevant event) and its variance (V) for each trial; these were then summed to give an overall O−E and V. Thus, women in each trial were compared only to other women in the same trial, and not to women in other trials, who might differ in subtle ways. Weighting was used to account for differences in study size or for effects that differ between earlier and later points in follow-up. Figure 30.3 (Figure 7 in the example) shows the forest plots for the effects of tamoxifen versus no tamoxifen on recurrence and mortality, by ER status and treatment duration, in the EBCTCG tamoxifen meta-analysis (3). Because of the large number of trials included in this analysis, this plot shows only the overall results of the pooled data rather than the results of each study. However, the figure is useful in illustrating the methodology, including black squares plotting the ratios (with area proportional to the sample size of the group) and 99% confidence intervals. As can be seen, the pooled estimates from the ER positive patients showed a significant reduction in recurrence and mortality associated with tamoxifen that improved over time. There was no benefit observed among the ER negative patients.

Shared Limitations

Several limitations are shared by both types of analysis. As noted above, both are subject to potential publication bias. In aggregate analysis, this derives from reliance on published reports, given that smaller, negative trials often go unpublished.
In patient-level analyses, this can arise from obstacles related to identifying unpublished studies or from unwillingness of investigators to provide data. We will discuss means to assess for publication bias later in the chapter. Heterogeneity also represents a shared limitation. Heterogeneity may result from differences in patient populations, measurements, or study methods. As discussed below, patient-level analysis enables some correction for differences in patient population and measurements among studies, by allowing separate analyses of the relevant subgroups. However, neither method can account, inherently, for significant methodological differences, such as among-study difference in
intensity or duration of therapy or allocation concealment (blinding).

Step 6: Assessing Heterogeneity

Heterogeneity of results among studies can result either from chance alone or from systematic differences in the studies themselves. The first step in interpreting the forest plot is to determine whether estimates tend to be similar or heterogeneous across studies (5, 14). This is accomplished via visual inspection (i.e., are all or most point estimates on the same side of the null and of similar magnitude?) and through statistical procedures. Two statistics are typically used to assess whether significant heterogeneity exists: the Q statistic and the I2 statistic. The Q statistic yields a p value, which, if <0.10, indicates statistically significant heterogeneity. Note that the p-value threshold of 0.10 is used, by convention, because of the low statistical power of the test of heterogeneity. The I2 is a more direct measure that indicates how much of the observed variability is explained by between-study variability, relative to overall variability: values <30% indicate relatively little systematic heterogeneity; values >50% represent substantial heterogeneity. When a moderate amount of heterogeneity exists, pooling should be done and interpreted cautiously. When marked heterogeneity exists, some would argue that pooling should not be done, or if done, should at least be interpreted very cautiously. The average of a collection of disparate effects from studies of heterogeneous design, with heterogeneous populations, may not apply to any particular population. Whether an overall summary is provided or not, the oncologist should go back and reexamine the constituent studies for potential sources of heterogeneity (e.g., differences in studies' methods, patient populations, etc.) (5, 14). Two commonly used methods of doing so are subgroup (or subset) analysis and meta-regression.
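The Q and I2 statistics, and the DerSimonian-Laird random-effects weighting they motivate, can be sketched in a few lines. This is an illustrative Python fragment, not the published analysis; the log HRs and standard errors are back-calculated from the HER2-positive confidence intervals in Figure 30.2.

```python
import math

def heterogeneity_and_pool(log_ests, ses):
    """Cochran's Q, the I^2 statistic (%), the DerSimonian-Laird
    between-study variance tau^2, and the random-effects pooled ratio."""
    w = [1 / s ** 2 for s in ses]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * y for wi, y in zip(w, log_ests)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_ests))
    df = len(log_ests) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # DerSimonian-Laird estimate of the between-study variance tau^2
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = [1 / (s ** 2 + tau2) for s in ses]  # random-effects weights
    mu_re = sum(wi * y for wi, y in zip(w_re, log_ests)) / sum(w_re)
    return q, i2, tau2, math.exp(mu_re)

# HER2-positive log HRs and CI-derived standard errors from Figure 30.2
log_hrs = [math.log(h) for h in (0.60, 0.84, 0.65, 0.83, 0.75, 0.52)]
ses = [0.159, 0.130, 0.334, 0.300, 0.177, 0.218]
q, i2, tau2, hr_re = heterogeneity_and_pool(log_hrs, ses)
print(f"Q = {q:.1f} (df = 5), I2 = {i2:.0f}%, tau2 = {tau2:.4f}, pooled HR = {hr_re:.2f}")
```

The Q value here lands close to the χ2 = 5.2 (df = 5) reported in Figure 30.2; with so little between-study variability (I2 near 5%), the random-effects pooled HR is nearly identical to the fixed-effect value of about 0.71, illustrating why the two models tend to agree when heterogeneity is low.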
Subgroup Analysis

Subgroup analyses, as the name implies, are those in which consideration is limited to studies exhibiting certain characteristics. These may be characteristics of the study (e.g., only RCTs) or of the patient population considered (e.g., only those enrolling women over the age of 50). In instances where particular causes of heterogeneity are anticipated, these analyses should be specified a priori, and thus serve as corroborative evidence. For example, tamoxifen, as a selective estrogen receptor modulator (SERM), acts upon ERs on breast tumor cells. Therefore, one might anticipate, a priori,
FIGURE 30.3 Example of patient-level meta-analysis (3). [The figure, Figure 7 of the EBCTCG overview, shows paired forest plots of recurrence (events per woman-year) and breast cancer mortality (deaths among women randomized) for tamoxifen versus control, with separate panels for 1–2 years and approximately 5 years of tamoxifen, each subdivided by ER status (ER-poor, ER+, and ER unknown). For each category the plot tabulates the log-rank O−E, its variance, and the ratio of annual event (or death) rates with 95% or 99% confidence intervals. The rate ratios are substantially below 1.0 in ER+ disease (subtotal 2p < 0.00001) but close to 1.0 in ER-poor disease, and the heterogeneity between the effects of different tamoxifen durations in ER+ disease is significant for both recurrence and mortality.]
that studies of patients with only ER positive tumors might differ from studies that included all patients, regardless of ER status. Figure 30.3 (above) shows the separate estimates of treatment effect in the EBCTCG overview for subgroups of women defined by ER status and duration of tamoxifen therapy. Of particular note is that (in general) the benefits of tamoxifen are clear in ER positive patients and absent in ER negative patients. In instances where no preexisting hypothesis exists, these analyses can still be performed but should be considered exploratory (i.e., suggesting areas for future investigation). When the data permit, subgroup analyses can involve considering subgroups of patients within individual studies. For example, if many or all of the constituent studies report individual effects among women with ER positive and/or ER negative tumors, results from these subgroups can be combined using standard meta-analytic techniques. In addition, hybrid approaches are possible (e.g., combining studies that enrolled only ER positive women with the ER positive data from studies enrolling a broader population). One chief limitation of subgroup analyses is that they will be less powerful (and therefore less precise) than the overall meta-analysis.

Meta-Regression

Meta-regression is an analytic technique by which the effect estimates for the constituent studies are regressed against variables that serve as potential sources of heterogeneity (15). (For relative measures, such as odds ratios, the values are plotted on a log scale.) If a horizontal line is observed, it suggests that the variable in question is not responsible for the observed heterogeneity. If the line has a positive or negative slope, it suggests that the variable is associated with the study findings. Formal statistical techniques exist for quantifying the degree of association between the variable and the results, and for measuring the degree of statistical significance.
As with subgroup analysis, meta-regression is often underpowered. Thus, authors may consider using more liberal cutoffs for statistical significance (e.g., p < 0.10).
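At its simplest, a fixed-effect meta-regression reduces to a weighted least-squares fit of the log effect estimates on a study-level covariate, with weights 1/SE2. The sketch below (Python) uses entirely hypothetical data, four studies with median ages and log HRs invented for illustration; full implementations (e.g., random-effects meta-regression) additionally model residual between-study heterogeneity.

```python
import math

def meta_regression_slope(x, y, se):
    """Weighted least-squares slope of log effect (y) on a study-level
    covariate (x), with each study weighted by 1/SE^2."""
    w = [1 / s ** 2 for s in se]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, math.sqrt(1 / sxx)  # slope and its standard error

# Hypothetical studies: median age (covariate) vs. log HR, made-up SEs.
ages = [45, 50, 55, 60]
log_hrs = [-0.05, 0.00, 0.05, 0.10]
ses = [0.20, 0.15, 0.25, 0.30]
slope, se_slope = meta_regression_slope(ages, log_hrs, ses)
print(f"slope = {slope:.3f} per year (SE {se_slope:.3f})")
```

A slope near zero (a horizontal line) would suggest the covariate does not explain the observed heterogeneity; here the positive slope of 0.01 per year of age would suggest an association, subject to the power caveats noted above.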
Step 7: Assessing the Quality of Constituent Studies

It cannot be overemphasized that the quality of a meta-analysis is absolutely contingent upon the quality of its constituent studies. This realization generates two important questions. First, operationally, how does one assess study quality? Second, how can a meta-analysis examine the effects of constituent study quality on overall results? Fortunately, several scoring systems have been created for this purpose (16–18). Most relate to clinical trials, but some do exist for observational studies as well. Most have not yet been validated. In general, these scoring systems assign points for various aspects of study design (e.g., allocation concealment), with studies of higher quality yielding higher scores. There are at least three ways by which quality scores have been applied to meta-analysis in the past (19, 20). The first is to use the score as part of the weighting process. In this case, higher quality studies count more in the pooled analysis than lower quality studies. The second is to use the score as a stratification factor. In this case, studies of similar quality are pooled with one another, with a stratum-specific estimate of treatment effect obtained for each level of study quality. If estimates differ across strata, it may suggest that study quality influences the observed association between treatment and outcome. If estimates are similar across strata, it suggests that there is no influence of study quality on the observed association, and authors can then either pool the stratum-specific estimates (using Mantel-Haenszel methods) or go back and pool all constituent studies using standard meta-analytic techniques. The third is to use the quality score as an exclusion criterion, considering only studies of a certain quality or higher. There is no consensus on which method to apply. In fact, there is debate about the merits of such scoring systems. The study factors considered vary across scoring systems, as do the points allocated for individual factors, leading some to question the arbitrary nature of the scales. Some scales focus more specifically on the potential for bias in the studies, whereas others include measures of the quality of the reporting of the studies. One paper (21) showed that use of different scoring systems can lead to opposite conclusions regarding the association between study quality and study findings. For some scoring systems, the effect of treatment appeared smaller for the so-called high-quality studies; for others, the treatment effects were larger for the so-called high-quality studies. For this reason, some authors choose to consider individual factors (e.g., allocation concealment) when scoring studies.

Step 8: Assessing for Publication Bias
Publication bias is a potential threat to the validity of any meta-analysis. Publication bias arises when certain studies (e.g., those that demonstrate significant benefit) are published, whereas others (e.g., those that demonstrate no benefit) are not (22, 23). The principal safeguard against publication bias is an exhaustive search for all existing evidence on the topic at hand. However, even the most complete search may miss (typically negative) unpublished studies.
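Of the statistical asymmetry checks mentioned later in this section, Egger's weighted regression is among the simplest to sketch: regress each study's standardized effect (estimate/SE) on its precision (1/SE); an intercept far from zero suggests funnel-plot asymmetry. The following is an illustrative Python fragment with hypothetical inputs, not a validated implementation (which would also test the intercept formally against its standard error):

```python
def egger_intercept(effects, ses):
    """Egger asymmetry sketch: ordinary least-squares regression of the
    standardized effect (effect/SE) on precision (1/SE); returns the
    intercept, whose distance from zero suggests funnel-plot asymmetry."""
    t = [e / s for e, s in zip(effects, ses)]  # standardized effects
    p = [1 / s for s in ses]                   # precisions
    n = len(t)
    pbar, tbar = sum(p) / n, sum(t) / n
    slope = (sum((pi - pbar) * (ti - tbar) for pi, ti in zip(p, t))
             / sum((pi - pbar) ** 2 for pi in p))
    return tbar - slope * pbar

# Hypothetical, perfectly symmetric evidence base: every study estimates
# the same log risk ratio, so the intercept should be essentially zero.
intercept = egger_intercept([-0.3] * 5, [0.1, 0.2, 0.3, 0.4, 0.5])
print(f"Egger intercept = {intercept:.4f}")
```

With selective non-publication of small negative studies, the small-study (low-precision) points would be shifted systematically, pulling the intercept away from zero.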
Several methods have been developed to analyze for the presence of publication bias (24, 25). The commonest of these is the funnel plot (Figure 30.4) (26). The funnel plot is a graphical tool by which the effect estimate for each study (on the x-axis) is plotted against the precision of each study (typically a function of the variance, or standard error; on the y-axis). It is expected that the point estimates from more precise studies (those higher on the y-axis) will center narrowly about the true estimate. Less precise studies will, on average, center about this value, but any individual study will vary more widely. The resulting graph thus takes on an inverse conical (or funnel) shape. As previously mentioned, publication bias typically arises when small (i.e., imprecise) studies that fail to achieve statistical significance are selectively not published relative to either small positive studies or large studies. Were this the case, one could detect it by an absence of one of the lower corners of the curve (typically on the side indicative of no benefit), termed funnel plot asymmetry. Visual inspection of funnel plots may not always be sensitive to publication bias, particularly when few studies are available for meta-analysis. Therefore, statistical methods have been developed to assess for the presence of possible publication bias, such as the Begg rank correlation and Egger weighted regression methods (27). Discussion of these is beyond the scope of this chapter. Some authors have noted that the funnel plot, and the related statistical methods, indicate only asymmetry and are not specific as to the cause of that asymmetry. For example, in a setting in which the small studies have been conducted with very tight control, encouraging very strict adherence to therapy, but the large studies are less tightly controlled and poor adherence is prevalent, funnel plot asymmetry could arise even though it would not be a product of publication bias.

FIGURE 30.4 Example of a (hypothetical) funnel plot (26). [Sample size is plotted on the y-axis (log scale, 10 to 100,000) against risk ratio on the x-axis (0.1 to 10), with regions marked as favoring treatment and favoring control; one apparent outlier study falls outside the funnel.]

When publication bias is evident based on any of the methods listed above, caution is needed in reporting and interpreting meta-analytic results. Even so, several analytic techniques exist that can help correct for the effects of publication bias. First, the oncologist can limit consideration to studies exceeding some minimum criterion for precision (e.g., all studies falling higher on the y-axis of the funnel plot than the area of asymmetry). In addition, trim-and-fill techniques may be used to estimate the effects that would have been seen in the missing studies had they been observed. Both of these techniques are controversial.

CONCLUSIONS
Meta-analysis, applied correctly, is a powerful tool. Successful meta-analysis can identify gaps in existing knowledge, suggest avenues for future research, render a precise (relative to individual studies) pooled effect estimate, and spur hypotheses. Successful meta-analysis is dependent on methodological rigor, and it is incumbent upon the meta-analyst and reader to ensure that such rigor is applied.
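As a practical illustration (not part of the chapter itself), the Egger weighted regression idea mentioned above can be sketched in a few lines of Python: the standardized effect of each study (effect divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and the intercept is tested against zero. The function name and the trial data in the test are hypothetical; this is a minimal sketch, not a validated implementation.

```python
import numpy as np
from scipy import stats

def egger_test(effects, std_errors):
    """Egger's regression test for funnel plot asymmetry (sketch).

    Regress the standardized effect (effect / SE) of each study on its
    precision (1 / SE); an intercept that differs significantly from zero
    suggests funnel plot asymmetry. Returns (intercept, two-sided p-value).
    """
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    z = effects / se                              # standardized effect estimates
    precision = 1.0 / se
    n = len(z)
    X = np.column_stack([np.ones(n), precision])  # design matrix: intercept, slope
    beta, _, _, _ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    t_stat = beta[0] / np.sqrt(cov[0, 0])         # t-test on the intercept
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)
    return float(beta[0]), float(p_value)
```

Plotting each study's effect estimate (x-axis) against its precision (y-axis), for example with matplotlib, reproduces the funnel shape described in the chapter; a small p-value from this test flags asymmetry, although, as noted, asymmetry is not specific to publication bias.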
References
1. DerSimonian R, Laird N. Meta-analysis in clinical trials. Cont Clin Trials. 1986;7:177–188.
2. Berlin JA, Longnecker MP, Greenland S. Meta-analysis of epidemiologic dose-response data. Epidemiology. 1993;4:218–228.
3. Early Breast Cancer Trialists' Collaborative Group (EBCTCG). Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomised trials. Lancet. 2005;365:1687–1717.
4. Gennari A, Sormani MP, Pronzato P, et al. J Natl Cancer Inst. 2008;100:14–20.
5. Egger M, Smith GD, Phillips AN. Meta-analysis: principles and procedures. BMJ. 1997;315:1533–1537.
6. Curtin F, Elbourne D, Altman DG. Meta-analysis combining parallel and cross-over clinical trials. II: binary outcomes. Stat Med. 2002;21:2145–2159.
7. Becker MP, Balagtas CC. Marginal modelling of binary crossover data. Biometrics. 1993;49:997–1009.
8. Khan KS, Kleijnen J. Stage II, phase 6: Data extraction and monitoring progress. In: Undertaking systematic reviews of research on effectiveness: CRD's guidance for those carrying out or commissioning reviews. CRD Report 4 (2nd ed.), March 2001. http://www.york.ac.uk/inst/crd/report4.htm
9. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF, for the QUOROM Group. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement. Lancet. 1999;354:1896–1900.
10. Suissa S. Binary methods for continuous outcomes: a parametric alternative. J Clin Epidemiol. 1991;44:241–248.
11. Whitehead A, Bailey AJ, Elbourne D. Combining summaries of binary outcomes with those of continuous outcomes in a meta-analysis. J Biopharm Stat. 1999;9:1–16.
30 SYSTEMATIC REVIEW AND META-ANALYSIS
12. Berlin JA, Laird NM, Sacks HS, Chalmers TC. A comparison of statistical methods for combining event rates from clinical trials. Stat Med. 1989;8:141–151.
13. DerSimonian R, Laird N. Meta-analysis in clinical trials. Cont Clin Trials. 1986;7:177–188.
14. Bailey KR. Inter-study differences: how should they influence the interpretation and analysis of results? Stat Med. 1987;6:351–358.
15. Normand S-LT. Tutorial in biostatistics. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med. 1999;18:321–359.
16. Chalmers TC, Smith H Jr, Blackburn B, et al. A method for assessing the quality of a randomized control trial. Cont Clin Trials. 1981;2:31–49.
17. Mahon WA, Daniel EE. A method for the assessment of reports of drug trials. Can Med Assoc J. 1964;90:565–569.
18. Verhagen AP, de Vet HCW, de Bie RA, et al. The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol. 1998;51:1235–1241.
19. Detsky AS, Naylor CD, O'Rourke K, McGeer AJ, L'Abbe KA. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol. 1992;45:255–265.
20. Berard A, Bravo G. Combining studies using effect sizes and quality scores: application to bone loss in postmenopausal women. J Clin Epidemiol. 1998;51:801–807.
21. Juni P, Witschi A, Block R, Egger M. The hazards of scoring the quality of clinical trials: lessons for meta-analysis. JAMA. 1999;282:1054–1060.
22. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl Cancer Inst. 1989;81:107–115.
23. Chalmers TC. Problems induced by meta-analyses. Stat Med. 1991;10:971–979.
24. Sutton AJ, Duval S, Tweedie R, Abrams KR, Jones DR. Empirical assessment of effect of publication bias on meta-analyses. BMJ. 2000;320:1574–1577.
25. Sutton AJ, Duval S, Tweedie R, Abrams KR, Jones DR. Empirical assessment of effect of publication bias on meta-analyses. BMJ. 2000;320:1574–1577.
26. Hamon M, Lepage O, Malagutti P, et al. Diagnostic performance of 16- and 64-section spiral CT for coronary artery bypass graft assessment: meta-analysis. Radiology. 2008;247:679–686.
27. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–634.
31
Regulatory Affairs: The Investigator-Initiated Oncology Trial
Maria Mézes
Harvey M. Arbit
The purpose of this chapter is to introduce the junior oncology investigator to the basic regulatory framework governing Investigator-Initiated clinical trials and to provide a simple, practical reference guide to obtaining and maintaining the regulatory approvals required to conduct such research. Space permits only a brief overview highlighting the key topics; however, references provided throughout the text are accessible for more in-depth exploration. The section titled "Resources" at the end of this chapter lists documents and Web addresses for further study. While the majority of oncology trials initiated by individual investigators involve drugs and biologicals, there are certainly a number of device trials related to the treatment of cancer. Regulations specific to medical devices are addressed briefly.
REGULATORY FRAMEWORK

The U.S. Food and Drug Administration (FDA), an agency of the Department of Health and Human Services, regulates the pharmaceutical, medical device, and biological industries. Congress enacts laws, and federal executive departments and administrative agencies write regulations to implement the authority of those laws. Regulations are ancillary or subordinate to laws, but both laws and regulations are enforceable. The official codification of federal statutes is the U.S. Code. The
U.S. Code of Federal Regulations (CFR) is divided into titles numbered 1 to 50, each divided into chapters usually bearing the name of the issuing agency. Each chapter is further subdivided into parts covering specific regulatory areas. Title 21 of the CFR, titled "Food and Drugs," contains the parts applicable to clinical studies (see Table 31.1). In addition to regulations, the FDA publishes Guidance documents and Information Sheets representing the agency's current thinking on a number of topics. Even though these are not binding in the manner of regulations, it is strongly suggested that the investigator follow the recommendations contained within, which serve to clarify the regulations and provide additional detail. Many Guidance documents reference applicable regulations, which are enforceable. The FDA's Good Clinical Practice Program Web page, "Guidances, Information Sheets, and Important Notices on Good Clinical Practice in FDA-Regulated Clinical Trials" (accessible at http://www.fda.gov/oc/gcp/guidance.html), has a comprehensive list of current Guidance documents and Information Sheets.
REGULATORY DEFINITIONS

Before progressing further into the "how-to" of obtaining and maintaining regulatory approval to conduct clinical research, it is necessary to define key terms
ONCOLOGY CLINICAL TRIALS
TABLE 31.1
21 CFR Parts Pertaining to Clinical Studies.
Part 11: Electronic Records; Electronic Signatures
Part 50: Protection of Human Subjects
Part 54: Financial Disclosure by Clinical Investigators
Part 56: Institutional Review Boards
Part 58: Good Laboratory Practice
Part 211: Current Good Manufacturing Practice for Finished Pharmaceuticals
Part 312: Investigational New Drug Application
Part 600: Biological Products
Part 812: Investigational Device Exemption
Title 21 can be accessed online at http://www.access.gpo.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm
frequently used in clinical research. These terms are defined in the context of specific regulations, and occasionally the same term will have a slightly different definition from one regulation to another. A good example is the term "sponsor." The Guidance document "Financial Disclosure by Clinical Investigators" explains: Under 21 CFR 54.2, "sponsor" is defined as the party "supporting a particular study at the time it was carried out." FDA interprets support to include those who provide "material support," e.g., monetary support or the test product under study. The sponsor of an IND or IDE, as defined in 21 CFR 312 and 812, is the "party or parties who take responsibility for and initiate a clinical investigation." The term "sponsor" is also used in 312.53 and 812.43 to refer to the party who will be submitting a marketing application. . . . (1) Appendix A lists some commonly used regulatory terms and their definitions.
THE INVESTIGATIONAL NEW DRUG (IND) APPLICATION

An Investigational New Drug (IND) application, also known as a "Notice of Claimed Investigational Exemption for a New Drug," is required prior to beginning testing in humans of any drug or biological product not approved for sale in the United States. FDA's chief responsibility over an IND in all phases of research is to assure the rights and safety of patients; in phase II and phase III studies, the FDA has the additional responsibility of assuring that the quality of the science is adequate to permit evaluation of effectiveness and safety. Marketed drugs and biologicals used off-label in research may also require the submission of an IND application. Current federal law requires that a drug or biological product be the subject of an approved marketing application before it can be transported or distributed across state lines. The IND is the vehicle through which the sponsor obtains an exemption from this legal requirement. There are different types of INDs. Commercial INDs are applications submitted by drug companies; the data generated through these studies become part of a marketing application, called a New Drug Application (NDA), to gain approval for a new drug. In addition to Commercial INDs, there are INDs filed for noncommercial research: Treatment INDs, Emergency Use INDs, and Investigator-Initiated INDs. In the context of this chapter, the focus will be on the Investigator-Initiated IND, namely one that is sponsored by an individual investigator. In this case, the investigator, being the sponsor of the IND, assumes full regulatory responsibility. There is a provision in the regulations for an "exemption" from the requirement to have an IND in effect if certain criteria are met. The FDA publishes a comprehensive Guidance document, "Guidance for Industry—IND Exemptions for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer" (2), providing information on how the individual investigator and/or institutional review board (IRB) can make an IND-exemption determination.

It should be noted that even though the FDA makes this avenue available, the company supplying the drug for an Investigator-Initiated trial frequently requires, as a condition of supplying the drug, that the investigator prepare and file a complete IND and have the FDA make the assessment that the study is "IND-exempt." The FDA will issue a letter to that effect.

IND Exemption

Regulation 21 CFR 312.2(b)(1) provides for an exemption from filing an IND if the study meets all of the following five criteria:
· The study is not intended to support FDA approval of a new indication or a significant change in the product labeling.
APPENDIX A
Definitions (21 CFR 310.3, 21 CFR 312.3; FD&C Act Sect. 201, 201(h)).

Clinical Investigation: Any experiment in which a drug is administered or dispensed to (or used involving) one or more human subjects. For the purposes of this part, an experiment is any use of a drug except for the use of a marketed drug in the course of medical practice.

Device: An instrument, apparatus, implement, machine, contrivance, implant, in vitro reagent, or other similar or related article (including a component part) or accessory which is:
· recognized in the official National Formulary, or the United States Pharmacopoeia, or any supplement to them,
· intended for use in the diagnosis of disease or other conditions, or in the cure, mitigation, treatment, or prevention of disease in man or other animals, or
· intended to affect the structure or any function of the body of man or other animals,
and which does not achieve any of its primary intended purposes through chemical action within or on the body of man or other animals and which is not dependent upon being metabolized for the achievement of any of its primary intended purposes.

Drug: (A) Articles recognized in the official U.S. Pharmacopoeia, official Homoeopathic Pharmacopoeia of the United States, or official National Formulary (or any supplement to any of them); (B) articles intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease in man or other animals; and (C) articles (other than food) intended to affect the structure or any function of the body of man.

Good Clinical Practice (GCP): An international ethical and scientific quality standard for designing, conducting, recording, and reporting trials that involve the participation of human subjects. Compliance with this standard provides public assurance that the rights and well-being of trial subjects are protected, consistent with the principles that have their origin in the Declaration of Helsinki, and that the clinical trial data are credible.

IND: An investigational new drug application. IND is synonymous with "Notice of Claimed Investigational Exemption for a New Drug."

Investigational New Drug: A new drug or biological drug that is used in a clinical investigation. The term also includes a biological product that is used in vitro for diagnostic purposes. The terms investigational drug and investigational new drug are synonymous. Further, a new drug means a new substance, new combination, new proportions, new use, new dosage form, or new route of administration.

Investigator: An individual who initiates and actually conducts a clinical investigation (i.e., under whose immediate direction the drug is administered or dispensed to a subject). In the event an investigation is conducted by a team of individuals, the investigator is the responsible leader of the team. Subinvestigator includes any other individual member of that team.

Person: The term person includes an individual, partnership, corporation, and association.

Sponsor: A person who takes responsibility for and initiates a clinical investigation. The sponsor may be an individual, pharmaceutical company, governmental agency, academic institution, private organization, or other organization. The sponsor does not actually conduct the investigation unless the sponsor is a sponsor-investigator. A person other than an individual that uses one or more of its own employees to conduct an investigation that it has initiated is a sponsor, not a sponsor-investigator, and the employees are investigators. There are two kinds of sponsor: the regulatory sponsor (holds the IND or IDE) and the financial sponsor (provides funding and/or product).

Sponsor-Investigator: An individual who both initiates and conducts an investigation, and under whose immediate direction the investigational drug is administered or dispensed. The term does not include any person other than an individual. The requirements applicable to a sponsor-investigator under this part include both those applicable to an investigator and to a sponsor.

Subject: A human who participates in an investigation, either as a recipient of the investigational new drug or as a control. A subject may be a healthy human or a patient with a disease.
· The study is not intended to support a significant change in the advertising for the product.
· The investigation does not involve a route of administration, dosage level, use in a patient population, or other factor that significantly increases the risks (or decreases the acceptability of the risks) associated with the use of the drug product.
· The study is conducted in compliance with IRB and informed consent regulations (21 CFR 56 and 50).
· The study is conducted in compliance with 21 CFR 312.7 (no promotion of, or charging for, investigational drugs).
Appendix B (3) contains examples of oncology trials the FDA considers exempt, and Appendix C (4) includes examples of trials not considered exempt.
Preparing and Filing the Investigator-IND

The IND content and format are specified in the regulations at 21 CFR 312.23. Table 31.2 provides a list of IND contents. The entire IND must be paginated, with pages single-sided, except for Forms FDA 1571 and FDA 1572, and bound. When complete, the original and two identical copies are mailed to the FDA. Receipt of the initial IND will be acknowledged by the FDA. This acknowledgment indicates the date received, the IND number assigned, and the obligations of the IND sponsor. Additional submissions to the same IND (i.e., amendments) will not be acknowledged; therefore, when mailing anything to the FDA, it is essential that a reputable courier such as Federal Express be used and the receipt tracked for documentation. All subsequent communication with the FDA on a given IND must follow the same format. Helpful information for Sponsor-Investigators, including forms and mailing addresses for the IND, is available on the FDA Web site at http://www.fda.gov/CDER/forms/1571-1572-help.html. A helpful Guidance document is "Content and Format
APPENDIX B
Examples of Studies Considered to Be IND-Exempt (3).
· Single-arm, phase II trials using marketed drugs to treat a cancer different from that indicated in the approved labeling, and using doses and schedules similar to those in the labeling. An exception may exist when standard therapy in the population to be studied is very effective (e.g., is associated with a survival benefit); in that case, use of another regimen may expose patients to the risk of receiving an ineffective therapy, and an IND would be necessary.
· Phase I trials of a marketed drug, if such therapy is appropriate in patients with residual cancer and if there is no effective therapy. The investigator has to carefully evaluate toxicity prior to dose escalation.
· New combinations, if described in the professional medical literature and not posing a significant risk.
· New routes or schedules of administration not described in the approved labeling, if there is sufficient clinical experience described in the literature documenting their safety.
· Studies of high-dose therapy with regimens that appear to have an acceptable therapeutic ratio, or phase I studies with incremental changes from such regimens.
APPENDIX C
Examples of Studies Requiring an IND (not considered exempt) (4).
· Studies of cytotoxic drugs in patients for whom cytotoxic therapy would not be considered standard therapy and would require special justification. Any use of cytotoxic agents in nonmalignant disease (e.g., rheumatoid arthritis, multiple sclerosis) would, most likely, be considered to alter the acceptability of the risk of the agent.
· Studies of adjuvant chemotherapy:
  · If the population studied has a low risk of cancer after surgery, treatment with any new therapy may pose a significantly increased risk.
  · If standard adjuvant therapy is available and produces a survival benefit, substitution of a new therapy for standard therapy poses a significant risk that the new therapy will not produce the same survival benefit.
  · If adjuvant trials are properly designed, they usually will be able to demonstrate whether the new therapy is safe and effective, and such results may lead to an NDA or be used in the NDA.
· Studies involving substitution of a new agent of unproven activity in settings where standard therapy provides a cure or an increase in survival, such as first-line treatment of testicular cancer, ovarian cancer, breast cancer, leukemia, and lymphoma. In this case, the critical judgment is whether it is ethical to withhold standard therapy while testing the new drug.
· Initial studies of drugs intended to be chemosensitizers, radiosensitizers, or resistance modulators. Animal studies should be used to estimate the effect of the modulator on toxicity and to allow estimation of a safe starting dose.
TABLE 31.2
IND Content and Corresponding Notes.
1. Title page, Cover letter to the FDA, Form FDA 3674: All phases of oncology trials are to be registered on a publicly accessible database such as ClinicalTrials.gov (under the law, the FDA requires registration except for phase I, while the ICMJE requires registration as a condition of publication of the study). Once registered, the NCT number given to the trial is entered on Form FDA 3674.
2. Form FDA 1571: The initial IND is Serial Number 0000, and each subsequent submission (amendment, safety report, annual report, correspondence, etc.) is numbered consecutively.
3. Letter of Authorization: Copy of a letter sent by the drug company supplying the investigational agent authorizing the FDA to access proprietary information already on file in support of the Investigator's IND. This type of IND is commonly referred to as a cross-filed or cross-referenced IND.
4. Table of Contents.
5. Introductory Statement: Information intended to place the development plan for the drug into perspective and help the FDA anticipate sponsor needs.
6. General Investigational Plan: Describe the developmental plan, such as early studies, current studies, and future studies.
7. Investigator's Brochure and/or Package Insert: Use the current package insert for marketed drugs. An Investigator's Brochure is not required for an Investigator-Sponsored IND.
8. Clinical Protocol: A protocol format for oncology trials is available on the NCI Web site at www.cancer.gov.
9. Form FDA 1572: This is a legal contract with the FDA. Particular attention should be paid to "Box 9: Investigator Commitments." These commitments are legally binding and thus should be reviewed, thoroughly understood, and complied with.
10. Curricula Vitae: CVs of the Investigator (Box 1 on Form 1572) and Subinvestigators (Box 6 on Form 1572).
11. Chemistry, Manufacturing, and Controls: Summary information if submitting a Letter of Authorization; otherwise follow the Guidance for Industry document(s).
12. Pharmacology, Toxicology: Pharmacology and drug distribution information that describes the pharmacologic effects and mechanisms of action of the drug in animals, and information on the absorption, distribution, metabolism, and excretion of the drug; information that describes the toxicological effects of the drug in animals and in vitro. If this is an approved marketed drug, the package insert can be referenced.
13. Previous Human Experience: Present previous human experience in an integrated summary report. If this is an approved marketed drug, the package insert can be referenced.
14. Additional Information: Present other information that will support the safety of the investigational drug and the clinical protocol.
of Investigational New Drug Applications (INDs) for Phase I Studies of Drugs, including Well-Characterized, Therapeutic, Biotechnology-Derived Products" (5). Once the IND is received at the FDA, an IND number is assigned and a letter is sent to the IND sponsor acknowledging receipt of the application. The letter reminds the sponsor of his or her obligations, including that the study may not be initiated until 30 days after the date of receipt shown on the letter, unless otherwise notified by the FDA. This 30-day period is the safety review. FDA review of the IND may generate deficiencies and/or comments that must be responded to satisfactorily before the FDA issues a "study may proceed" communication. The 30-day date (or, alternately, the date the study is allowed to proceed) is the IND's
effective date. On the anniversary of the effective date, the annual report is to be filed within 60 days of that date. Content and format of the IND annual report and other submissions to the IND are covered in 21 CFR 312.30 (protocol amendments), 312.31 (information amendments), 312.32 (IND safety reports), and 312.33 (IND annual reports).
DEVICES

Table 31.3 provides some basic regulatory information and associated references for further study. An overview of device regulations and specific resources is available online through FDA's "Device Advice" at http://www.fda.gov/cdrh/devadvice/overview.html.
TABLE 31.3
Device Summary.

Types of Devices

Significant Risk Device (21 CFR 812.3(m)): An investigational device that:
· Is intended as an implant
· Supports or sustains life
· Is of substantial importance in diagnosing, curing, mitigating, or treating disease
· Presents potential for serious risk
IRB and FDA approval (IDE application) required.

Nonsignificant Risk (NSR) Device: An investigational device that does not meet the significant risk device requirements. IRB approval is required. An IDE application is not submitted to the FDA (an IDE is considered to be in effect); however, the abbreviated IDE requirements of 21 CFR 812.2(b) must be followed. The IRB acts as the surrogate overseer. (Nonsignificant risk is not to be confused with minimal risk as defined in 21 CFR 56.102(i): minimal risk means that the probability and magnitude of harm or discomfort anticipated in the research are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests.)

Custom Device (21 CFR 812.3(b)): A device that:
· Deviates from an approved device, to comply with an MD order
· Is not generally available to or used by MDs
· Is not generally available in finished form
· Is not offered for commercial distribution
AND that is specifically made for and intended for use by an individual patient named in the order by the MD, or meets the special needs of the MD in his/her professional practice.

IDE Application Contents (for detailed information, see 21 CFR 812):
1. Table of Contents
2. Report of prior investigations
3. Investigational plan
4. Risk analysis
5. Description of the device
6. Monitoring procedures
7. Manufacturing information
8. Investigator information
9. IRB information
10. Sales information
11. Environmental impact statement
12. Labeling
13. Informed consent materials
14. Other information

"Abbreviated" IDE Requirements (21 CFR 812.2(b)):
· Cannot be a Significant Risk device
· Informed consent must be obtained and documented (21 CFR 50)
· IRB approval
· Label in accordance with 21 CFR 812.5
· Sponsor complies with monitoring requirements (21 CFR 812.46)
· Maintains records and reports (21 CFR 812, subparts C and E)
· Complies with prohibitions (21 CFR 812.7)
IDE Exemption
A device investigation is exempt from the IDE requirements if the device is:
· Lawfully marketed in the United States before 5/28/76, or substantially equivalent to such a marketed device after 5/28/76, and used in accordance with the approved indications
· A diagnostic device, provided the sponsor complies with the labeling requirements ("For Research Use Only. Not for use in diagnostic procedures." 21 CFR 809.10(c)), and the testing is noninvasive, does not involve significant risk through invasive sampling, does not introduce energy into the subject, and is not considered diagnostic without confirmation
· A custom device, unless its use is for determining safety/effectiveness
· Used for consumer preference testing, unless the testing is for determining safety/effectiveness or puts subjects at risk
· For veterinary use only
· For research on or with laboratory animals and labeled "CAUTION—Device for investigational use in laboratory animals or other tests that do not involve human subjects." (21 CFR 812.5(c))
RESPONSIBILITIES OF THE SPONSOR-INVESTIGATOR

Sponsor-Investigator responsibilities for the conduct of drug and device clinical trials are quite similar. Table 31.4 lists Sponsor and Investigator responsibilities for drug and device trials. The regulations cited are accessible on the FDA Web site at http://www.access.gpo.gov/cgi-bin/cfrassemble.cgi?title=200521.
GOOD CLINICAL PRACTICE

Good Clinical Practice (GCP) is an international quality standard developed by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), a body that defines standards that regulatory authorities can then use to develop regulations for clinical trials involving human subjects. GCP is defined as "an international ethical and scientific quality standard for
TABLE 31.4
Responsibilities of Sponsors and Investigators in Drug and Device Trials (Title 21, Code of Federal Regulations, Parts).

Sponsor and Investigator Responsibilities (Drugs / Devices):
General Responsibilities of Sponsors: 312.50 / 812.40
General Responsibilities of Investigators: 312.60 / 812.100
Selecting Investigators and Monitors: 312.53 / 812.43
Informing Investigators: 312.55 / 812.45
Emergency Research: 312.54 / 812.47
Investigator Reports: 312.64 / 812.150
Specific Responsibilities of Investigators: – / 812.110
Disqualification of a Clinical Investigator: 312.70 / 812.119
Monitoring Investigations: 312.50 / 812.46
IRB Approval: 312.66 / 812.42
Inspection of Investigator's Records and Reports: 312.68 / 812.145
Transfer of Obligations to a Contract Research Organization: 312.52 / 812.3
Disposition of Unused Supply of Investigational Drug: 312.59 / 812.110
Investigator Control of the Investigational Drug: 312.61 / 812.140
Investigator Handling of Controlled Substances: 312.69 / –
Investigator Recordkeeping and Record Retention: 312.62 / 812.140
Sponsor Recordkeeping and Record Retention: 312.57 / 812.140
Sponsor Monitoring and Review of Ongoing Investigation: 312.56 / 812.46
designing, conducting, recording, and reporting trials that involve the participation of human subjects. Compliance with this standard provides public assurance that the rights, safety, and well-being of trial subjects are protected, consistent with the principles that have their origin in the Declaration of Helsinki, and the clinical trial data are credible" (6). The ICH document "Guidance for Industry—E6 Good Clinical Practice: Consolidated Guidance" was developed and recommended for adoption to the regulating bodies of the European Union, Japan, and the United States. It is the standard for conducting clinical drug trials, and every investigator should become thoroughly familiar with the contents of this document, available online at http://www.fda.gov/cder/guidance/959fnl.pdf.
RESOURCES
· The Elements of Success: Conducting Cancer Clinical Trials, A Guide. Available at www.c-changetogether.org/pubs/pubs/ClinicalTrialsSucess.pdf
· NCI Investigator's Handbook. Available at http://ctep.cancer.gov/handbook/index.html
· NCI/CTEP Resources. Available at http://ctep.cancer.gov
· Forms FDA 1571, 1572, 3674, and MedWatch 3500A in Word or fillable PDF format. Available at http://www.fda.gov/opacom/morechoices/fdaforms/fdaforms.html
References
1. Guidance: Financial Disclosure by Clinical Investigators. Retrieved August 21, 2008, from http://www.fda.gov/oc/guidance/financialdis.html
2. FDA Guidance for Industry: IND Exemption for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer. Revision 1 (January 2004). Retrieved August 21, 2008, from http://www.fda.gov/cder/guidance/6036fnl.doc
3. FDA Guidance for Industry: IND Exemption for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer. Revision 1 (January 2004). Retrieved August 21, 2008, from http://www.fda.gov/cder/guidance/6036fnl.doc
4. FDA Guidance for Industry: IND Exemption for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer. Revision 1 (January 2004). Retrieved August 21, 2008, from http://www.fda.gov/cder/guidance/6036fnl.doc
5. FDA Guidance for Industry: Content and Format of Investigational New Drug Applications (INDs) for Phase I Studies of Drugs, including Well-Characterized, Therapeutic, Biotechnology-Derived Products. Accessed October 30, 2009, from http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm071597.pdf
6. Guidance for Industry: E6 Good Clinical Practice: Consolidated Guidance. Retrieved August 18, 2008, from http://www.fda.gov/cder/guidance/959fnl.pdf
32
The Drug Evaluation Process in Oncology: FDA Perspective
Steven Lemery Patricia Keegan Richard Pazdur
The primary mission of the U.S. Food and Drug Administration (FDA) is to protect and advance the public health (1). FDA is responsible for ensuring that drugs marketed in the United States are both safe and effective. The Agency accomplishes this mission by reviewing clinical trials and data associated with a drug throughout the development process, from early clinical trials prior to drug approval through the drug's postmarketing experience. FDA advances the public health by meeting with sponsors throughout the drug development process to ensure adequate study designs and by ensuring that accurate information is disseminated to the public.
THE FDA ORGANIZATION
This section describes the parts of the FDA organization that are pertinent to the interests of hematologists and oncologists. The FDA contains nine centers/offices that regulate both food and drug products. The Center for Drug Evaluation and Research (CDER) regulates drugs and therapeutic biologic proteins, including monoclonal antibodies and cytokines. The Center for Biologics Evaluation and Research (CBER) regulates vaccines, gene therapy, blood component products, and cellular therapies, including stem cell products. The Center for Devices and Radiological Health (CDRH) regulates
devices, including test kits for the diagnosis of cancer and devices used to treat cancer. Each Center within the FDA contains multiple Offices. Because this chapter will focus primarily on the regulation of drugs, a more detailed description of the organization of CDER is provided.
Two Offices of primary interest to individuals involved in drug development are the Office of New Drugs (OND) and the Office of Surveillance and Epidemiology (OSE). OND reviews investigational new drug applications (INDs), applications for marketing approval of a drug or biologic, and postmarketing safety information. OSE evaluates drug risks and is charged with promoting the safe use of drugs by the American people. OND contains Offices of Drug Evaluation (ODEs) that focus on specific therapeutic areas. The Office of Oncology Drug Products (OODP) has responsibility for the review of drugs intended for the treatment of cancer. For the remainder of this chapter, all references to drugs include both human drugs and biological products unless otherwise specified.
The decision-making responsibility for the approval of new drugs (new molecular entities) has been delegated through the Secretary of Health and Human Services (HHS) to the Clinical Review Office level (e.g., OODP for oncology drugs), and the decision-making responsibility for expanded claims for marketed drugs has been delegated to the Clinical Review Division level.
TABLE 32.1
Basis for Regulatory Decisions.

BASIS FOR FDA DECISION MAKING                    LEGALLY BINDING
Statutes: Passed by Congress                     Yes
Regulations: Promulgated by the Agency           Yes
Case Law                                         Yes (unless decision overturned on appeal)
Guidance Documents: Promulgated by the Agency    No
FDA Precedent                                    No
FDA Internal Policy                              No
Advisory Committee Advice (2)                    No (advisory committee decisions are considered advice to FDA; FDA may accept or reject the advisory committee decision)
FDA DECISION MAKING
The legal foundation for FDA's authority is derived from statutes passed by Congress. For example, the 1962 amendments to the Federal Food, Drug, and Cosmetic Act require that FDA-approved drugs be both safe and effective. Table 32.1 describes various factors FDA relies upon when making decisions. Statutes and regulations have the force of law, and final court decisions can affect the interpretation and implementation of laws. The statutes authorize the Agency to write regulations that interpret the law. The regulations regarding the conduct of an IND can be found in the Code of Federal Regulations (CFR), specifically in 21 CFR 312. Prior to enactment, regulations must be published as proposed rules in the Federal Register and undergo a public notice and comment period. Interested parties have the opportunity to comment on a proposed rule, and FDA must address all comments prior to finalizing the rule. The process for creating a final rule that results in new regulations can be lengthy, in part because of the requirement for public notice and comment; however, addressing such comments generally leads to clearly stated and legally sound final regulations. Both the statutes and the regulations are legally binding for the regulated industry (generally pharmaceutical companies), clinical study investigators, and FDA. Additionally, FDA may publish guidance documents that describe FDA's current thinking on how to comply with the regulations. Guidance documents are not legally binding except where the guidance cites the
applicable statutes or regulations. FDA recommends that any IND sponsor, either a drug manufacturer or a sponsor-investigator, planning to deviate from FDA guidance consult the Agency.
FDA may hold public advisory committee (AC) meetings to obtain advice on a particular topic, such as recommendations on whether standards for approval of a New Drug Application (NDA) or a Biologics License Application (BLA) have been met, or on a subject of general interest, such as the acceptability of novel trial designs or nonstandard end points in a specific disease setting. The Food and Drug Administration Amendments Act (FDAAA) of 2007 requires that all new molecular entities (NMEs) either be presented to an AC for advice prior to a decision on FDA approval or that the Clinical Review Office provide written justification describing why such advice was not necessary when marketing approval is granted. AC membership typically includes medical subspecialty experts, a patient advocate for the disease or condition being discussed, a consumer representative, one or more statisticians, and a nonvoting industry representative; a standing AC may also be supplemented with expertise from other FDA advisory committees (e.g., the Drug Safety and Risk Management AC) or with special government employees (SGEs) or regular government employees (RGEs) identified by FDA for their particular expertise. In an AC meeting that is considering the potential approval of an NME, both FDA and the applicant (usually a pharmaceutical company) present their analyses and conclusions regarding clinical studies with the NME and other topics relevant for consideration (e.g., availability of other therapies). After the presentations and
an open public hearing to permit additional comment, the AC members discuss questions posed by the Agency, provide general advice, and may vote on issues such as a drug's safety and efficacy, the need for limitations on a drug's use, or whether risk management activities should be instituted. The Agency views these votes as advisory and nonbinding in formulating its ultimate regulatory decision.
Most Agency decisions in earlier stages of drug development (prior to marketing approval) are based on the scientific evaluation of clinical and other data submitted to FDA and on the judgment of the potential risks and benefits by the scientific review staff within the Clinical Review Division, which is charged with oversight of the drug development program under an IND, in collaboration with adjunctive Divisions (e.g., the Division of Clinical Pharmacology). For example, the regulations (legal basis) in 21 CFR 312 state that an IND may be placed on hold if there is an unreasonable or significant risk of injury or illness to the (study) subject. The FDA review staff must use clinical and scientific judgment to determine whether the proposed study or investigational new drug is reasonably safe to allow initiation or continuation of a clinical study with an investigational new drug.
IND
An IND is an exemption under the Interstate Commerce Act that allows for the interstate shipment of nonapproved drugs, also known as investigational drugs (3). Certain legal requirements apply to IND sponsors and clinical investigators when conducting a study under an IND. These include informed consent requirements contained in 21 CFR 50, institutional review board (IRB) requirements contained in 21 CFR 56, and specific IND requirements contained in 21 CFR 312.
What Is an Investigational Drug?
In order to determine whether an IND is required for a clinical trial, it must be determined whether an investigator plans to administer a drug to humans. A drug is legally defined as an article that is intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease or that is intended to affect the structure or any function of the body (4). An IND is required in the United States in order to administer an unapproved drug to a human (5). Vitamins may be considered drugs and may require an IND for instituting clinical trials in certain situations. If a substance is an oral vitamin and labeled as a dietary supplement, it is not a drug. However, if a
vitamin product will be studied for its use to treat or prevent cancer (or any disease), then the vitamin is considered an investigational drug and an IND would be required.
If an investigator intends to conduct a clinical trial using an FDA-approved drug, an IND may still be required. If the clinical investigation is being conducted outside of an IND, the drug must be obtained from the commercial manufacturer in the FDA-approved packaging and labeling. If an investigator uses a supply of the approved drug that was manufactured prior to marketing approval or is labeled for investigational use only, then an IND is required. Only those products that are lawfully marketed in the United States (i.e., commercially marketed products) qualify for IND exemption. A lawfully marketed product is defined as a drug that is identical to the FDA-approved product in terms of formulation, strength, appearance (including embossing, debossing, and over-encapsulation), packaging, labeling, content, and count. For an approved drug, an IND will not be required if the following conditions are met:
1. The clinical trial is not intended to be reported to FDA as a well-controlled study that might support a new indication or support any significant change in the labeling (6);
2. The trial is not intended to support new promotional claims (advertising) for the drug (6);
3. The trial does not involve a route of administration, dosage level, or new patient population (or other factor) that significantly increases the risks associated with the drug (6);
4. The clinical trial is conducted in compliance with IRB and informed consent regulations (6); and
5. The clinical trial is conducted in compliance with the requirements of 21 CFR 312.7 (this section of the regulations deals with the promotion of and charging for investigational drugs) (6).
Additional guidance may be found in the document entitled "Guidance for Industry: IND Exemptions for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer" (7).
FDA Responsibilities Regarding INDs
Major objectives of the FDA are to assure the safety and rights of subjects and to help assure that the quality of a scientific investigation (in phase II and phase III) is adequate to permit an evaluation of safety and effectiveness (8). Within 30 days of receipt of a new IND, FDA will either allow the IND to proceed or will place the IND on hold. If an FDA reviewer discovers potential hold issues, a discussion between the division
and the sponsor may occur prior to the 30-day deadline in order to resolve deficiencies in the IND. When a study is placed on clinical hold, no patient may receive the investigational drug(s) under that IND until FDA receives the necessary information to resolve the issues resulting in the hold and states in writing that the clinical hold has been lifted. FDA may also place a portion of the IND on hold. For example, FDA may allow patients who have received the investigational drug without experiencing toxicity to continue to receive the drug while suspending accrual of treatment-naïve patients.
For phase I (first-in-humans) studies, there are five reasons cited in the regulations (21 CFR 312.42) that allow FDA to place a study on hold:
1. Unreasonable or significant risk of injury or illness to the subject
2. Lack of qualified investigators
3. Misleading, erroneous, or incomplete investigator's brochure
4. Insufficient data submitted to evaluate the risk
5. Exclusion of men or women with reproductive potential from a study intended to treat a life-threatening disease that affects both genders (this criterion does not apply to pregnant women or to diseases that affect only one gender)
In addition to these five reasons, a phase II or phase III protocol can be placed on hold if it is clearly deficient in design to meet its stated objectives (9).
FDA Review Process for INDs
An IND must contain information pertaining to a drug's chemistry, manufacturing, and controls (CMC), as well as the drug's pharmacology and toxicology. A clinical protocol must also be submitted. The IND must contain an investigator's brochure unless the IND is being submitted and conducted by a single sponsor-investigator (10).
A sponsor must also complete and sign FDA Form 1571, in which the sponsor agrees to the following: (1) clinical investigations will not begin until 30 days after the date the IND is received at FDA; (2) clinical investigations will not be initiated, or will be discontinued, if FDA places the IND on hold; (3) an IRB that complies with 21 CFR 56 will be responsible for initial and ongoing review of all studies under the IND; and (4) all clinical studies will be conducted in accordance with applicable regulations. The applicable Title 21 regulations for clinical investigations are published and updated online at http://www.accessdata.fda.gov/SCRIPTs/cdrh/cfdocs/cfcfr/CFRSearch.cfm. Of special importance are 21 CFR 50 (informed consent), 21 CFR 56 (IRBs), and 21 CFR 312 (IND regulations).
When the original IND is submitted to FDA, all members of the IND review team recommend to the Division Director whether the IND should proceed or be placed on clinical hold. The decision regarding the original IND submission is made within 30 days after FDA receives the application.
CMC Review Process
FDA chemists and biologists review data regarding the drug substance, drug product, labeling information, and environmental analyses. Sufficient information must be submitted during each phase of clinical drug development to assure the proper identification, quality, purity, and strength (or potency) of the product (11). Because manufacturing changes may occur during drug development, the requirements for CMC information depend on the phase of investigation. Sponsors can meet with FDA in a pre-IND meeting to discuss their proposed approach to product characterization and to pose questions regarding CMC issues prior to the IND submission.
Academic institutions and sponsor-investigators may submit a separate IND and serve as the IND sponsor in order to conduct clinical investigations of an investigational drug when the drug is supplied by a commercial manufacturer. In such instances, the academic institution or sponsor-investigator is also given a letter of cross-reference authorizing FDA to review CMC or toxicology data contained in the commercial manufacturer's IND in support of the academic institution's or sponsor-investigator's proposed clinical investigations. This generally relieves the noncommercial sponsor of the need to submit CMC information, unless the investigational product is modified or administered in a manner not supported by the cross-referenced data.
Pharmacology/Toxicology Review Process
FDA toxicology reviewers assess animal studies and other toxicology data and collaborate with the clinical reviewer (usually an oncologist for drugs intended to be studied in patients with cancer) to determine a safe starting dose at which to initiate clinical studies. The toxicology reviewer also provides advice to the clinical reviewer on additional monitoring for specific adverse events (AEs) that may be indicated based on the animal studies. The toxicology studies required to initiate a clinical study depend on an assessment of the reasonably acceptable risks in the potential subjects to be enrolled in the studies. For example, carcinogenicity studies may be required prior to initiating studies in
healthy volunteers; however, these nonclinical studies may not be needed prior to administering the drug to patients with advanced cancer (12).
Clinical Review
The FDA clinical reviewer assesses the protocols, informed consent documents, and other animal or human data pertinent to establishing the safety of the proposed clinical trials. After initiation of the clinical investigations, the clinical reviewer evaluates amendments to the IND, which may include protocol amendments, new protocols, expedited safety reports, and annual reports.
Other Reviews by FDA
A clinical pharmacologist reviews animal and (if available) human pharmacokinetic data and evaluates the scientific rigor of clinical plans for the characterization of the pharmacokinetic profile of the investigational drug in the intended population and in special populations (e.g., in patients with hepatic or renal impairment). The clinical pharmacologist also evaluates plans for the characterization of the drug's effect on QTc intervals, the evaluation of drug interactions, and the determination of the incidence and clinical impact of immunogenicity (i.e., anti-product antibodies, which are most frequently of concern with biotechnology-derived products). A statistician may be included as part of the IND review team (especially for phase II and phase III studies) to evaluate the appropriateness of the plans for data analysis and to confirm the results of data analyses submitted by the IND sponsor.
Sponsor and Investigator Responsibilities Regarding the IND
There are obligations that both the sponsor and the investigator must comply with when conducting studies under an IND.
Sponsors
Title 21 CFR 312 defines a sponsor as a person who takes responsibility for and initiates a clinical investigation. The sponsor may be an individual, pharmaceutical company, governmental agency, academic institution, private organization, or other organization. All IND sponsors are obligated to ensure that clinical investigators are qualified (by training and experience) to conduct a specific investigation and that all investigators read the protocols and investigator's brochure (13). The sponsor is also responsible (directly or by subcontracting this responsibility) for oversight of the clinical investigations through regular monitoring and auditing in order to ensure that the investigators are compliant with their obligations to adhere to the protocol, obtain IRB review, obtain informed consent, and provide required investigator information (e.g., financial disclosure forms) (13). Additional responsibilities of an IND sponsor include ensuring control of the investigational drug supply and disposition, maintaining adequate records, and allowing FDA inspections (13). Importantly, IND sponsors must ensure that reports of serious and unexpected AEs are submitted to FDA in an expedited fashion. A summary of more common AEs must be submitted in an annual report, within 60 days of the anniversary of the date that the original IND became active (13, 14).
Investigators
A clinical investigator is an individual who actually conducts a clinical investigation (15). Investigators are responsible for complying with the protocol and protecting the rights, safety, and welfare of subjects (16). The investigator must obtain IRB approval prior to initiating a clinical study and must obtain informed consent before enrolling subjects (16). The investigator must maintain records, including drug disposition records and case histories (21 CFR 312), and must submit reports, including reports of serious AEs, to the sponsor as required by the regulations and the protocol.
EXPANDED ACCESS MECHANISMS FOR INVESTIGATIONAL DRUGS
In certain instances, a patient may wish to obtain access to an investigational drug if he or she does not qualify for an ongoing clinical trial (or a clinical trial may not be ongoing while a company awaits a final regulatory decision regarding marketing approval). FDA published a proposed rule (a proposed new regulation) in December 2006 to clarify procedures for allowing expanded access to investigational drugs (17). A physician should first contact the IND sponsor for that investigational drug (usually the pharmaceutical company) to determine whether a treatment IND exists that contains a clinical protocol suitable for the specific patient. If the IND sponsor does not have an existing protocol that would allow the patient to receive access to the study drug, the physician should inquire whether the IND sponsor will agree to supply the investigational drug for the specific patient and provide the physician with a letter of cross-reference to CMC, toxicology, and clinical information in the original IND. In such cases, the physician may submit an IND (a form of sponsor-investigator IND) for the specified patient.
The contents of such INDs are often limited to essential elements (FDA Forms 1571 and 1572, the letter of cross-reference, brief patient history, treatment and monitoring plan, and copy of the informed consent document); the IND may be allowed to proceed within hours to days of receiving this information. The regulations include provisions for Treatment INDs and Treatment Protocols, which are mechanisms to provide patient access to promising investigational drugs for life-threatening diseases when there are no available comparable or satisfactory alternative therapies. The investigational drug to be administered in a Treatment IND must either be under investigation in a controlled clinical trial under an IND or all trials must be completed, and the IND sponsor must actively be pursuing marketing approval with due diligence. Other types of nontraditional INDs include Group C INDs and emergency use INDs. The Group C process involves an agreement between the National Cancer Institute (NCI) and oncologists who are registered with NCI to allow for the use of investigational drugs according to a physician-specified treatment protocol. While this mechanism has been used in the past, NCI has rarely submitted such INDs in recent years. An emergency use IND may be granted in an imminently life-threatening situation, in which there is not sufficient time to formally submit an IND for review. In such instances, FDA will review the information obtained by telephone conversation, facsimile transmissions, or e-mail, with the formal IND submission to occur within 30 days of allowing the IND to proceed.
THE NDA OR BLA
The CMC, toxicology, pharmacology, and clinical data intended to support marketing approval for an investigational agent are submitted under an NDA for a drug or a BLA for a biologic product as defined under the Public Health Service (PHS) Act. Despite certain differences in the legal basis and in some requirements for marketing approval, data contained in NDAs and BLAs must demonstrate substantial evidence of effectiveness and a favorable risk-benefit analysis derived from adequate and well-controlled trials (18).
The Legal Basis Behind Drug Approval Requirements
Many laws pertaining to the regulation of drugs were generated following well-publicized tragedies. The 1938 Federal Food, Drug, and Cosmetic Act was passed following the deaths of 107 people who ingested elixir of sulfanilamide that contained diethylene glycol. The 1962 Kefauver-Harris Amendments to the Food, Drug, and Cosmetic Act were passed following the evidence of congenital malformations in children whose mothers used thalidomide during pregnancy (19). The following list describes the major laws pertaining to the drug approval process.
1. 1906 Pure Food and Drugs Act—prohibition of interstate trade in misbranded or adulterated food and drugs (19)
2. 1938 Federal Food, Drug, and Cosmetic Act—requirement that drugs be shown safe prior to marketing (19)
3. 1962 Kefauver-Harris Amendments to the Food, Drug, and Cosmetic Act—requirement that drugs be safe and effective prior to marketing (19)
4. 1997 Food and Drug Administration Modernization Act (FDAMA)—contained revisions of the 1962 law to allow FDA to consider evidence from one adequate and well-controlled trial plus confirmatory evidence as being sufficient to establish effectiveness (20)
5. 2007 Food and Drug Administration Amendments Act (FDAAA)—contains provisions to allow FDA to mandate certain labeling changes pertaining to drug safety (21)
1962 Kefauver-Harris Amendments to the Food, Drug, and Cosmetic Act
After passage of the 1962 Kefauver-Harris Amendments, marketing applications for approval of drugs required demonstration of substantial evidence of effectiveness. The amendments required that substantial evidence come from adequate and well-controlled trials (20). Because the statute uses the plural "trials," applicants for drug approval are generally expected to show that the product will have the effect it purports to have in more than one trial. In 1997, passage of FDAMA permitted FDA to consider evidence from one adequate and well-controlled trial plus confirmatory evidence in certain circumstances (20). These circumstances include persuasive findings in a large multicenter trial with consistent results across subsets (20). Furthermore, new drug approvals based on one adequate and well-controlled study should demonstrate a robust and statistically persuasive effect on a serious outcome such as survival or irreversible morbidity, such that a second study would be considered unethical (20).
Accelerated Approval
The provisions for accelerated approval were codified in regulations in 1992 in response to the AIDS epidemic (22). Accelerated approval requires that adequate and well-controlled trials establish that a drug has an effect on a surrogate end point that is reasonably likely to predict clinical benefit (23, 24). Similar to regular
approval, accelerated approval must be based on substantial evidence from adequate and well-controlled trials. Accelerated approval also requires that an applicant conduct additional studies, or provide additional data from ongoing studies, that verify the clinical benefit (23). These confirmatory studies should be ongoing at the time of drug approval. Over time, some end points originally used as the basis for an accelerated approval may come to be considered established surrogates for clinical benefit as additional clinical data become available. FDA often seeks advice from the Oncologic Drugs Advisory Committee (ODAC) and outside experts in determining when there is sufficient evidence to consider a surrogate end point a validated or established surrogate.
Approval of Oncology Drugs—End Points
FDA must be able to accurately describe in the product label the effect that a particular drug has on an end point. The drug should either improve mortality, or relieve suffering by improving quality of life, tumor-related symptoms, or physical functioning (25–28). With few exceptions, randomization is an essential clinical trial design element for determining whether a drug confers clinical benefit and is critical for time-to-event end points (e.g., survival improvement), quality-of-life end points, and comparative drug claims. In 1979, the U.S. Supreme Court (United States v. Rutherford) concluded that a drug intended to treat terminally ill patients is effective if it prolongs life, improves physical condition, or reduces pain, and that a drug is safe if the potential to cause harm is offset by the potential therapeutic benefit (29). In multiple meetings held in the 1980s, ODAC advised FDA that a drug should improve mortality or relieve suffering to demonstrate clinical benefit (25). An improvement in overall survival (OS) remains the highest level of efficacy for approval of new drugs for the treatment of cancer patients.
More recently, FDA has approved new drugs based on end points considered to be established surrogates for clinical benefit (regular approval) or surrogates reasonably likely to predict clinical benefit (accelerated approval). Such end points have included time-to-progression, progression-free survival (PFS), and durable objective response rates. The acceptability of a specific end point in trials supporting an NDA or BLA is influenced by the indication being sought, the magnitude of the effect demonstrated (effect size and duration), the demonstrated risks and benefits of alternative therapies, and the new drug's safety profile (25). A clinically significant improvement in PFS is an established surrogate for clinical benefit in certain cancers (e.g., chronic lymphocytic leukemia); however, in
many cancers, PFS is generally considered a surrogate reasonably likely to predict clinical benefit. The potential advantages of using PFS as a study end point are that demonstration of treatment effects requires smaller sample sizes and shorter follow-up and that PFS will not be affected by subsequent therapies (25). However, trials in which efficacy is primarily based on effects on PFS can be expensive: because PFS can be subject to intentional and unintentional bias, blinded independent reviews of radiographs are often required to minimize assessment bias. Blinded independent reviews are not necessary when patients and investigators are masked to the treatment assignment and the toxicities of the drugs do not unblind the treatment (25). When PFS is compared across treatment arms, patient visits and radiological assessments should be symmetric between study arms to allow for valid comparisons. Adequate attention to trial conduct is also important in a study with a PFS end point because missing data can complicate the analysis of PFS (25).
Durable reduction in tumors has been used to support regular and accelerated approval; however, the acceptability of tumor reduction, as measured by response rates, must be considered in the clinical context. For example, in some settings, an improvement in tumor response rates may not predict an improvement in either survival or time-to-progression (30). FDA has considered durable complete response rates to be an established surrogate for clinical benefit (regular approval standard) in patients with refractory acute myelogenous leukemia, because patients achieving complete responses have improved survival, a reduced incidence of infections, and reduced transfusion rates. However, FDA generally considers durable tumor response rates in patients with no satisfactory alternative therapies to be sufficient only to support accelerated approval. The effect size is especially important in the evaluation of response rates.
Additional FDA thinking regarding end points for cancer therapies can be found in an FDA Guidance for Industry document (25). A second draft Guidance for Industry provides additional advice regarding how to employ patient-reported outcome measures in support of product labeling and promotional claims (31).
FDA PREMARKET AND POSTMARKET SAFETY REVIEW
FDA review must accurately summarize the available safety information regarding a drug prior to its approval. The premarket safety review will be used in the risk-benefit analysis supporting approval of a new drug and should provide patients and health care providers the information required to make treatment decisions.
Standards for Including Adverse Reactions in the Drug Label
FDA must include an AE in the label when reasonable evidence of a causal association with a drug exists; however, a causal relationship does not need to have been definitively established, especially for rare, serious reactions that are unusual in the absence of drug therapy (32). In making an assessment for specific events, FDA may evaluate data from clinical trials, drug class effects, data from nonclinical sources, and data from patient use outside of clinical trials.
Postmarketing Safety Review
FDA reviews incoming voluntary AE reports submitted from virtually any source in addition to required reports submitted by product manufacturers. Because of the volume of reports received by FDA each day, FDA typically analyzes databases containing all AE reports to assess for safety signals that require further review. New clinical studies or meta-analyses that contain safety signals are also reviewed by FDA. When new postmarketing safety concerns are discovered, the FDAAA legislation of 2007 allows FDA to require certain changes to a drug's label and (when necessary) additional post-approval clinical studies with specific enforceable timelines (21). The FDAAA legislation also allows FDA to require that the commercial manufacturer develop and conduct a risk evaluation and mitigation strategy (REMS) to ensure a drug's safe use. Examples of REMS elements include restricted drug access programs, registries to obtain additional safety data, and creation of a Medication Guide or communication plan. Additionally, FDA began the Sentinel Initiative in May 2008 with the goal of creating and implementing a nationwide electronic system for monitoring medical product safety (21).
References

1. FDA's Mission Statement. http://www.fda.gov/opacom/morechoices/mission.html. Accessed Dec 30, 2008.
2. Farrell AT, Papadouli I, Hori A, et al. The advisory process for anticancer drug regulation: a global perspective. Ann Oncol. 2006;17:889–896.
3. 21 CFR 312.
4. Federal Food, Drug, and Cosmetic Act, Section 301(g)(1).
5. 21 CFR 312.20.
6. 21 CFR 312.2.
7. FDA Guidance for Industry: IND Exemptions for Studies of Lawfully Marketed Drug or Biological Products for the Treatment of Cancer. Issued Jan 15, 2004.
8. 21 CFR 312.22(a).
9. 21 CFR 312.42.
10. 21 CFR 312.55.
11. 21 CFR 312.23.
12. International Conference on Harmonization Document S1A, "Guideline for Industry: The Need for Long-term Rodent Carcinogenicity Studies of Pharmaceuticals." March 1996.
13. 21 CFR 312.53.
14. 21 CFR 312.33.
15. 21 CFR 312.3.
16. 21 CFR 312.60.
17. 71 FR 75147 (December 14, 2006).
18. Farrell A, Williams G, Pazdur R. FDA's role in the development and approval of drugs, biologics, and devices for cancer. In: DeVita VT, Hellman S, Rosenberg SA, eds. Cancer: Principles and Practice of Oncology. 8th ed. Philadelphia, PA: Lippincott-Raven; 2009.
19. Milestones in U.S. Food and Drug Law History. http://www.fda.gov/opacom/backgrounders/miles.html. Accessed Dec 30, 2008.
20. FDA Guidance for Industry: Providing Clinical Evidence of Effectiveness for Human Drugs and Biological Products. Issued May 1998.
21. FDAAA Implementation—Highlights One Year after Enactment. http://www.fda.gov/oc/initiatives/advance/fdaaa/accomplishments.html. Accessed Dec 30, 2008.
22. 57 FR 58942 (December 11, 1992).
23. 21 CFR 314.500–530.
24. Dagher R, Johnson J, Williams G, et al. Accelerated approval of oncology products: a decade of experience. J Natl Cancer Inst. 2004;96:1500–1509.
25. FDA Guidance for Industry: Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics. Issued May 2007.
26. Johnson JR, Williams G, Pazdur R. End points and United States Food and Drug Administration approval of oncology drugs. J Clin Oncol. 2003;21:1404–1411.
27. Williams G, Pazdur R, Temple R. Assessing tumor-related signs and symptoms to support cancer drug approval. J Biopharm Stat. 2004;14:5–21.
28. Rock EP, Scott JA, Kennedy DL, et al. Challenges to the use of health-related quality of life for Food and Drug Administration approval of anticancer products. J Natl Cancer Inst Monogr. 2007;37:27–30.
29. U.S. Supreme Court, United States v. Rutherford, 442 U.S. 544 (1979). http://supreme.justia.com/us/442/544/case.html. Accessed Dec 31, 2008.
30. Fleming TR. Objective response rate as a surrogate endpoint: a commentary. J Clin Oncol. 2005;23:4845–4846.
31. FDA Draft Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. Issued Feb 2006.
32. FDA Guidance for Industry: Adverse Reactions Section of Labeling for Human Prescription Drug and Biological Products—Content and Format. Issued Jan 2006.
33
Industry Collaboration in Cancer Clinical Trials
Linda Bressler
Much has been written in recent years about the relationship between industry and researchers in the conduct of clinical trials, including cancer clinical trials. Though the majority of literature discusses issues surrounding the relationship between the academic medical center and an industry collaborator, many of the same issues apply to the private hospital or physician practice, cooperative research groups, or individual investigator. Since the 1980s, federal funding for clinical research in the United States has been perceived to have declined, and more recently has actually declined. In the 1950s and 1960s, the majority of research funding to universities came from the government, with private industry providing less than 5% of research funding. The sources of funding have shifted considerably, and by the early 2000s, for-profit companies were providing financial support for 70% of clinical drug trials conducted at academic medical centers (1, 2). Collaboration with industry may provide the investigator (and his or her institution) with financial support, and does provide the investigator (and therefore patients) with access to interesting investigational compounds. This is particularly valuable in oncology clinical research. Investigators can thus become familiar with compounds earlier in their development. And although Institutional Review Boards (IRBs) may not consider the provision of drugs at no charge to patients
as a benefit of participating in trials, the patients themselves often view free drugs as a benefit (3). The collaboration often provides industry with access to experienced researchers and opinion leaders as well as potentially large numbers of patients. In addition, depending on the degree of involvement of industry and the investigator in trial design, the experienced researchers and opinion leaders lend scientific credibility to the study and subsequently to the publication that reports the results (3). Thus, collaboration with industry has the potential to be mutually beneficial. However, industry's increased involvement in the trials that will (or will not) result in regulatory approval of its product, with the accompanying financial gains, brings a host of issues beyond the scientific question under study. Individual investigators likely have varying levels of experience with concepts such as publication rights, data ownership and access, intellectual property, conflict of interest, budget and contract negotiation, and confidentiality and disclosure. Some argue that direct company involvement in and support for clinical trials of the company's own product should be eliminated (2, 4). Others seek to uphold scientific integrity within a collaborative relationship through regulations, oversight, and disclosure of conflicts. Without taking a position for or against the existence of the collaborative relationship in the first place, this chapter assumes that the relationship
will exist, and discusses some aspects that should be considered by the individual investigator. In many cases, the role of the investigator in an industry-sponsored trial has changed from identifying the scientific question, designing the study, analyzing, and then sharing the results with the scientific community to accruing patients and submitting data for a study already designed by the company and perhaps managed by a contract research organization (5). The investigator should always consider study design issues and potential sources of bias, and particularly so if they have had limited input into the design. Does the study question indeed need to be answered? Will a noninferiority study provide information that will be of value to the scientific community? Is the comparator appropriate and is it administered at the appropriate dose and schedule (6)? In an investigator-initiated trial, the investigator generally retains responsibility for the design and conduct of the trial, applying to the company for support in the form of funds and drugs. In the case of the investigator-initiated trial, companies vary in their policies with regard to how much input they require in study design and conduct. Thus, the same sources of bias that might exist in an industry-sponsored trial might also exist in an investigator-initiated trial. Companies also vary with regard to whether or not they will allow investigators to cross-reference the company’s investigational new drug (IND) for a drug that has not yet been approved for marketing. This has implications for the regulatory requirements that will be incorporated into the practical aspects of study conduct. In both the industry-sponsored trial and the investigator-initiated trial, it is essential that the investigator understand the terms and limitations of the contract with the industry collaborator. 
Ideally, contracts should provide clear documentation for what is expected of the investigator, including the role of the investigator in the design of the trial. However, contracts are generally negotiated between attorneys and/or contract negotiators for both parties, and oftentimes, none of the negotiators understand the science or the practical aspects of the trial. Example 1: I was reviewing a grant agreement for financial support from industry for a laboratory correlative study to be conducted by cooperative group laboratory investigators. The correlative study was to be done using stored tissue specimens obtained in a treatment trial that had already been completed. The treatment trial had been coordinated by a different cooperative group, and the question in the treatment trial involved the company’s drug. I entered the
negotiations already in progress. The agreement repeatedly referred to the study drug (e.g., how the study drug would be provided, handling of confidential and proprietary information about the study drug, etc.). The contract language was standard for the company, and the negotiators were not aware that the agreement was for a correlative study and there was, in fact, no study drug. In this case, the consequences of the unfamiliarity with the work that was being funded were minimal and limited to delays in executing the contract and receiving the funds. However, if funding is required in order to begin the work, the impact of a delay in executing the contract can be significant. While contract negotiators might not be expected to understand the scientific or practical aspects of the trial, they should certainly be expected to understand the implications of the contract language itself, and they should educate and advocate for the investigator and the institution in their negotiations. Perhaps the contract clauses that have received the most attention in recent years are those pertaining to disclosure of confidential information, publication rights, and access to study data (7). In the extreme, an industry collaborator sued an investigator for disclosing data indicating ineffectiveness of and possible toxicity associated with the collaborator’s compound. The contract between the investigator and the collaborator included a clause in which the investigator agreed not to publish her findings without prior approval of the collaborator. The collaborator did not grant approval, and not only tried to stop the investigator from presenting the results to the scientific community, but also tried to stop her from notifying research subjects and the ethics committee/IRB (8). The case attracted worldwide attention and extensive debate on the meaning of and suppression of academic freedom. Such a debate is outside of the scope of this chapter. 
However, most would agree that an investigator should not be prevented from publishing his or her research, even if the results are unfavorable to an industry collaborator. A company is accountable to its shareholders, and a researcher is accountable to the scientific community and to the patients who participate in clinical trials. However, the ultimate goal of all of those involved in clinical research should be improved patient outcomes (3). Preventing publication of negative results biases the information that will eventually be available to clinicians, thus biasing the clinicians’ decisions. Patients may be needlessly exposed to minimally effective or ineffective and expensive or toxic treatments when unpublished negative results equal or outweigh published positive results.
The requirement by medical journals that trials be registered early (e.g., on clinicaltrials.gov) as a condition of considering submissions for publication helps to track trial progress and allows for documentation of negative results (9). But trial registration does not relieve the researcher of the right, indeed the obligation, to report important negative results. Example 2: A phase II cooperative group trial did not meet its prespecified end point. An abstract reporting the result was submitted to a professional meeting. A copy of the abstract was sent to an industry collaborator for courtesy review. Of note, the study was conducted under the group's IND. The group designed and conducted the trial, analyzed the results, and prepared the abstract. The collaborator did not have access to the data. The agreement between the group and the collaborator provided for the collaborator to receive copies of any manuscripts at least 30 days prior to their submission for publication and for the group to delay submission if requested by the collaborator in order for the collaborator to seek patent or intellectual property protection. Upon receipt of the abstract, an attorney for the collaborator telephoned me to say that the collaborator wanted the group to withdraw the abstract because the conclusion was not supported and was not favorable to the collaborator. The concluding statement of the abstract was modified to more clearly define the population for whom the treatment was not effective, and the abstract was submitted. In addition to the contract language regarding their right to publish, investigators should also consider the contract provisions and the collaborator's overall policies for authorship. Ghostwriting is an important potential source of bias in publication (4, 6). Ghostwriters are unacknowledged authors who are paid for developing manuscripts.
The authors who are named in the publication may or may not have had a significant role in study development or conduct, or in writing the manuscript. But they are the opinion leaders mentioned above, and their names add credibility to the publication. It is the collective responsibility of investigators, actual authors, and journal editors to avoid bias resulting from ghostwriting. In 2006, the Executive Committee of the Association of American Medical Colleges (AAMC) approved a set of twenty-two "Principles for Protecting Integrity in the Conduct and Reporting of Clinical Trials." The principles were developed as a result of concern for appropriate
standards for analysis and reporting of sponsored trials, and in particular industry-sponsored clinical trials. Specifically, the AAMC principles state that "Ghost or guest authorship is unacceptable. Authorship implies independent, substantial, and fully disclosed participation in the study and in the preparation of the manuscript. It is acceptable for employees of the sponsor to participate in drafting and publication activity, but only if their roles are fully disclosed" (10). It follows that in order to publish meaningful results, the investigator must have access to the data. In a multicenter trial, an individual investigator is discouraged from publishing results from only his or her site. Such results are likely to be biased or misleading and may not be reflective of the entire study population. Similarly, the author in a multicenter trial must have access to all of the data, not just the data from his or her center. Indeed, author access to only some of the data can effectively limit publication rights. Again, an understanding of the contract provisions regarding authorship and access to data is essential. Several reports describe surveys about the language in contracts with industry sponsors. Schulman et al. surveyed 122 medical schools to evaluate adherence to the guidelines of the International Committee of Medical Journal Editors (ICMJE) (11). Like the AAMC, the ICMJE was concerned about maintaining the integrity of research within the context of increasing industry sponsorship, and in 2001, the "Uniform Requirements for Manuscripts Submitted to Biomedical Journals" were revised (most recent update 2008). Schulman et al. developed a telephone survey including 22 questions about, among other issues, access to data and publication of results.
The survey respondent was selected by each institution and was intended to be the person “most knowledgeable about the content of agreements with industry sponsors.” Responses were expressed as compliance scores—the percentage of the institution’s agreements that addressed the particular item. Although all agreements gave investigators access to their own data and the median compliance score for the item “agreement allows site investigators to analyze and publish site data” was 100%, the median compliance score for the item “agreement requires access to all data for authors of reports on multicenter trials” was only 1% (11). Mello et al. described a contract negotiation survey conducted among 107 medical school administrators. They noted that 24% of respondents would allow an industry collaborator to insert their own statistical analysis in a manuscript, 47% would not, and 29% were not sure. Similarly, 50% of respondents would allow an industry sponsor to draft the manuscript,
40% would not, and 11% were not sure (12). These authors also conducted a survey of investigators, noting that the responses of investigators differed significantly in some areas from those of the administrators who serve as the contract negotiators (13). It seems clear that not only do the contents of executed agreements vary among institutions, but also the investigators who are expected to uphold the agreements may not be knowledgeable of these contents. The negotiator's contract review must inform the investigator's decision to participate in a collaborative effort with industry, and input from the investigator must inform the negotiations. The contract negotiation process should thus be a partnership between the negotiator and the investigator. Recognizing that contract negotiation is seen as a barrier to initiating, and therefore completing, clinical trials involving academic medical centers and industry collaborators, in 2008 the National Cancer Institute (NCI) and the CEO Roundtable on Cancer proposed standard language for clinical trial agreements. The CEO Roundtable on Cancer (http://ceoroundtableoncancer.org) is a nonprofit corporation consisting of corporate executives from major American companies. The proposed language was developed from a review of redacted, previously negotiated agreements between cancer centers and pharmaceutical companies. The standard language is not mandatory and is intended to serve as a starting point for contract negotiations. The document includes clauses for agreements for industry-sponsored trials and for investigator-initiated trials (14). It is intended for use by cancer centers, but other parties, including individual investigators, may be able to use any or all of the clauses to facilitate contract negotiations. The proposed language does not include clauses for all of the items surveyed by Schulman and by Mello.
Investigators unfamiliar with clinical trial agreements may find it useful to compare the items in these three references as they begin to consider how to review agreements covering their own collaborations with industry. It is also interesting to note subtle differences between the agreement language for industry-sponsored and investigator-initiated trials. The development of a budget is another important component of the industry collaboration. The total budget amount is generally included in the contract, but the development of the budget is a separate process occurring parallel to contract negotiation. The investigator should consider one-time start-up costs as budget items. These items reflect work that is required independent of the number of patients eventually enrolled on the trial (e.g. design and development of the protocol, submission of the protocol to the institutional
review board/ethics committee), and for which payment should be considered nonrefundable. Other budget items include the time and effort of members of the study team, as appropriate, for patient recruitment; physical examinations and patient assessments; drug ordering, record keeping, preparation, dispensing, and administration; assessment of adverse events; data collection and submission; and completion of regulatory requirements. Finally, other items to include in budget requests are nonstandard-of-care (i.e., research related) clinical tests and procedures. It is an unfortunate occurrence when, following a successful contract negotiation, patient participation in the study is limited by refusal of third party payers to reimburse for nonstandard-of-care tests that were not included in the original budget. Industry collaboration has become commonplace in clinical research. Concern for preserving the scientific integrity of research, and the public trust in the research process has led some to suggest that industry sponsors should be removed from direct involvement in studies of their own products. It has been proposed that an independent institute be established to oversee the design and conduct of clinical drug trials, including the identification of the researchers (2, 4). Until such revamping of the research system occurs, it is essential that investigators approach industry collaborations with a clear understanding of the expectations of all involved parties.
References

1. Margolin KA, van Besien K, Peace DJ. An introduction to foundation and industry-sponsored research: practical and ethical considerations. Am Soc Hematol Educ Program. 2007:498–503.
2. Schafer A. Biomedical conflicts of interest: a defence of the sequestration thesis—learning from the cases of Nancy Olivieri and David Healy. J Med Ethics. 2004;30:8–24.
3. Bressler LR, Schilsky RL. Collaboration between cooperative groups and industry. J Oncol Pract. 2008;4:140–141.
4. Angell M. Industry-sponsored clinical research. A broken system. J Am Med Assoc. 2008;300:1069–1071.
5. Rowinsky EK. Erosion of the principal investigator role in a climate of industry dominance. Eur J Cancer. 2005;41:2206–2209.
6. DeAngelis CD, Fontanarosa PB. Impugning the integrity of medical science. The adverse effects of industry influence. J Am Med Assoc. 2008;299:1833–1835.
7. Drazen JM. Institutions, contracts, and academic freedom. N Engl J Med. 2002;347:1362–1363.
8. Olivieri NF. Patients' health or company profits? The commercialisation of academic research. Sci Eng Ethics. 2003;2:29–41.
9. De Angelis CD, Drazen JM, Frizelle FA, et al. Clinical trial registration: a statement from the International Committee of Medical Journal Editors. N Engl J Med. 2004;351:1250–1251.
10. Principles for Protecting Integrity in the Conduct and Reporting of Clinical Trials. Association of American Medical Colleges; 2006. http://www.aamc.org/research/clinicaltrialsreporting/clinicaltrialsreporting.pdf. Accessed April 14, 2009.
11. Schulman KA, Seils DM, Timbie JW, et al. A national survey of provisions in clinical-trial agreements between medical schools and industry sponsors. N Engl J Med. 2002;347:1335–1341.
12. Mello MM, Clarridge BR, Studdert DM. Academic medical centers' standards for clinical-trial agreements with industry. N Engl J Med. 2005;352:2202–2210.
13. Mello MM, Clarridge BR, Studdert DM. Researchers' views of the acceptability of restrictive provisions in clinical trial agreements with industry sponsors. Account Res. 2005;12:163–191.
14. Proposed standardized/harmonized clauses for clinical trial agreements: collaboration between the National Cancer Institute and the CEO Roundtable on Cancer, 2008. http://cancercenters.cancer.gov/documents/stclauses.pdf. Accessed April 14, 2009.
34
Defining the Roles and Responsibilities of Study Personnel
Frederic De Pourcq
The successful implementation of any clinical trial depends on the allocation of appropriate resources to aid the principal investigator (PI) in conducting the trial. Once the personnel resources have been identified, it is important to assign clear and well-defined roles and responsibilities to each research team member. Regardless of the size of the research team, everyone must have a clear understanding of what is expected of him or her during the life of a clinical trial. While the PI is ultimately responsible for ensuring that the trial is conducted appropriately and in accordance with federal regulations, he or she may delegate certain responsibilities to individuals who make up the research team. Indeed, the myriad of federal regulations governing clinical trials today virtually demands that certain key activities associated with any clinical trial be delegated if a PI is to fulfill his or her obligations and avoid citation for noncompliance.
THE RESEARCH STUDY TEAM

According to the International Conference on Harmonization (ICH) guidelines, the investigator should have adequate resources, including a qualified research team, to carry out the research (ICH 4.2). Allocating sufficient personnel is not only a compliance and implementation strategy but, more importantly, a question of subject safety. The scope and complexity of the clinical trial,
the number of evaluable subjects needed for analysis, and the duration of subject participation will influence the structure of the research team and who will be responsible for certain key activities. Because the onus of adequate resource allocation is on the PI, the team's composition depends on the trial's complexity: a research team may consist of only a PI and a clinical research nurse (CRN), or it may have a more comprehensive structure that includes coinvestigators, a CRN, a data manager (DM), a clinical research associate (CRA), a pharmacist, residents, and fellows. While it is one thing to activate a clinical trial, it is quite another to ensure that the key activities that accompany it are actually getting done. Table 34.1 lists key activities common to all clinical trials that a PI must consider when starting a new clinical research trial. Most assigned roles within a research team have commensurate generic titles with associated responsibilities; however, it is up to the PI to decide who the most appropriate persons to perform key activities are. It is important to remember that regardless of the size of the research team, all key activities must be performed in a manner that will instill confidence in anyone reviewing the trial results that the trial was conducted in accordance with good clinical practices (GCPs). To this end, all assigned roles and responsibilities should be clearly defined and should never compete or overlap. To avoid confusion as to "who is doing what," the principal investigator should formally record each team member's responsibilities in
TABLE 34.1

Regulatory Documentation and Maintenance
Who will maintain the following documents:
· Most recent IRB-approved protocols and informed consents
· Original and revised copies of FDA Form 1572
· CVs of the principal investigator and co-investigators
· Subject screening and enrollment logs
· Staff signature logs
· Copies of the hospital or local laboratory certificates
· Copies of signed financial disclosure forms for each investigator on the team
· Copy of the most recent Investigator Brochure (IB)
· Correspondence to and from the trial sponsor or FDA

IRB Review and Documentation
Who will prepare and submit:
· The protocol and related documents for submission to the IRB for review and approval
· The local IRB continuing (annual) reviews throughout the life of the trial
· Ongoing correspondence with the IRB
· Protocol amendments and revisions to the IRB
· Serious adverse event (SAE) reports to the IRB
· Protocol deviations to the IRB

Subject Recruitment, Eligibility, and Enrollment
Who will be responsible for:
· Recruiting or screening subjects for eligibility
· Obtaining and documenting the informed consent process
· Enrolling and randomizing subjects after obtaining informed consent
· Educating and maintaining regular contact with the subjects throughout study participation
· Monitoring subjects for compliance
· Documenting all clinical events and performing regular assessments such as adverse event and concomitant medication review
· Scheduling the subject for study visits and related procedures
· Ordering and receiving study drug
· Maintaining and reconciling drug dispensing logs

Data Collection and Reporting
Who will:
· Extract clinical data from the subject's medical record
· Enter data into case report forms
· Design case report forms (if necessary)
· Schedule data monitoring with the study sponsor
· Reconcile data discrepancies between the medical record and the case report form, including addressing queries from the sponsor
· Compile reports for regular data safety monitoring review by the PI and/or other institutional oversight entities
a Delegation Log, as required by ICH 4.1.5 (see sample Delegation Log in Appendix 34-1).

PI

Food and Drug Administration (FDA) regulation 21 C.F.R. 312.3(b) defines a clinical investigator as "an individual who actually conducts a clinical investigation (i.e., under whose immediate direction the
drug is administered or dispensed to a subject)"; whereas the National Cancer Institute (NCI) Clinical Trials Glossary defines a PI as one who "oversees all aspects of a clinical trial. . . ." The expectation (implied or otherwise) of either definition is that the investigator is the leader of the research team, driving the overall effort by directing the other members of the team. Table 34.2 lists the four main categories of a PI's key activities when managing a clinical trial.
TABLE 34.2

Administrative Oversight
· Develop the concept and draft the protocol.
· Assemble a research team and convene team meetings.
· File all regulatory documents related to the trial and generated by the trial's sponsor, NCI, FDA, or other oversight entities.
· Draft a satisfactory budget with a sponsor or underwriters that allows for adequate resources and cost recovery. Items to include in cost recovery include reimbursement for each research team member's time and any clinical procedures or evaluations that are necessary but would normally not be performed in the daily management of the patient's disease.
· Negotiate a contractual agreement with the trial sponsor (in academic institutions, investigator faculty cannot sign a contract).
· Submit the protocol and related documents, including the informed consent, for local IRB approval, with regular updates for continuing review.

Enrollment and Treatment Oversight
· Implement a recruitment strategy and review and confirm subject eligibility.
· Protect the rights of subjects participating in the trial and obtain informed consent.
· Oversee that patients are receiving the protocol intervention or therapy according to protocol dictates.
· Monitor each subject's response to therapy or intervention, including discontinuation of a subject's participation as specified in the protocol.
· Make revisions to the protocol as deemed appropriate.
· Report all protocol deviations to the local IRB.
· Review drug accountability records.
· Ensure adequate data collection tools are in place and records are retained.

Data Safety and Monitoring Oversight
· Submit periodic data safety and monitoring reports as dictated by the protocol's Data Safety and Monitoring Plan (DSMP).
· Assess adverse events for attribution and follow them until resolution.
· Report all serious adverse events according to the protocol's DSMP and the local IRB's reporting criteria.

Analysis and Publication
· Review the collected data regularly to evaluate the intervention for efficacy according to the statistical considerations spelled out in the protocol.
· Prepare the trial results for publication or presentation.
Be Aware

The PI is the only member of the research team to sign the FDA Form 1572 and thus is the only one who must answer for any and all federal, state, and local IRB regulatory infractions. Section 9 of the 1572 spells out very specifically an investigator's obligations for conducting a clinical trial under his or her oversight. Once the principal investigator signs and dates the 1572, he or she is ultimately responsible for the trial's conduct; hence, it is incumbent upon the PI to be aware of the state of the clinical trial at all times. The FDA will not accept ignorance as a defense, nor the transferring of blame for noncompliance to another member of the research team. Therefore, it is important that a PI exercise due diligence during all phases of the clinical trial, including meeting regularly with members of the research team to ascertain progress and to be informed of potential problems that may impact the trial's timeline or compliance with federal regulations.

What Cannot Be Delegated

Even though the Code of Federal Regulations and ICH guidelines permit certain research activities and responsibilities to be delegated to others, some key activities remain the exclusive jurisdiction of the PI and cannot be delegated. A PI may not delegate:
· Overall responsibility and accountability for the trial
· The signature on the FDA Form 1572
· Determination of causality between an adverse event (AE) and the protocol therapy or intervention
· Medical decisions related to a subject's overall treatment for his or her disease
For more detailed insight regarding the responsibilities of a PI, refer to 21 C.F.R. Part 50, 21 C.F.R. Part 312, 21 C.F.R. Part 812, and the ICH GCP guidelines.

Coinvestigator

The coinvestigator (a.k.a. subinvestigator) is a designated member of the clinical team who can perform critical trial-related procedures or who can make critical decisions related to the clinical trial, but who is supervised by the PI.
A coinvestigator may make decisions on behalf of the PI in the PI's absence. Coinvestigators are usually other physicians and residents, PhDs, research fellows, specialists, and other clinicians such as advanced practice registered nurses (APRNs) or clinical specialists. Coinvestigators who are also clinical physicians may enroll and treat patients on the protocol. For more complex clinical trials requiring discrete
ONCOLOGY CLINICAL TRIALS
analysis of secondary or tertiary objectives, a PI may enlist the help of a specialist or expert who brings a specific pool of knowledge to the team or who can perform a unique procedure, and who thus can provide objectivity to the overall findings of the clinical trial as well as aid in its completion. Coinvestigators are listed on, but do not sign, the FDA Form 1572.

CRN
The CRN is often the primary liaison between the investigators, other primary care providers, the Institutional Review Board (IRB), and the sponsor; in other words, the go-to person. The CRN (a.k.a. clinical research coordinator) can provide the PI with day-to-day support in the areas of preactivation, informed consent, recruitment and eligibility, treatment, compliance, and posttreatment follow-up. Because CRNs have frequent contact with research subjects during a study, they also play a vital role in human subject protection, patient safety, and compliance (1). More often than not it is the CRN who will screen, enroll, educate, and monitor subjects for compliance during protocol participation. Depending on enrollment activities and other obligations, a CRN is generally responsible for ensuring that all clinical documentation is complete and that adverse experiences and protocol deviations are reported, as well as for maintaining the protocol's regulatory files. Table 34.3 categorizes some of the key responsibilities of a CRN. The roles of the CRN and DM are often interchangeable when the research team consists of only the
TABLE 34.3
Key Responsibilities of a CRN

Pre-Activation
· Assess a potential protocol for possible barriers to subject recruitment, risks, and data collection problems
· Participate in protocol planning and pre-study meetings with the clinical team
· Develop recruitment strategies
· Assist with protocol submission to the IRB for approval
· Conduct in-service training for other clinical staff who may interact with the subject during study participation

Informed Consent and Eligibility
· Recruit subjects and serve as a point of contact regarding questions about the protocol requirements
· Review and verify eligibility of potential subjects
· Educate subjects about the protocol's requirements and the subject's obligations during study participation, and ensure that subjects are fully aware of possible side effects prior to signing the informed consent
· Ensure that the most recent and valid informed consent has been signed by the subject
· Enroll and randomize subjects
· Maintain screening and enrollment logs

Treatment and Monitoring
· Monitor subjects while on study for compliance
· Conduct adverse event assessments and review them with the principal or co-investigator managing the subject's treatment while on trial
· Regularly conduct concomitant medicine reviews
· Coordinate protocol treatment and schedule required procedures and ancillary services
· Administer all protocol questionnaires (e.g., quality of life, mini mental status exams, pain assessments) and review subject diaries

Patient Safety
· Notify the principal investigator, trial sponsor, and local IRB of any serious adverse events according to protocol requirements and timelines
· Update the research team as to the status of each subject and any problems with protocol compliance
· Notify the PI and local IRB of any protocol deviations

Post Treatment and Data Reporting
· Ensure off-study and follow-up procedures are performed
· Follow subjects for survival if required by the protocol
· Ensure source documentation of all protocol data elements is available for extraction and reporting in CRFs
· Review collected data for missing elements and inconsistencies
· Assist the research team with publications and presentations
34 DEFINING THE ROLES AND RESPONSIBILITIES OF STUDY PERSONNEL
PI and the research nurse. In such cases the CRN also has the added responsibility of performing many of the key activities associated with data management.

DM and CRA
Data integrity is the cornerstone of any clinical trial publication and requires considerable effort to ensure its reliability and reproducibility (2). The DM is the individual primarily responsible for ensuring that reportable data are accurate and verifiable. DMs are charged with extracting documented clinical results and completing the study case report forms, as well as with ensuring that the data are ready for statistical analysis. They are often the first point of contact between the trial's sponsor and the research team on issues related to data reporting. DMs keep the PI updated on the progress of data collection and on any timelines for interim data and safety review. Although DMs work closely with the biostatistician in cleaning up or reconciling discordant data, it is not their responsibility to interpret ambiguous or incomplete data; in such instances, they must refer back to the clinician responsible for the original documentation. In the case of pharmaceutical-initiated clinical trials, they coordinate addressing queries generated by either the sponsor or other designated entities reviewing the data. Table 34.4 lists the responsibilities commonly assigned to a DM.
TABLE 34.4
Common Responsibilities of a DM

Data Extraction
· Extract data from the medical or research record
· Enter data in case report forms (CRFs) or a computerized database
· Reconcile discordant data and address queries from the trial's sponsor or other members of the research team who will analyze the data
· Monitor patient accrual
· Design case report forms (CRFs)
· Coordinate and prepare for any audits and other data reviews
· Compile data safety monitoring reports as required by the protocol's DSMP
· Ensure data quality control
· Perform data back-up procedures
· Archive data for storage and later retrieval
· Participate in research team meetings
· Update the research team as to data collection efforts
CRA
Depending on the environment in which the clinical trial is taking place, a DM may be assigned some clinical research activities beyond the boundaries of strict data extraction. Such activities may involve limited patient interactions and include assisting the CRN in screening, enrollment, and follow-up of study subjects; administering some qualitative ancillary tests (e.g., quality of life questionnaires); reviewing patient diaries; processing specimens for quantitative analysis; and reminding the CRN of protocol visit expectations and documentation for each patient. In these instances the DM assumes a CRA role in addition to his or her data reporting activities. Depending on the clinical trial, it is not uncommon for a PI to include a CRA, in addition to a DM, on the research team, with a distinct and separate focus on clinical support activities. Further, the PI can also direct the CRA to coordinate the submission activities related to the protocol's IRB review and approval process, including continuing reviews and compiling and submitting interim reports to various oversight entities.

What to Look for in a DM or CRA
Finding the right skill set in a DM is important to the overall success of data reporting timelines. A DM and/or a CRA who is not a CRN should have previous research experience and be knowledgeable in research methodologies, medical terminology, and human anatomy; it also helps if he or she possesses a bachelor's degree in a health-related field, although this is not a prerequisite. They need a solid understanding of desktop computer applications such as Microsoft Access and Excel, and even of the analytical software used for statistical analysis.

Research Pharmacist
Most investigational agents are prepared and dispensed from pharmacies.
Enlisting a research pharmacist as a clinical member of the research team ensures the proper ordering, dispensing, and administration of the investigational drug(s); additionally, it relieves other members of the research team of those responsibilities. Table 34.5 lists the primary responsibilities of a research pharmacist.

TABLE 34.5
Primary Research Pharmacy Responsibilities
· Verify that each patient has signed informed consent prior to dispensing the investigational drug/agent
· Ensure that written orders/prescriptions are obtained prior to dispensing
· Maintain proper documentation of investigational drug received, dispensed, and returned
· Safeguard the investigational drug throughout the trial, which may include preparation, repackaging, proper storage, and destruction of the investigational drug/agent
· Store confidential blinding envelopes and sensitive randomization codes

Fellows
Adding a clinical fellow to the research team provides a twofold benefit: (1) it provides a strong foundation in the fundamentals of clinical research to a junior-level physician through practical experience; and (2) it is a useful strategy for ensuring that responsibilities not assigned to other research team members get done. A clinical fellow will often participate in the clinical activities of the protocol, such as performing AE assessments and concomitant medicine reviews, jointly managing subjects with the PI, participating in research team meetings, and assisting with drafting the clinical trial results and presenting the data.
References
1. McClary KA, Offenhuartz M. Clinical research nurses give new meaning to protect and serve. NurseWeek. April 24, 2006.
2. Cassidy J. The role of a data manager in clinical cancer research. Cancer Nursing. 1993;16(2):131–138.
OTHER RESOURCES
· NCI Investigator's Handbook (http://ctep.cancer.gov/handbook/index/)
· NCI Clinical trial documents (http://www.cancer.gov/clinicaltrials/conducting/)
· NCI/CTEP resources (http://ctep.cancer.gov/)
· GCP FDA Regulations (relating to clinical trial investigators) (http://www.fda.gov/oc/gcp/regulations/)
35 Writing a Consent Form
Christine Grady
An important component of ethical clinical research is informed consent. Planning for informed consent involves deciding what information to provide to potential participants or their legally authorized representatives (LARs), both in writing and in discussions, as well as who is going to present the information, how participant understanding will be assessed, and who will obtain the participant's signature.

PURPOSE AND RATIONALE FOR INFORMED CONSENT
Informed consent is a legal, regulatory, and ethical requirement in clinical research and has become widely accepted as an integral part of the ethical conduct of clinical research (1). Current requirements for informed consent owe much to the legal system, but the underlying values are deeply embedded in American culture and values. Fundamentally, informed consent is based on the ethical principle of respect for persons (2). This principle recognizes and compels us to respect an individual's right to determine his or her own life goals and to make autonomous choices consistent with those goals. Informed consent is also integral to clinical practice, in that patients are given information about the nature, consequences, benefits, risks, and alternatives of a
treatment to help them decide whether to accept the treatment (3). Similarly, informed consent in clinical research is a process by which we enable individuals considering research to exercise their right to autonomous choice about whether or not to participate or continue to participate in the research. Existing research ethics guidelines (4–7), laws, and federal regulations governing clinical research, including the FDA regulations and the Common Rule (8, 9) emphasize the need for the voluntary informed consent of research participants. For example, the 2008 version of the World Medical Association’s Declaration of Helsinki states: In medical research involving competent human subjects, each potential subject must be adequately informed of the aims, methods, sources of funding, any possible conflicts of interest, institutional affiliations of the researcher, the anticipated benefits and potential risks of the study and the discomfort it may entail, and any other relevant aspects of the study. The potential subject must be informed of the right to refuse to participate in the study or to withdraw consent to participate at any time without reprisal. Special attention should be given to the specific information needs of
The views expressed here are those of the author and do not necessarily reflect those of the Clinical Center, the National Institutes of Health, the Public Health Service, or the Department of Health and Human Services.
individual potential subjects as well as to the methods used to deliver the information. After ensuring that the potential subject has understood the information, the physician or another appropriately qualified individual must then seek the potential subject’s freely given informed consent, preferably in writing. If the consent cannot be expressed in writing, the nonwritten consent must be formally documented and witnessed (10). Informed consent includes: (1) Disclosure of relevant information to prospective participants about the research, (2) their understanding of the information and appreciation of what it means for them, and (3) their voluntary agreement to participate (11). By regulation, an institutional review board (IRB) or research ethics committee (REC) reviews and approves the proposed process for obtaining research informed consent as well as the written information to be given to prospective participants before participants are approached. The process of informed decision making by potential research participants about whether or not to enroll in a study generally includes discussion of relevant information about the research study with the principal investigator (PI) and other members of the research team (as appropriate) and reading and signing the written consent document. Participants should be given sufficient time to read and consider the decision and should be encouraged to ask questions before being asked to sign the consent document. After the consent form is signed, ongoing discussion and education of participants appropriate to the nature, type, and duration of the study should continue. The participant retains the right to change his or her mind and withdraw consent for participation.
THE WRITTEN CONSENT DOCUMENT
Early in the research process, well before any potential participants are approached, the investigator prepares a written consent document for review and approval by the IRB. Advertisements, fliers, or brochures that are prepared to recruit and inform potential participants about a study are considered part of the informed consent process and also require review and approval by the IRB. The consent document summarizes information about the study, including an explanation of the research procedures, related risks and possible benefits, alternatives to participation, and the rights of research participants. Discussion between the potential participant and the investigator is often guided by
the written information found in the consent document. Participants may use the written information to understand what the study is about and what is required of them. They may find the consent document helpful to their decision by using it as a tool when discussing their study participation decision with their family, friends, or health care providers. Participants may also refer to the written document as a source of information about the study throughout their participation. Participants usually sign the consent form to indicate that they have agreed to participate in the study. According to the federal regulations, "A copy shall be given to the person signing the form" (12). The consent document for clinical research should be clearly written with the goal of promoting informed decision making by participants. Both the content and format of the written document are important to making it readable and understandable. A consent working group assembled by the National Cancer Institute (NCI) (and described below) developed a checklist for investigators to use as they prepare consent forms for their studies (13). See Table 35.1. Consent documents for clinical research should include the information required by the regulations found at 45 CFR 46.116 and 21 CFR 50.25 (listed in Table 35.2), as well as other information that prospective participants might need in order to make an informed decision about participation. The consent document should clearly state that participation in research is voluntary and should not include any language waiving or appearing to waive participants' rights. To promote understanding, consent documents should be written in language that can be understood by the proposed participant population and that is consistent with their educational level, familiarity with research, and cultural views.
The investigator should strive to write the consent document at a level understandable to research participants and, as recommended by the Good Clinical Practice guidelines of the International Conference on Harmonization (ICH-GCP), "as nontechnical as practical" (14). Consent documents should be clear and concise, and should use short sentences and uncomplicated words. Language should be simple and direct but not simplistic or patronizing (15). Guidelines suggest that documents be written at the 8th grade reading level (16), as approximately 50% of people in the United States read at or below an 8th grade reading level (17, 18). Online programs are available to assess reading level (19). Unfortunately, data show that research consent forms are often complex and written at or above the 12th grade level (17). A more appropriate reading level can be achieved if the text contains short
TABLE 35.1
Checklist for Easy-to-Read Informed Consent Documents. From NCI Simplification of Informed Consent Documents, Appendix 3. http://www.cancer.gov/clinicaltrials/understanding/simplification-of-informed-consent-docs/page1

TEXT
· Words are familiar to the reader. Any scientific, medical, or legal words are defined clearly.
· Words and terminology are consistent throughout the document.
· Sentences are short, simple, and direct.
· Line length is limited to 30–50 characters and spaces.
· Paragraphs are short. Convey one idea per paragraph.
· Verbs are in active voice (i.e., the subject is the doer of the act).
· Personal pronouns are used to increase personal identification.
· Each idea is clear and logically sequenced (according to audience logic).
· Important points are highlighted.
· Study purpose is presented early in the text.
· Titles, subtitles, and other headers help to clarify organization of text.
· Headers are simple and close to text.
· Underline, bold, or boxes (rather than all caps or italics) give emphasis.
· Layout balances white space with words and graphics.
· Left margins are justified. Right margins are ragged.
· Upper and lower case letters are used.
· Style of print is easy to read.
· Type size is at least 12 point.
· Readability analysis is done to determine reading level (should be ≤ 8th grade).
· Avoid abbreviations and acronyms.

GRAPHICS
· Helpful in explaining the text.
· Easy to understand.
· Meaningful to the audience.
· Appropriately located.
· Text and graphics go together.
· Simple and uncluttered.
· Images reflect cultural context.
· Visuals have captions.
· Each visual is directly related to one message.
· Cues, such as circles or arrows, point out key information.
· Colors, when used, are appealing to the audience.
· Avoid graphics that won't reproduce well.
sentences, active voice, and words with no more than two syllables. Substituting scientific or medical terms with words more familiar to the nonscientist can be very helpful. See Table 35.3 for an example. Use of the second person is recommended by the NCI “because it reflects the conversation between the investigator and potential participant” (20). Using charts, pictures, flow charts, or graphics in addition to text may further facilitate understanding (21).
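The readability analyses mentioned above typically report a Flesch-Kincaid grade level, which is computed from average sentence length and average syllables per word using the standard formula 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. The sketch below is only an illustration of that calculation: the function names are arbitrary, and the syllable counter is a naive vowel-group heuristic (real readability tools use pronunciation dictionaries), so its grade estimates are approximate.

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups, discounting a silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # treat a trailing 'e' as silent (e.g., "dose", "safe")
    return max(n, 1)

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```

Applied to wording like the samples in Table 35.3, a counter of this kind will show the simplified version scoring several grade levels below the standard version, though the exact numbers depend on the syllable heuristic used.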
Readability and understandability are also influenced by the format of the consent document. Information should be arranged logically and should be easy to follow. The font size should be readable by the population considering participation. Documents that use text in small font, single spacing, or large chunks of text are difficult to read. Documents that balance words with graphics and white space are more readable. The use of headings in the consent document can
TABLE 35.2
Information to Be Included in the Consent Document (Adapted from 45 CFR 46.116, 21 CFR 50.25, and ICH-GCP E6, 4.8.10).

1. A statement that the study involves research.
2. An explanation of the purpose of the research, an invitation to participate and an explanation of why the subject was selected, and the expected duration of the subject's participation.
3. A description of the procedures to be followed and identification of which procedures are investigational and which might be provided as standard care to the subject in another setting. Use of research methods such as randomization and placebo controls should be explained.
4. A description of any foreseeable risks or discomforts to the subject, an estimate of their likelihood, and a description of what steps will be taken to prevent or minimize them, as well as acknowledgment of potentially unforeseeable risks.
5. A description of any benefits to the subject or to others that may reasonably be expected from the research, and an estimate of their likelihood.
6. A disclosure of any appropriate alternative procedures or courses of treatment that might be advantageous to the subject.
7. A statement describing the extent to which records will be kept confidential, including examples of who may have access to research records, such as hospital personnel, the FDA, and drug sponsors.
8. For research involving more than minimal risk, an explanation and description of any compensation and any medical treatments that are available if subjects are injured through participation; where further information can be obtained; and whom to contact in the event of research-related injury.
9. An explanation of whom to contact for answers to questions about the research and the research subject's rights (including the name and phone number of the principal investigator [PI]).
10. A statement that participation is voluntary and that refusal to participate or a decision to withdraw at any time will involve no penalty or loss of benefits to which the subject is otherwise entitled.
11. A statement indicating that the subject is making a decision whether or not to participate, and that his/her signature indicates that he/she has decided to participate, having read and discussed the information presented.

When appropriate, or when required by the IRB, one or more of the following elements of information will also be included in the consent document:
1. If the subject is or may become pregnant, a statement that the particular treatment or procedure may involve risks, foreseeable or currently unforeseeable, to the subject or to the embryo or fetus.
2. A description of circumstances in which the subject's participation may be terminated by the investigator without the subject's consent.
3. Any costs to the subject that may result from participation in the research.
4. The possible consequences of a subject's decision to withdraw from the research, and procedures for orderly termination of participation.
5. A statement that the PI will notify subjects of any significant new findings developed during the course of the study that may affect them and influence their willingness to continue participation.
6. The approximate number of subjects involved in the study.
7. The plan for follow-up or access to an intervention proven effective through the research.
help the investigator and the IRB to ensure that all the elements required by the regulations are included and communicated to the participants. Headings can also promote comprehension and readability. The NCI template uses questions for headings. In this format, the headings take a conversational tone, such as “Why is this research being done?” Headings can also be in the form of statements, such as “The purpose of this study.”
The written consent document and any other written information provided to participants should be updated when new information becomes available that might be relevant to the participant’s consent or his or her willingness to continue in a study. The IRB/REC reviews and approves revisions to the written consent form before it is provided to research participants. When a participant agrees to participate, it is with the assumption that he or she understands the nature of
TABLE 35.3
Sample Consent Wording at Different Grade Levels.

STANDARD WORDING (Flesch-Kincaid grade 12):
The purpose of this study is to test the safety of XXX at different dose levels in patients with cancer who have different degrees of normal and abnormal liver function. XXX is an experimental drug that has not been approved by the Food and Drug Administration (FDA) for use in patients with cancer. XXX was designed to enter cancer cells and block the activity of proteins that are important for cancer cell growth and survival. This is the first study in which XXX will be given to patients with different degrees of liver function. We already know the safe dose for patients with normal liver function. Other purposes of this study are to find out what side effects occur when XXX is given to patients with different degrees of liver function, how much XXX is in your blood at specific times, and whether or not XXX is effective in treating your cancer.

SIMPLER WORDING (Flesch-Kincaid grade 7.6):
We want to find out what dose of XXX is safe in patients with cancer whose livers are or are not working normally. XXX is an experimental drug made to try to block the growth of cancer cells. It is not approved by the Food and Drug Administration for patients with cancer. This is the first time we will give XXX to people who have abnormal liver tests. We also want to learn what side effects XXX has, how much is in your blood at specific times, and whether or not XXX works to treat your cancer.
the study, benefits, risks, and alternatives to participating, as well as the fact that participation is voluntary. Unfortunately, participants do not always understand information that might be relevant to an informed decision about participation. Although research has shown that the majority of research participants report satisfaction with the process of informed consent, many fewer understand important aspects of the studies in which they are participating (22–27).
THE NATIONAL CANCER INSTITUTE WORKING GROUP AND TEMPLATE
Many have expressed concern that consent documents for clinical research are long and complicated and written in a way that is difficult for people to understand (22). In recent years, consent forms for clinical research have only become longer and more difficult (28). Some worry that they are overly legalistic and complex, widely considered legal documents, and therefore less likely to be read or comprehended by potential research participants (29). Recognizing this problem, several groups, including the Association of American Medical Colleges (AAMC) and the NCI, have developed initiatives to improve research consent forms (30, 31). In one such initiative, the NCI, in collaboration with the U.S. Food and Drug Administration (FDA) and the U.S. Office for Human Research Protections (OHRP), convened a diverse group of experts in the late 1990s to propose solutions. The NCI group offered specific "Recommendations for the Development of Informed Consent Documents for Cancer Clinical Trials" (31), in
which they discussed the problem and delineated recommendations for how to write a consent document and how to communicate with research participants. This group also released a consent form template for cancer treatment trials. The template includes all the elements of consent required by the federal regulations, and offers sample template language as well as guidance for investigators regarding how to describe the purpose, regimens, risks, benefits, alternatives, and confidentiality protections as they write consent documents (see http://www.cancer.gov/clinicaltrials/understanding/simplification-of-informed-consent-docs/page2) (31). With respect to risks, for example, the NCI recommends that the consent document not only describe physical risks, but also include nonphysical risks such as the time commitment, financial implications, and psychological effects when these factors could potentially affect participants' decisions about enrolling. The NCI recommendations suggest describing risks associated with the trial according to their likelihood and severity in the following categories: (1) very likely, regardless of severity; (2) less likely, but serious; or (3) rare, but relatively severe. They further suggest that the investigator not describe in the research consent document the risks of standard medical care that the participant would receive if he or she were not participating in the clinical trial (e.g., placement of a central venous catheter), but continue to obtain consent for such procedures as part of usual medical care. The NCI template proposes that risks be presented together for the entire treatment regimen rather than listed for each specific drug or procedure in the study since, for example, many chemotherapy drugs have similar risks.
EXCEPTIONS TO THE REQUIREMENT FOR WRITTEN INFORMED CONSENT
A written document signed by the research participant is required by the FDA regulations and the ICH-GCP guideline: "Informed consent is documented by means of a written, signed and dated informed consent form" (32). Written informed consent is also required, with clearly defined exceptions, by the Common Rule: "informed consent shall be documented by the use of a written consent form approved by the IRB and signed by the subject or the subject's legally authorized representative" (33). However, for certain studies or certain participant groups, written consent may be inappropriate. The federal Common Rule allows the IRB, under certain specified circumstances, to approve an oral consent procedure, approve changes to some of the required elements of informed consent, or allow a waiver of informed consent or of the need to document consent. In order to waive the requirement for informed consent or to change some of the elements of informed consent, the IRB must determine that (1) the research involves no more than minimal risk to participants, (2) waiving or changing the requirements "will not adversely affect the rights and welfare of subjects," (3) the research could not practicably be carried out without the waiver, and (4) when appropriate, additional information will be given to participants at the conclusion of the study (34). Importantly, the FDA regulations do not allow a waiver of informed consent except under a specific exception for emergency research (35); written informed consent is required by the FDA in all studies testing drugs, biologics, or devices under an investigational new drug (IND) application or an investigational device exemption (IDE). Federal regulations require that investigators document informed consent using a written consent form that is approved by an IRB and signed by the participant or a LAR, except in limited circumstances delineated in the regulations (36).
As described earlier (and in Table 35-1), this written consent form should contain certain information about the study and the participant’s rights. The federal regulations allow, with IRB approval, the use of a short written consent form that does not include all the detailed information usually found in a standard written consent form. The short form is used to complement an oral presentation of the relevant study information to the potential participant. According to the regulations, the IRB must approve a summary of what is to be included in the oral presentation, there must be a witness to the oral presentation, and the participant or LAR must sign the short form for the
record (37). This process can be useful, for example, when the consent form is written in English and the prospective participant cannot read English. An interpreter conversant in the language of the participant works with the investigator to explain information about the study to the participant and answer questions, and a short form can be used to document that the participant consented. Some institutions, such as the intramural National Institutes of Health (NIH), have created short consent forms in multiple languages to be used along with a detailed explanation of the particulars of the study through an interpreter. What should happen when the prospective research participant or LAR is unable to read? ICH-GCP recommends a witnessed process of providing oral information to the participant, including reading and explaining to the participant the information in the written consent form or other written documents. The participant should then give an oral indication of agreement to participation and, if capable, should sign and date the consent form. In these cases, a witness, who is independent of the participant and the research team, should also sign and date the consent form, attesting that the information in the written documents was accurately explained to the participant and that informed consent was freely given (38). In certain studies or for certain participants, the investigator may think that a signed consent form presents a risk to the participant. The Common Rule allows a waiver of the requirement for a signed consent form if the IRB determines that (1) there is a risk from a breach of confidentiality and the only link between the subject and the research would be the consent document, or (2) the research is minimal risk and involves no procedures for which written consent is normally required outside of the research context (39).
CONCLUSION

Informed consent, a central ethical requirement for most clinical research, is a process that includes disclosing study information to potential participants so that they can make informed and voluntary choices about participating or continuing to participate in a cancer research study. Study information is usually disclosed to participants both in a written consent document and through discussions with the investigator and research team. Unfortunately, written consent forms for research are often long, complex, written at a high reading level, and difficult to read and understand. Investigators should strive to write consent documents in a way that facilitates understanding among
35 WRITING A CONSENT FORM
participants. Consent documents should be written in simple language, include information relevant to participant decisions and required by regulations, and use a coherent and reader-friendly format. Following NCI and other recommendations for simplifying writing, using charts and graphics, and employing template language with appropriate study-specific modifications is likely to result in better consent documents and better-informed research participants.
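The reading-level problem can be screened for mechanically. As a minimal illustration only, the sketch below implements the standard Flesch-Kincaid grade-level formula in Python with a crude vowel-group heuristic for syllable counting; it is not one of the validated readability tools referenced in this chapter, and the function names are our own.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels, drop a likely
    # silent trailing 'e', and give every word at least one syllable.
    word = word.lower()
    if word.endswith("e") and not word.endswith(("le", "ee")):
        word = word[:-1]
    vowel_groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(vowel_groups))

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid grade level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Applied to a plain sentence such as “You may feel sick to your stomach.”, the formula returns a low grade level, whereas jargon-heavy phrasing such as “Participants may experience gastrointestinal toxicity.” scores far above the eighth-grade range generally recommended for consent documents; the validated tools cited in the references should be used for actual consent-form review.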
References
1. Emanuel E, Wendler D, Grady C. What makes clinical research ethical? J Am Med Assoc. 2000;283(20):2701–2711.
2. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. Washington, DC: U.S. Government Printing Office; 1979.
3. Berg J, Appelbaum P, Lidz C, Parker L. Informed Consent. New York: Oxford University Press; 2001.
4. The Nuremberg Code; 1949. www.hhs.gov/ohrp/references/nurcode.htm.
5. World Medical Association. Declaration of Helsinki, Ethical Principles for Medical Research Involving Human Subjects, October 2008. http://www.wma.net/e/policy/b3.htm. Accessed November 5, 2008.
6. Council for International Organizations of Medical Sciences. International Ethical Guidelines for Biomedical Research Involving Human Subjects. Geneva: CIOMS/WHO; 2002. www.cioms.ch. Accessed August 12, 2008.
7. International Conference on Harmonisation (ICH) Harmonised Tripartite Guideline, Guideline for Good Clinical Practice, E6(R1), Current Step 4 Version, dated 10 June 1996. http://www.ich.org/LOB/media/MEDIA482.pdf. Accessed November 8, 2008.
8. U.S. Code of Federal Regulations Title 21, Part 50, “Protection of Human Subjects.” www.fda.gov. Accessed August 19, 2008.
9. U.S. Code of Federal Regulations Title 45, Part 46 (45 CFR 46). www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm. Accessed August 19, 2008.
10. Op. cit., World Medical Association (ref. 5). http://www.wma.net/e/policy/b3.htm.
11. Beauchamp T, Childress J. Principles of Biomedical Ethics. 4th ed. New York: Oxford University Press; 1994.
12. U.S. Code of Federal Regulations 45 CFR 46.117. www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
13. Comprehensive Working Group on Informed Consent in Cancer Clinical Trials. Recommendations for the development of informed consent documents for cancer clinical trials. National Cancer Institute. http://www.cancer.gov/clinicaltrials/understanding/simplification-of-informed-consent-docs/page1. Accessed August 19, 2008.
14. ICH-GCP. http://www.ich.org/LOB/media/MEDIA482.pdf, item 4.86, page 21.
15. Wright N. Plain Language. From the EPA Writing Course. http://www.plainlanguage.gov/whatisPL/definitions/wright.cfm. Accessed October 3, 2008.
16. Office of Human Subjects Research, National Institutes of Health. Information Sheet #6: Guidelines for Writing Informed Consent Documents. http://ohsr.od.nih.gov/info/sheet6.html. Accessed October 3, 2008.
17. Paasche-Orlow M, Taylor H, Brancati F. Readability standards for informed consent forms as compared with actual readability. N Engl J Med. 2003;348:721–726.
18. Davis TC, Williams MV, Marin E, et al. Health literacy and cancer communication. CA Cancer J Clin. 2002;52:134–149.
19. Online readability tests. www.readability.info or www.writerservices.com/wps/s1_readability_score.htm.
20. Comprehensive Working Group on Informed Consent in Cancer Clinical Trials. Recommendations for the development of informed consent documents for cancer clinical trials. National Cancer Institute. http://www.cancer.gov/clinicaltrials/understanding/simplification-of-informed-consent-docs/page1. Accessed August 19, 2008.
21. National Cancer Institute. Clear and Simple: Developing Effective Print Materials for Low-Literate Readers. http://www.cancer.gov/aboutnci/oc/clear-and-simple/page10. Accessed November 1, 2008.
22. Jefford M, Mileshkin L, Raunow H, et al. Satisfaction with the decision to participate in cancer clinical trials is high, but understanding is a problem. J Clin Oncol. 2005;23(suppl 16):6067.
23. Sugarman J, McCrory DC, Powell D, et al. Empirical research on informed consent: an annotated bibliography. Hastings Center Rep. 1999;29:S1–S42.
24. Penman DT, Holland JC, Bahna GF, et al. Informed consent for investigational chemotherapy: patients’ and physicians’ perceptions. J Clin Oncol. 1984;2:849–855.
25. Olver I, Buchanan L, Laidlaw C, Poulton G. The adequacy of consent forms for informing patients entering oncological clinical trials. Ann Oncol. 1995;6:867–870.
26. Joffe S, Cook E, Cleary P, et al. The quality of informed consent in cancer clinical trials: a cross-sectional survey. Lancet. 2001;358:1772–1777.
27. Sharp S. Consent documents for oncology trials: does anybody read these things? Am J Clin Oncol. 2004;27:570–575.
28. Beardsley E, Jefford M, Mileshkin L. Longer consent forms for clinical trials compromise patient understanding: so why are they lengthening? J Clin Oncol. 2007;25(9):e13–e14. www.jco.ascopubs.org. Accessed August 19, 2008.
29. Jefford M, Moore R. Improvement of informed consent and the quality of consent documents. Lancet Oncol. 2008;9:485–493.
30. Association of American Medical Colleges. Universal Use of Short and Readable Informed Consent Documents: How Do We Get There? http://www.aamc.org/research/clinicalresearch/hdickler-mtgsumrpt53007.pdf. Accessed November 3, 2008.
31. Comprehensive Working Group on Informed Consent in Cancer Clinical Trials. Recommendations for the development of informed consent documents for cancer clinical trials. National Cancer Institute. http://www.cancer.gov/clinicaltrials/understanding/simplification-of-informed-consent-docs/page1. Accessed August 19, 2008.
32. ICH-GCP. http://www.ich.org/LOB/media/MEDIA482.pdf.
33. U.S. Code of Federal Regulations 45 CFR 46.117(b). www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
34. U.S. Code of Federal Regulations 45 CFR 46.116(d). www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
35. U.S. Code of Federal Regulations 21 CFR 50.23 and 50.24. www.fda.gov.
36. U.S. Code of Federal Regulations 45 CFR 46.117. www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
37. U.S. Code of Federal Regulations 45 CFR 46.117(b); 21 CFR 50.27(b). www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm and www.fda.gov.
38. ICH-GCP 4.8.9. http://www.ich.org/LOB/media/MEDIA482.pdf.
39. U.S. Code of Federal Regulations 45 CFR 46.117(c). www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
36
How Cooperative Groups Function
Edward L. Trimble Alison Martin
The National Cancer Institute (NCI)-sponsored Clinical Trials Cooperative Groups represent the largest publicly funded oncology clinical trials organization in the world. Now numbering 12 and having evolved over a 50-year period, the Cooperative Groups offer a sophisticated multi-institutional network of multimodality, multidisciplinary researchers who work collaboratively to develop and conduct clinical trials that address national priorities in oncology. Such a structure enables team research across institutions based on expertise, while using standardized collection and reporting tools to maximize the quality of the research and the robustness of results across sites and over time. The Cooperative Group treatment sites are located primarily throughout the United States, as well as in Canada and Europe (see Fig. 36.1). Approximately 30,000 patients are enrolled per year on clinical trials that span the spectrum of prevention, screening, diagnosis, treatment, and quality-of-life research. By systematically collecting and banking specimens that are characterized by controlled treatments and longitudinal clinical outcome data, the Cooperative Groups provide a unique platform for the translational science that will lead to the vision of personalized medicine in the future. Over the decades, a standing national consortium that incorporates continuous peer review has proven to be an important resource for moving rapidly evolving science into the clinic. This mechanism
for clinical trials stands in contrast to the alternative of individual grant applications proposing to conduct a specific clinical trial. Without a standing infrastructure, for each new trial undertaken, the trial sponsors must recruit a new group of participating sites, educate a new group of investigators, and set up a new method of data collection. A national consortium has the potential to:
· Speed enrollment. As improvements in cancer treatment have led to increased survival, the size of clinical trials has correspondingly increased. Large randomized trials are often necessary to detect small benefits and to exploit the developments in molecular biology that define specific patient populations.
· Open participation to a broader community. This minimizes the potential for bias due to inadvertent over-accrual of specific patient populations or reliance on a limited number of health care professionals and hospitals.
· Support a broad research portfolio: phase II trials of combination strategies to develop potential regimens for phase III trials; concurrent phase III trials; trials in multiple cancers; rapid sequencing of trials in the same population; trials in rare populations; studies of combined-modality approaches to a cancer, which may or may not involve a drug question; targeted inclusion of underserved populations; symptom management;
FIGURE 36.1 National Cancer Institute U.S. clinical trials treatment sites, 2000.
health-related quality of life; and cancer prevention. Establishment of an ongoing clinical trials infrastructure facilitates the conduct of multiple trials for a variety of cancers simultaneously, as well as a rapid sequence of trials for the same patient population.
· Develop cost-effective administrative, data management, and quality assurance programs that meet regulatory requirements.
· Promote intellectual cross-fertilization and synergy through regular interaction among clinical investigators from multiple modalities outside one’s institution who plan and conduct trials together.
Since the year 2000, approximately 25,000 to 30,000 new patients have been enrolled onto Cooperative Group treatment studies each year; 12,000 to 14,000 patients have been evaluated annually on ancillary laboratory correlative studies; and many times this combined number have been in follow-up.
BRIEF HISTORY The inception of the Cooperative Groups program dates to 1955, when cancer was viewed as a major public health problem, but cancer drug development was not sufficiently lucrative to attract the necessary private sector investment. Dr. Sidney Farber, the medical research philanthropist Mary Lasker, and others approached the U.S. Congress with a proposal to increase support for
chemotherapy treatment trials in cancer. The U.S. Congress awarded $5 million to the NCI in order to establish a Chemotherapy National Service Center. By 1958, 17 groups had been organized with the main purpose of testing new anticancer agents from the NCI’s investigational agent development program. Today, drug development partnerships extend to academia and industry. Some of the original 17 groups have coalesced or disbanded. The most recently formed Cooperative Group, the American College of Radiology Imaging Network (ACRIN), was created in 1999 to address the growing need for early, noninvasive diagnostic, staging, predictive, and prognostic markers. Currently, there are 12 Cooperative Groups (see Table 36.1). The groups differ in research focus and structure, with three groups focusing on specific patient populations (Children’s Oncology Group, Gynecologic Oncology Group, National Surgical Adjuvant Breast and Bowel Project), three groups focusing primarily on a specific modality (Radiation Therapy Oncology Group, American College of Surgeons Oncology Group, ACRIN), and the rest performing studies in a variety of adult cancers across the spectrum of prevention, cancer control, treatment, and translational science. The common theme for all the groups is their charge to develop, conduct, and participate in treatment trials in a multi-institutional setting. The unique contributions of the Cooperative Groups, their number as well as their research strategies, must necessarily change over time in order to be synergistic with other stakeholders in
TABLE 36.1
Oncology Clinical Trials Cooperative Groups Financially Supported by the U.S. NCI/NIH (name, Web site, primary focus)

Multimodality, U.S.-based:
· Cancer and Leukemia Group B (CALGB), http://www.calgb.org: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers
· Eastern Cooperative Oncology Group (ECOG), http://www.ecog.org: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers
· National Surgical Adjuvant Breast and Bowel Project (NSABP), http://www.nsabp.pitt.edu: prevention, local, adjuvant, and treatment trials for breast and colorectal cancers
· North Central Cancer Treatment Group (NCCTG), http://ncctg.mayo.edu: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers
· Southwest Oncology Group (SWOG), http://www.swog.org: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers

Specialty, U.S.-based:
· American College of Radiology Imaging Network (ACRIN), http://www.acrin.org: management and development of imaging technologies
· American College of Surgeons Oncology Group (ACOSOG), http://www.acosog.org: evaluation of surgical therapies in solid tumors
· Children’s Oncology Group (COG), http://www.childrensoncologygroup.org: clinical trials in children and adolescents
· Gynecologic Oncology Group (GOG), http://www.gog.org: clinical trials, including cancer control and treatment, for women with gynecological malignancies
· Radiation Therapy Oncology Group (RTOG), http://www.rtog.org: radiotherapy approaches, including multimodality, to improve cancer control, survival, and quality of life

Multimodality, international:
· European Organization for Research and Treatment of Cancer (EORTC), http://www.eortc.be/default.htm: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers
· National Cancer Institute of Canada Clinical Trials Group (NCIC CTG), http://www.ctg.queensu.ca: multidisciplinary cancer control, treatment, and translational science in a variety of solid and hematologic cancers
iterative, with trends toward greater coordination across NCI programs (e.g., those responsible for treatment and prevention trials, quality of life, translational science, and multimodality interventions), while maximizing input from extramural research leaders, all at an earlier stage of concept and protocol development. In 2005, the NCI’s Clinical Trials Working Group recommendations led to national disease-specific scientific Steering Committees (SCs) to set scientific priorities within a disease and provide peer review for individual trials (http://ccct.nci.nih.gov). The composition of the SC can be tailored to the scientific needs of the
disease, but must include members of each of the Cooperative Groups with active research in that disease, SPOREs, Cancer Centers, biostatisticians, basic science/translational researchers, community oncologists, patient advocates, and NCI staff. The need for such central prioritization and coordination of the research agenda is exemplified by the scope of breast cancer research, in which 10 of the Cooperative Groups listed in Table 36.1 are active.

INTERGROUP COLLABORATION

Collaboration across the Cooperative Groups is encouraged at multiple levels: through Cooperative Group Chair meetings, across the disease-specific Scientific Committees, in the disease-specific SCs, and through efforts to enhance accrual through the centralized portal of the Cancer Trials Support Unit (CTSU; www.ctsu.org). Each of these collaborative efforts contributes to coordinating the research agenda and speeding accrual both to large phase III trials and to trials addressing the needs of uncommon disease types. To facilitate intergroup collaboration, the NCI and the Cooperative Groups have undertaken major efforts to harmonize various aspects of clinical trial design and conduct. These include standardized data collection (Common Data Elements), standardized criteria for response to and progression during chemotherapy (RECIST), standardized assessment of treatment toxicity (CTCAE), and a model protocol template. In conjunction with the NCI’s Cancer Diagnosis Program, the Cancer Therapy Evaluation Program (CTEP) and the Cooperative Groups develop consensus criteria for specimen collection and handling within and across trials and Cooperative Groups. The NCI, in conjunction with the Cooperative Groups, maintains a central registry of investigators active in NCI-sponsored clinical trials to meet FDA requirements, as well as a registry to track local IRB approvals of protocols and amendments, and yearly IRB reviews.
The NCI is also working with the U.S.-based Cooperative Groups to develop a central Web-based patient registration and randomization hub, as well as common software for remote data capture. (Visit http://ctep.cancer.gov for further information.)

STRUCTURE OF THE COOPERATIVE GROUPS

While the Cooperative Groups’ organizational structures vary, all are governed by a constitution and bylaws, led by a Cooperative Group chair (or cochairs),
with the input and oversight of a Board of Governors and/or Executive Committee, and sometimes a Scientific Advisory Board. The three major functional units are Headquarters (or the Operations Center), the Statistical and Data Management Center (SDMC), and the participating sites. Headquarters serves as the central office for administration and management of the Cooperative Group and its committees, including protocol development, quality assurance/control, education, auditing, and compliance with NCI and federal regulations. Cooperative Group committees may be scientific or administrative, with the former actively involved in conducting clinical trials. The SDMC is responsible for standard operating procedures for major functions such as data management, study monitoring, data analysis, and the operations of the Data Safety and Monitoring Committee. A critical function of the Cooperative Groups is bringing together multidisciplinary researchers, clinical trialists, and patient advocates on a regular basis. In general, a Cooperative Group convenes face-to-face meetings every 6 to 12 months, with frequent conference calls and email communication in between. In some cases, Cooperative Group members are institutions or participating sites, which may be further subdivided into main member and affiliated member institutions. In other cases, Cooperative Group members are individual investigators or practices. Investigators who participate in group trials work at many different sites, including NCI-designated Cancer Centers, university teaching hospitals, large community hospitals, military hospitals, veterans’ hospitals, and small oncology practices. The NCI’s Division of Cancer Prevention undertook a major effort to extend clinical trials into the community setting through its Community Clinical Oncology Program (CCOP) and to make trials available to historically underserved populations through the Minority-Based CCOPs.
More than 20% of accruals to NCI-sponsored Cooperative Group trials come from CCOP and Minority-Based CCOP sites.
DEVELOPMENT OF A RESEARCH STRATEGY

Each Cooperative Group must undergo regular peer review at 4- to 6-year intervals. Cooperative Groups are judged on a variety of parameters, including scientific leadership, timeliness of study completion, quality of data, and track record of publications. In general, the process of setting a research agenda and developing concepts begins at the level of a Scientific Committee or disease site committee. At the annual or semiannual Cooperative Group meeting, committees
review progress on open trials, prioritize the research agenda, and discuss ideas for future studies. The Cooperative Groups buttress trial leadership through partial salary support for committee chairs and principal investigators of large trials. Typical responsibilities of the Committee Chair are to:
· Coordinate all research activities of the committee
· Interact with Headquarters and the SDMC
· Direct the annual or semiannual Committee meetings
· Prepare periodic progress reports and prioritize studies
· Direct the development of new concepts and protocols
· Monitor abstract and manuscript submissions and approve authorship selections

DEVELOPMENT OF A CONCEPT AND PROTOCOL
Each Cooperative Group has a standard concept sheet with development steps and timelines which, when approved by the subcommittee, Headquarters, and finally the national SC, is formatted according to the Cooperative Group’s protocol template, which also has its own development steps and timelines. Each committee is supported by a dedicated biostatistician and a nonphysician coordinator of concepts and protocols. Protocol development includes coordination of input from the relevant clinical investigators, biostatisticians, pharmacists, translational research scientists, and data managers. The final protocol document must be approved by the NCI, after which it is disseminated to the local investigators, who must obtain approval for opening the trial from their local institutional review board (IRB) and clinical research office. The Cooperative Groups have clearly defined responsibilities for the study chair, or principal investigator (PI):
· During generation of the concept and protocol, the PI is responsible for responding to comments from Cooperative Group Headquarters, the NCI-coordinated SC, and Central IRB reviews.
· Once the protocol is activated, the PI is responsible for monitoring the progress of the study and the evolving safety data, submitting amendments to the protocol, and presenting to the Data and Safety Monitoring Board (DSMB) as needed.
· At the completion of the study, the PI collaborates with statistical, data management, and quality assurance staff to analyze and report the results.
Each Cooperative Group’s Headquarters has the responsibility of executing and negotiating all pharmaceutical contracts; setting up drug distribution
mechanisms (if needed); submitting the concept and protocol to the SC; developing case report forms (CRFs); broadcasting activation of the protocol; maintaining database and Web registration; processing IRB approvals; and overseeing safety reporting and the function of the DSMB, regulatory compliance, and quality assurance/quality control (QA/QC). QA/QC may include central review of pathology, imaging studies, operative notes, and radiation treatment planning. The NCI supports two groups to assist with quality assurance for radiation oncology: the Radiological Physics Center (RPC) and the Quality Assurance Review Center (QARC). The RPC ensures that radiation dosimetry is standard at participating institutions, whereas QARC reviews the quality of radiation treatment planning as well as imaging studies. In addition, the NCI requires that each Cooperative Group conduct on-site audits of each participating center every 3 years to ensure that all research is conducted in accordance with Good Clinical Practice as defined by the International Conference on Harmonization (ICH). These audits include confirmation of reported data against original source documentation for at least 10% of enrolled patients. In addition, each Cooperative Group is required to have an independent DSMB to ensure that all trials are conducted in accordance with National Institutes of Health (NIH) and NCI guidelines. In this era of molecular oncology and the development of targeted therapies, the NCI and the Cooperative Groups encourage incorporation of relevant translational research linked to protocols through partnership with other funded projects, such as the Specialized Programs of Research Excellence (SPOREs), program projects (P01s), investigator-initiated grants (R01s), and private sector support. The Cooperative Groups also maintain central tumor banks for specimens collected from patients enrolled on Cooperative Group studies and (in some cases) core laboratories for analysis of specimens.
These large collections of specimens with annotated, longitudinal clinical information are arguably the most valuable resource of the Cooperative Groups; they can facilitate the identification and validation of prognostic and predictive markers and new therapeutic targets, as well as the identification of appropriate subpopulations for targeted therapies, in essence allowing a virtual trial to be conducted.
COST OF COOPERATIVE GROUP TRIALS

The local costs of participation in Cooperative Group trials must also be considered. These include the staff time required to process protocols through local scientific and ethics committees, as well as the time of doctors
TABLE 36.2
Opportunities and Challenges of International Collaboration in Cancer Treatment Trials

OPPORTUNITIES
· Faster accrual from more sites for patients with common cancers and stages of disease
· Faster accrual for patients with rare tumors, specific molecular defects, less common histologic subtypes, and less common stages of disease
· Broader applicability of research results
· Fewer duplicative trials
· More complementary trials
· More rapid dissemination of innovations in cancer treatment

CHALLENGES
· Differing regulations between countries
· Differing levels of infrastructure support for cancer clinical trials between countries
· Differing processes and schedules for scientific review by funding bodies between countries
· Longer lead time for concept and trial development
· Differing licensing arrangements for specific drugs between countries
· Differing standards of care in medical practice and consensus on the control arm
· Differing reimbursement to help support the expenses of the trial
and nurses required to educate their colleagues about open and upcoming clinical trials, discuss trials with potential trial participants and their families, obtain informed consent from patients who choose to enroll on clinical trials, administer the prescribed protocol therapy, manage toxicities related to treatment, gather and submit the data required at entry and thereafter, and perform surveillance after treatment according to protocol. Some protocols also require the active participation of institutional pharmacists, pathologists, radiologists, psychologists, and other health care professionals. A recent survey of institutions active in clinical trials estimated that a hypothetical industry-sponsored phase III trial that accrued 20 patients over 12 months would require 3,998 hours of staff time, whereas a hypothetical NCI-sponsored phase III trial would require 4,012 hours. At present, the average per-capita reimbursement provided by the NCI to the participating institution is $2,000 per patient (1). The NCI and the Cooperative Groups are developing an algorithm to augment this payment for trials that are particularly complex.
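To put those survey figures in perspective, the back-of-the-envelope arithmetic below (our own illustration, using only the numbers quoted in the text) shows how the standard per-capita reimbursement compares with the estimated staff effort per enrolled patient.

```python
# Figures quoted in the text above (illustrative arithmetic only).
trial_hours = 4012        # hours for a hypothetical NCI-sponsored phase III trial
patients_accrued = 20     # patients accrued over 12 months
reimbursement_usd = 2000  # average NCI per-capita payment per patient

# Derived per-patient workload and the effective hourly reimbursement rate.
hours_per_patient = trial_hours / patients_accrued
usd_per_hour = reimbursement_usd / hours_per_patient

print(f"Staff effort: about {hours_per_patient:.0f} hours per enrolled patient")
print(f"Effective reimbursement: about ${usd_per_hour:.2f} per staff hour")
```

At roughly 200 hours of staff time per enrolled patient, the standard $2,000 payment works out to less than $10 per hour of effort, which illustrates why augmented payments for particularly complex trials are under development.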
INTERNATIONAL COLLABORATION

In the current global economy, it is reasonable to explore when international collaboration can maximize resources and patient accrual, facilitate the conduct of clinical trials, and speed the translation of outcomes into patient benefit. Efforts to harmonize global regulatory guidelines have been ongoing since 1990 (the International Conference on Harmonisation of
Technical Requirements for Registration of Pharmaceuticals for Human Use; www.ich.org), but many hurdles remain (see Table 36.2). The NCI and the Cooperative Groups support international collaboration in principle and, when possible, in practice. The document “Cooperative Group Guidelines for the Development, Conduct and Analysis of Clinical Trials with International Collaborating Institutions” was developed as a joint effort among the NCI, the Office for Human Research Protections (OHRP), and the FDA to help investigators navigate the international regulatory landscape (http://ctep.cancer.gov/branches/ctmb/clinicalTrials/docs/nci_clin_intl_guidelines.pdf). Successful examples of international collaboration are still emerging. The COG, NSABP, and RTOG, for example, have had strong support from Canadian investigators since their inception. More recently, hospitals in Australia, Ireland, Japan, Korea, New Zealand, Peru, Switzerland, and South Africa have joined the roster of sites active in NCI-sponsored trials. Two of the groups listed in Table 36.1, the EORTC and the NCIC CTG, are based outside the United States. The NCI’s support for these Cooperative Groups is intended to facilitate accrual to trials of critical importance to U.S. cancer patients. NCI financial sponsorship of the EORTC is limited to partial support of the EORTC data and statistical center; sponsorship of the NCIC CTG is limited to partial support of the NCIC CTG data and statistical center, plus per-capita payments for patients accrued to studies led by one of the U.S.-based Cooperative Groups. It is reasonable to expect, however, that
TABLE 36.3
Examples of Clinical Trial Contributions from the Cooperative Groups

Cancer and Leukemia Group B (CALGB): Efficacy of 7 + 3 induction therapy for AML; survival advantage with all-trans retinoic acid in APL; PFS benefit of fludarabine as first-line therapy of CLL (1)

Eastern Cooperative Oncology Group (ECOG): Improved outcome with weekly paclitaxel added to adjuvant AC; improved TTP with bevacizumab added to chemotherapy in first-line therapy of metastatic breast cancer; bevacizumab in the treatment of patients with advanced colorectal cancer (2)

National Surgical Adjuvant Breast and Bowel Project (NSABP): Establishment of lumpectomy and radiation over mastectomy; chemoprevention with tamoxifen or raloxifene; adjuvant therapies for breast and colorectal cancers (3)

North Central Cancer Treatment Group (NCCTG): Survival advantage of adjuvant therapy in early-stage colon cancer; survival advantage with addition of postoperative continuous-infusion 5-FU to radiation in rectal cancer; new standard of FOLFOX4 in first-line therapy of advanced colon cancer (4)

Southwest Oncology Group (SWOG): Prostate Cancer Prevention Trial demonstrating finasteride reduces the period prevalence of prostate cancer; improved survival in renal cancer when cytoreductive nephrectomy precedes interferon alpha; imatinib in unresectable or metastatic GIST (5)

American College of Radiology Imaging Network (ACRIN): Digital vs. conventional mammography for breast screening; accuracy of CT colonography for detection of large adenomas and cancers (6)

American College of Surgeons Oncology Group (ACOSOG): Improved recurrence-free survival with adjuvant imatinib in GIST

Children's Oncology Group (COG): Less intense treatment for advanced neuroblastoma achieves high survival rates; improved survival and cures in childhood ALL; curative therapy for Wilms tumor (7)

Gynecologic Oncology Group (GOG): Improved survival in advanced ovarian cancer with paclitaxel vs. cyclophosphamide when added to cisplatin; role for intraperitoneal chemotherapy in stage III residual disease; improved survival with chemoradiotherapy in cervical cancer (8)

Radiation Therapy Oncology Group (RTOG): Predictive markers in oligodendroglioma and anaplastic oligoastrocytoma for response to chemoradiotherapy; improved survival in prostate cancer when radiotherapy is combined with hormone suppression; improved survival in advanced pancreatic cancer when gemcitabine is added to standard therapy after surgery

European Organisation for Research and Treatment of Cancer (EORTC): Imatinib in unresectable or metastatic GIST; rituximab maintenance improves outcomes in relapsed/resistant follicular NHL; adjuvant interferon alpha vs. observation in patients with stage IIB/III melanoma

National Cancer Institute of Canada Clinical Trials Group (NCIC CTG): Intermittent vs. continuous androgen ablation in advanced prostate cancer; radiotherapy with and without temozolomide in glioblastoma; improved DFS for letrozole vs. placebo after 5 years of adjuvant tamoxifen
1. Green MR, George SL, Schilsky RL. Tomorrow's cancer treatments today: the first 50 years of the Cancer and Leukemia Group B. Semin Oncol. 2008;35:470-483.
2. Giantonio BJ, Forastiere AA, Comis RL. The role of the Eastern Cooperative Oncology Group in establishing standards of cancer care: over 50 years of progress through clinical research. Semin Oncol. 2008;35:494-506.
3. Wickerham DL, O'Connell MJ, Costantino JP, et al. The half century of clinical trials of the National Surgical Adjuvant Breast and Bowel Project. Semin Oncol. 2008;35:522-529.
4. Grothey A, Adjei AA, Alberts SR, et al. North Central Cancer Treatment Group—achievements and perspectives. Semin Oncol. 2008;35:530-544.
5. Coltman CA. The Southwest Oncology Group: progress in cancer research. Semin Oncol. 2008;35:545-552.
6. Hillman BJ, Gatsonis C. The American College of Radiology Imaging Network—clinical trials of diagnostic imaging and image-guided treatment. Semin Oncol. 2008;35:460-469.
7. O'Leary M, Krailo M, Anderson JR, et al. Progress in childhood cancer: 50 years of research collaboration, a report from the Children's Oncology Group. Semin Oncol. 2008;35:484-493.
8. Omura GA. Progress in gynecologic cancer research: the Gynecologic Oncology Group experience. Semin Oncol. 2008;35:507-521.
leveraging resources across public, private, and nonprofit sectors, and across countries, will assume greater importance as the pressure to contain health care costs while pursuing the goal of personalized medicine increases.

CONCLUSION

The Cooperative Groups have been responsible for major advances in virtually every area of oncology (see Table 36.3). Despite their ample senior leadership, all the Cooperative Groups value the human capital represented by young investigators and support them in a variety of ways (e.g., training workshops, opportunities to serve as investigators on clinical trials, and awards).
References
1. Emanuel EJ, Schnipper LE, Kamin DY, et al. The costs of conducting clinical research. J Clin Oncol. 2003;21:4081-4082.
37
Adaptive Clinical Trial Design in Oncology
Elizabeth Garrett-Mayer
Adaptive designs have become increasingly popular in recent years due to a number of changes in oncology research. First, the number of candidate treatments considered for clinical trials has grown faster than the number of patients eligible for trials in recent decades, creating a demand for greater efficiency in trial design. Second, many new agents are targeted therapies that may only be tested in patients who have a particular genetic mutation or protein overexpression. As tumor biology is better understood, cancers are further subtyped, as with hormone-receptor and HER2 status in breast cancer. This subtyping, although a step toward better cures, creates challenges for trial design because it restricts eligibility to potentially small numbers of patients. Another contributing factor is the greater involvement of patient advocates on oncology trial design teams. Advocates are extremely valuable in representing the patient perspective and have been vocal in their preference for adaptive designs in oncology. Their rationale is driven mainly by the desire to allow more patients to be treated with more effective treatments and at therapeutically active levels of treatment (1, 2). What is an adaptive clinical trial? Adaptive clinical trials are designed to use accumulating information to determine how to modify the trial as it progresses. There are many ways a trial can be adapted, but a unifying principle is that the adaptive aspects of the trial are predetermined before the trial begins and are a part of the
planned design. Additionally, the modifications in an adaptive trial must maintain the validity and integrity of the trial. As described by Dragalin (3), maintaining validity means providing correct inference (e.g., adjusted estimates of confidence intervals and p-values), and maintaining integrity refers to providing convincing evidence to the scientific community. Adaptive designs are generally thought of as occurring in stages, where at the end of each stage a possible adaptation of the trial may occur. There may be only two stages (allowing just one opportunity for adaptation), or the stages may be nearly continuous, so that adaptation may occur after each patient is enrolled or observed for his or her outcome. Although some have come to believe that adaptive designs are Bayesian, the class of adaptive designs is much broader and encompasses many frequentist and Bayesian designs alike. For example, Simon's two-stage design is adaptive because it allows an interim analysis for futility stopping, yet it is based on classical frequentist hypothesis testing with type I and type II errors (4). The same is true of group sequential designs (5), which are used to determine early stopping in randomized trials, and of the biased coin design for adaptive allocation (6). In recent years, a number of novel Bayesian adaptive designs have gained popularity in the biostatistics and clinical trials communities, leading to their more widespread use. However, frequentist adaptive designs are still arguably more popular and prevalent due to their long history and acceptance.
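To illustrate the frequentist flavor of adaptivity, the operating characteristics of a Simon-style two-stage futility rule can be computed directly from binomial probabilities. The sketch below is our own illustration; the design parameters (r1 = 1, n1 = 13, r = 4, n = 27) and the response rates p0 = 0.10 and p1 = 0.30 are hypothetical, not taken from a published design table.

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def early_stop_prob(p, r1, n1):
    # Probability of stopping for futility after stage 1:
    # at most r1 responses among the first n1 patients.
    return sum(binom_pmf(x, n1, p) for x in range(r1 + 1))

def accept_prob(p, r1, n1, r, n):
    # Probability the drug is declared promising:
    # more than r1 responses in stage 1 AND more than r responses overall.
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # chance that stage-2 responses push the total above r
        p_stage2 = sum(binom_pmf(x2, n2, p)
                       for x2 in range(max(0, r - x1 + 1), n2 + 1))
        total += binom_pmf(x1, n1, p) * p_stage2
    return total

# Hypothetical design: stop if <= 1/13 responses; promising if > 4/27 overall.
r1, n1, r, n = 1, 13, 4, 27
p0, p1 = 0.10, 0.30  # assumed null and alternative response rates

alpha = accept_prob(p0, r1, n1, r, n)  # type I error
power = accept_prob(p1, r1, n1, r, n)  # 1 - type II error
pet0 = early_stop_prob(p0, r1, n1)     # early-termination probability under p0
```

The early-termination probability under the null is what makes the design attractive: most ineffective drugs are abandoned after only n1 patients.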
Adaptive treatment assignment designs were discussed in the context of clinical trials in 1963 by Anscombe (7), who, along with others (8), had philosophical problems with the Neyman-Pearson paradigm used for trial design. Simon (9) discusses these early innovators of adaptive design (including Zelen's "play the winner" [10]) and some reasons why adaptive trial designs were not being implemented even in the mid-1970s, despite the numerous designs that had been proposed for adaptive treatment assignment. Now, more than 30 years later, statisticians continue to work on methods for improving trial design, with better yet more complex designs regularly published in prominent journals. Some of the same barriers that Simon (9) discussed still face us today. First, most major clinical trials do not have one sole goal. Although the primary goal may be choosing the more efficacious treatment, efficacy is generally weighed against toxicity, so that a treatment that is significantly more toxic might not be chosen over one that is slightly less efficacious but better tolerated. Another significant issue is that the common paradigm for evaluating treatments is the test of the null hypothesis of no treatment difference. Approaches that do not provide some evidence for testing the null hypothesis are not convincing to many in the field; for example, methods that do not provide a p-value have had difficulty being adopted because of the preponderance of frequentist hypothesis testing in clinical trials. The last barrier mentioned here (additional barriers are discussed by Simon) is the difficulty of convincing clinical colleagues to adopt these designs in practice.
This has been seen in the past several decades with adaptive dose-finding designs such as the continual reassessment method (11), which have been shown to offer substantial improvements over traditional designs; clinical researchers are nevertheless reluctant to implement them without a full understanding of the mathematical details. As a result, trials tend to be designed using more traditional approaches even though they are less efficient in a number of ways. Recognizing many of the difficulties in implementing adaptive trials, the Pharmaceutical Research and Manufacturers of America (PhRMA) formed the Adaptive Designs Working Group (ADWG) in 2005. The goal of this group is to facilitate and foster wider usage, regulatory acceptance, and clinical development of adaptive trials (12, 13). The ADWG has taken a fact-based approach to evaluating the benefits and challenges of these designs and has published numerous papers on the topic, most notably in volume 17, issue 6 of the Journal of Biopharmaceutical Statistics (2007), which was devoted entirely to the subject (14).
There are a number of ways to adapt trials. The most commonly used adaptive approaches in oncology clinical trials will be discussed in this chapter and are as follows:

1. Covariate-adaptive allocation: patients are assigned to treatment arms in a way that ensures balance of patient factors, such as age, stage of disease, or disease subtype.
2. Response-adaptive allocation: patients are assigned to treatment arms based on the accumulating information on success in each arm. Patients are more likely to be assigned to the arm with the greatest success.
3. Adaptive dose escalation: doses for future patients on a trial are determined based on the outcomes of patients treated earlier in the trial.
4. Adaptive sample size: the sample size in one or more arms is determined based on the accumulating evidence of success and variability, to be able to achieve a convincing result at the end of the trial.
5. Adaptive stopping: the trial incorporates one or more interim analyses to allow for early stopping based on evidence of futility or efficacy of one or more arms.

This chapter will touch on a number of these approaches, and additional references are provided for more detailed discussions. The reader is also encouraged to consult other chapters in this book, especially Chapters 8 (Design of Phase I Trials), 9 (Design of Phase II Trials), 14 (Bayesian Designs in Clinical Trials), and 19 (Interim Analysis of Phase III Trials), which describe some of these adaptive designs in the context of competing approaches.
ADAPTIVE ALLOCATION

Adaptive allocation designs assign patients to treatment arms based on information collected on patients already in the trial. Two types of adaptive allocation are discussed in the following subsections. For simplicity, adaptive allocation in the case of just two treatment groups will be considered, but the methods extend quite easily to more than two treatment groups.

Covariate-Adaptive Designs

Covariate-adaptive designs use covariate (i.e., patient characteristic) information on the patient being assigned and on the patients previously assigned to
treatment arms to determine the treatment assignment. The goal is to achieve balance of covariate factors across the arms so that comparisons of treatment effects will be untainted by confounding. The permuted block within strata design, although not adaptive, is often used when covariate balance is desired. In this approach, treatment assignment is balanced after each block of patients within a stratum, and the assignments within blocks are predetermined before the trial begins. Although permuted block within strata designs can achieve balance with relatively simple implementation, there is concern over the lack of randomness, which can lead to selection bias. This is especially problematic when there are a relatively large number of strata. With many strata, the block size is usually necessarily small (two or four), so that after just one patient in a block of size two, the assignment for the next patient in the stratum is predetermined. Efron (6) introduced the biased coin design, which offers a simple approach when balance on a single covariate is desired. In his original paper, Efron used the real example of patients with Hodgkin's disease who were randomized to treatment and control. The clinical investigator wanted to ensure balance across age groups (ages 10-19, 20-29, 30-39, and 40-49), but with a sample size of only 29, there was a high likelihood of imbalance under simple randomization. Efron acknowledged the need for some degree of uncertainty in the allocation and sought a compromise between balance and complete randomization. In its simplest form, as in our example, the biased coin design determines the age category of the next subject and calculates the number of patients in that age category already assigned to the treatment group and the number assigned to the control group. The difference between these two (D) is the excess number of patients in the treatment group relative to the control group.
The biased coin design assigns the new patient according to the following rule, where 0.50 < p < 1.0:

If D > 0, assign treatment with probability 1 − p and control with probability p.
If D = 0, assign treatment with probability 1/2 and control with probability 1/2.
If D < 0, assign treatment with probability p and control with probability 1 − p.

Efron suggested p = 2/3, which has proven to be a popular choice. Notice that when p = 1/2 there is complete randomization (i.e., the value of D is irrelevant), and if p = 1 the permuted block design is implemented.
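The rule above can be sketched in a few lines of code; this is our own illustration of Efron's allocation rule within a single stratum, with hypothetical function and variable names.

```python
import random

def biased_coin_assign(n_treat, n_control, p=2/3, rng=random):
    """Efron's biased coin: return 'T' or 'C' given the current counts
    of treatment (n_treat) and control (n_control) patients in this stratum."""
    d = n_treat - n_control  # excess patients on the treatment arm
    if d > 0:
        prob_treat = 1 - p   # under-weight the over-represented arm
    elif d < 0:
        prob_treat = p
    else:
        prob_treat = 0.5     # balanced so far: fair coin
    return "T" if rng.random() < prob_treat else "C"

# Allocate 29 patients within a single age stratum.
rng = random.Random(1)
counts = {"T": 0, "C": 0}
for _ in range(29):
    arm = biased_coin_assign(counts["T"], counts["C"], rng=rng)
    counts[arm] += 1
```

Simulating many such trials and comparing the average absolute imbalance against a fair coin (p = 1/2) reproduces the qualitative behavior described in the example that follows.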
The biased coin design has since been extended to more complex situations. Several researchers have proposed designs that vary p with the degree of imbalance (15, 16) and with the sample size (17, 18). Another approach to balancing covariates using adaptive covariate allocation is minimization (19). A significant advantage of minimization over Efron's biased coin design is the ability to balance on multiple covariates at the same time. The imbalance on each covariate is measured and then combined (usually summed) to get an overall measure of imbalance. The probability of assignment for a new patient is based on improving balance across all factors. Note that assignment is not deterministic: a new patient still has a positive probability of being assigned to each arm in the study, but the assignment probabilities favor the treatment that minimizes imbalance, hence the name minimization. Atkinson (20) and others extended the approach of Simon and Pocock. Although this has not been a recent area of methodological development, Atkinson (21) discusses the more modern approaches to covariate-adaptive randomization.

Example: Efron's Biased Coin Randomization

Recall the example presented by Efron (6). To show the effect of covariate-adaptive randomization, the trial with 29 patients assigned to two treatments (T [treatment] vs. C [control]), in which each patient belongs to one of four age groups, is revisited. The age groups had 9, 10, 7, and 3 patients, respectively. Two sets of trials were simulated, with 10,000 trials per set. In the first set, age categorization was ignored and patients were assigned to T versus C with equal probability. The second set of simulations implemented Efron's biased coin adaptive randomization to improve the age balance across the treatment groups.
More specifically, as a patient was enrolled, his or her age category was noted and the treatment assignments already made in that age category were identified. If more patients in the age category had been assigned to T than to C, the patient was assigned to C with probability 2/3 and to T with probability 1/3. If more patients had been assigned to C than to T, the patient was assigned to T with probability 2/3 and to C with probability 1/3. If there were equal numbers in T and C, the patient was assigned to T or C with equal probability (i.e., 0.50). In Figure 37.1, the imbalance is shown for the two sets of simulations for each age category (with the number of patients in each category shown at the top), where
FIGURE 37.1
Comparison of imbalance using a simple randomization scheme (black) vs. Efron's biased coin design (gray), based on simulations of a lymphoma clinical trial with two treatments and four age groups with 9, 10, 7, and 3 patients in the age groups. Imbalance is defined as the difference in the number of patients in the treatment versus the control group at the end of the study. The thick section of each vertical bar represents the 25th to the 75th percentile; thin sections extend to the 5th and 95th percentiles.
imbalance is defined as the difference in the number of patients assigned to T and C at the end of the trial. Equal numbers are assigned to T and C when the imbalance is 0, and the maximum imbalance occurs when all patients in a stratum are assigned to either T or C. Within each age category there are black and gray plotting symbols, similar to boxplots, describing the distribution of imbalance across the trials. Black corresponds to the unrestricted randomization (i.e., ignoring age) and gray to the covariate-adaptive randomization. The thick portion of each vertical line represents the interquartile range (IQR, defined as the 25th to the 75th percentile). The thinner portions of the vertical lines extend to the 5th and 95th percentiles, to show the range of imbalance for the large majority of trials. The narrower gray lines (compared with black) demonstrate the improved balance of the biased coin design, which is seen rather clearly in the younger three age groups. The oldest age group has only three patients; with such a small number it is no surprise that the biased coin may not improve allocation.

Response-Adaptive Designs

Response-adaptive designs are among the more commonly recognized innovative designs in oncology research. In a response-adaptive design, a new patient
is assigned to a treatment arm based on the relative success of the patients treated thus far on the trial, where patients are more likely to be randomized to the arm with the greatest success (e.g., the arm with the highest response rate). As above, we restrict discussion to just two treatment groups, but the methods are easily extended to three or more groups. Response-adaptive designs can be frequentist or Bayesian, and we discuss one of each here in some detail. Zelen (10) was among the first to acknowledge the benefits of favoring the seemingly better treatment in multiarm trials with his "play the winner" design. The original allocation rule was designed for trials with two treatments and a binary outcome (i.e., success/failure). If patient n − 1 is assigned to treatment A and has a successful outcome, then patient n is also assigned to A; but if patient n − 1 has a failure, then patient n is assigned to treatment B. This design minimizes the number of patients who receive the inferior treatment. A notable and controversial implementation of the "play the winner" design was the extracorporeal membrane oxygenation (ECMO) trial (22, 23, 24), a trial of ECMO versus standard therapy for newborn infants with persistent pulmonary hypertension. When implemented, the trial treated nine patients on the ECMO arm and only one on the control arm. All nine ECMO patients were successes and the one control-treated patient was a failure, leading to a number of concerns about the strength of evidence, how to interpret the results, and the ethics of implementing such a design. This class of designs has been extended by a number of authors (25, 26, 27). The name "play the winner" is generally applied to this type of trial design, although it may now be considered a misnomer given the implementation of the assignment rules. Additional references are available for frequentist response-adaptive randomization schemes (28, 29, 30). Thall et al.
(31) describe a Bayesian adaptive process in a way that is accessible to nonstatisticians and refer to the adaptive design as a compromise between flipping a coin for treatment assignment and allowing the patient or physician to choose the treatment he or she receives. See also Berry et al. (32) for more discussion of Bayesian adaptive designs. Thall et al. (31) and Chapter 14 of this text provide overviews of Bayesian statistics in clinical trials that may be useful to the reader; space does not permit a full treatment here. Briefly, the Bayesian paradigm requires information from two sources to provide inference: (1) the prior, and (2) the likelihood function. In our context, the prior is a probability distribution that quantifies the beliefs the investigators have
before the trial begins. More specifically, in a response-adaptive trial, investigators may want to assume a priori that all treatments are equally effective, and so the prior would reflect equal efficacy. As a result, in a two-arm trial, the first patient would have a 50% chance of being assigned to arm A and 50% to arm B. The likelihood function is a product of probability distributions based on the data observed and represents the information in the data. The combination of the prior and the likelihood (literally, the product of the prior and the likelihood) is called the posterior distribution and is what is used for inference. The more data are collected, the more weight is placed on the likelihood relative to the prior in the posterior distribution. For example, at the beginning of the trial, when no patients have been treated, there are no data, and so the posterior is composed only of the prior. As the trial progresses and the sample size becomes large, the influence of the prior on the posterior usually becomes negligible (this, however, depends on how informative the prior is chosen to be). To illustrate Bayesian adaptive randomization, consider the randomized phase II trial of gemcitabine (G) alone versus gemcitabine plus docetaxel (G + D) in patients with metastatic soft tissue sarcomas (33). This presentation is simplified somewhat from the original paper, in which covariates were included in the adaptation. Treatment was considered a success if the patient achieved a complete or partial response at the end of two, four, six, or eight cycles of treatment; a failure was defined as disease progression or death within eight cycles. The probability of response is denoted by pR and the probability of failure by pF (note that stable disease is considered neither a success nor a failure). The authors gave different weights to response and failure and defined the success of G + D as qG+D = pR(G+D) + 1.3pF(G+D) (with the success of G, qG, defined similarly).
The crux of the adaptive randomization is the randomization probability which, in this example, is pG+D, so that patients are adaptively randomized based on the probability:

    pG+D = P(θG+D > θG | data)^c / [ P(θG+D > θG | data)^c + P(θG > θG+D | data)^c ]     (Eqn. 37-1)
Notice the tuning parameter c, which is introduced to stabilize the assignments, especially at early time points in the trial. In practice, c = 1/2 is recommended, or c may be made a function of the current sample size (e.g., n/(2N), where n is the current sample size and N is the final expected sample size). For any value of c > 0, pG+D will be greater than 0.5 when the success of G + D is larger
than the success of G alone, yielding an increasing chance of randomization to the G + D arm versus the G-alone arm. The sarcoma trial enrolled and randomized 122 patients. The first 30 patients were allocated with equal probability to G + D and G. Beginning with the 31st patient, adaptive randomization was implemented. Due to the higher success rate on the G + D arm, by the end of the trial 60% of patients (N = 73) had been assigned to G + D and 40% (N = 49) to G.

Example: Bayesian Adaptive Randomization (BAR)

To elucidate the implementation of BAR, a simulated example is presented with increasing sample sizes but the same observed probability of success. In this example, the outcome of interest is the response probability (i.e., response rate). In all four examples shown in Figure 37.2, treatments A and B have observed response rates of 0.33 and 0.45, respectively, and we assume equal efficacy as our prior. Panel A of Figure 37.2 shows posterior distributions for the response rates of A and B with small sample sizes: NA = 12, NB = 11. Based on the observed data, it is possible to calculate the probability that the response rate of treatment B is greater than that of treatment A (P(B > A)). In panel A, due to the small sample sizes, although the observed response rate of B (0.45) is greater than that of A (0.33), the probability that the true response rate is greater in B than in A is only 0.72. The mathematical details of this calculation are omitted (34), but intuitively, the probability of superiority relates to the amount of overlap in the two distributions. In panel B, the sample sizes increase to 24 and 33, yet the response rates are the same (0.33 and 0.45). However, due to the increased amount of information from the larger sample sizes, the probability that the true response rate of B is greater than that of A increases to 0.81. Panels C and D are interpreted similarly, with increasing sample sizes leading to probabilities of 0.92 and 0.98, respectively, for the superiority of B versus A.
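The superiority probability and the randomization probability of Eqn. 37-1 can be sketched with a Monte Carlo draw from Beta posteriors. This is our own illustration, assuming independent Beta(1, 1) priors and rounded response counts (4/12 and 5/11 for panel A; 8/24 and 15/33 for panel B); it is not code from the sarcoma trial.

```python
import random

def prob_b_beats_a(resp_a, n_a, resp_b, n_b, sims=200_000, seed=42):
    """Estimate P(theta_B > theta_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(sims):
        # Posterior draws: Beta(1 + responses, 1 + non-responses)
        theta_a = rng.betavariate(1 + resp_a, 1 + n_a - resp_a)
        theta_b = rng.betavariate(1 + resp_b, 1 + n_b - resp_b)
        wins += theta_b > theta_a
    return wins / sims

def randomization_prob(p_superior, c=0.5):
    """Eqn. 37-1: probability the next patient is assigned to arm B."""
    return p_superior**c / (p_superior**c + (1 - p_superior)**c)

p_small = prob_b_beats_a(4, 12, 5, 11)    # roughly panel A
p_large = prob_b_beats_a(8, 24, 15, 33)   # roughly panel B
```

With c = 1/2 the assignment probability is pulled toward 0.50, matching the conservatism of the smaller tuning parameter described in the text.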
Of potentially greater interest is the randomization probability. For each scenario shown in Figure 37.2, the next patient in the trial (e.g., in panel A, the 24th patient randomized) would be assigned to A or B based on the calculated assignment probability. Thall et al.'s proposal for this, with tuning parameter c, was shown earlier in this section as Eqn. 37-1. In panels A-D of Figure 37.2, the assignment probability for the next patient is equivalent to the probability that B is greater than A when c = 1. Note that Thall et al. suggest using c = 1/2 to avoid potential instability in the assignment probabilities; this choice is also shown in panels A-D. As the sample size
FIGURE 37.2
Demonstration of randomization probabilities for a Bayesian adaptive design. Each panel has a 33% response rate in arm A and 45% in arm B. Solid (A) and dashed (B) lines show the posterior distributions for the response rate (probability of success). Panels A through D vary in their sample sizes, although the response rates are the same. The randomization probability for the next patient is shown for tuning parameter c = 1 and for c = 1/2.
FIGURE 37.3
Continual reassessment method prior and updated curves. Solid and dashed lines show examples of two prior dose-toxicity curves. Points of intersection with the horizontal line at y = 0.40 show the recommended doses for each prior and for each updated DTC. The recommended dose for the first patient under the solid-line prior is 4.1. Dotted lines show the estimated DTC if the first patient has a DLT (dotted line to the left of the solid line: recommended dose = 3.8) and if the first patient does not have a DLT (dotted line to the right of the solid line: recommended dose = 4.5).
increases, these assignment probabilities tend to move closer to 0 or 1 when there is a true difference in the response rates of the treatment groups. (Note: if A were superior to B, the assignment probability would decrease toward 0 as the sample size increased.) When c = 1/2, the randomization is more conservative: the assignment probabilities are closer to 0.50 than when c = 1.

ADAPTIVE DOSE ESCALATION

In 1990, O'Quigley et al. introduced the continual reassessment method, a Bayesian adaptive dose-finding design (11). The first of its kind, it changed the way many statisticians (and some clinicians) approached phase I trials in oncology. Despite a number of safety concerns, the general approach described by O'Quigley has been adopted and shown to be superior to the algorithmic dose-finding designs that continue to dominate phase I trials in oncology. Among the reasons the continual reassessment method (CRM) has been embraced by clinical trialists are that it tends to incur fewer toxic events and that it estimates the maximum tolerated dose more accurately than the standard phase I dose escalation designs. Other adaptive approaches will also be discussed in this chapter, but Chapter 8 of this book describes the design of phase I studies more generally. For more detail about other types of phase I studies (including the algorithmic designs), refer to that chapter.

The CRM

The CRM uses information about patients already treated at various doses of a drug, and their outcomes, to determine the optimal dose for the next patient. Given the context of dose finding in oncology, the outcome of interest is most often the occurrence of a dose-limiting toxicity (DLT). Although this may differ from trial to trial, one common definition of a DLT is any grade 3 or 4 toxicity, as defined by the Common Terminology Criteria for Adverse Events (CTCAE) (35).
The general idea behind the original CRM is that a prior dose-toxicity curve (DTC) is assumed and a desired toxicity rate is chosen. The estimated DTC is updated after each patient's toxicity outcome is observed, so that each patient's dose is based on information about how previous patients tolerated the treatment. For example, in Figure 37.3, a prior DTC is shown by the solid line, where it can be seen that at the lowest dose level (dose level = 1) the probability of a DLT is quite close to 0. As the dose level increases, the chance of a DLT increases. If it is assumed that the
desired level of toxicity is 0.40 (meaning it would be acceptable if 40% of patients had a DLT), then, according to this DTC, dose level 3 would be the optimal dose. Assume that the solid curve in Figure 37.3 is the prior representing the best guess at the dose-toxicity relationship. The first patient in the trial would receive dose level 3 and would be observed for a DLT. If it were instead assumed that the dashed curve was the prior for the DTC, dose level 1 or 2 (or a dose in between) would be selected as the starting dose. Just as in the previous section, the prior here is based on a probability model. After the first patient has been treated at the dose chosen using the prior DTC, whether or not he or she experienced a DLT is recorded and combined with the prior DTC to get a better estimate of the true curve. As discussed in the previous section (Response-Adaptive Designs), the prior is combined with the likelihood to provide the posterior inference. When the curve is updated, a new estimated dose-toxicity curve is obtained based on this posterior. It likely looks much like the prior curve, but shifted slightly up or down, depending on whether or not the first patient experienced a DLT. Updating with just one data point might not seem like much information, but as the number of patients accumulates, the DTC estimate is based almost exclusively on the observed data and may look very little like the prior. Using the updated DTC, the current best estimate of the optimal dose can be obtained. By the earlier definition, this is the dose associated with a toxicity rate of 0.40. So, as before, the updated dose-toxicity model can be graphed and the dose associated with an estimated toxicity rate of 0.40 can be found. In Figure 37.3, the dotted lines represent the two ways the curve could have been updated: (1) the first patient could have experienced a DLT, in which case the updated DTC falls slightly to the left of the solid a priori curve; or (2) the first patient could have avoided a DLT, in which case the updated DTC falls slightly to the right of the a priori curve. If the former had occurred, the next patient would be treated at a dose of 3.8; if the latter, at a dose of 4.2. The trial continues in this fashion: after each patient is treated, the DTC is reestimated and the next patient is treated at the new best estimate of the optimal dose. Patients continue to be treated until some level of certainty is achieved (stopping rules will be discussed in more detail in the next section). This is the general paradigm of O'Quigley et al.'s CRM.
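The updating cycle just described can be sketched with a one-parameter power model and a discrete grid posterior. This is our own illustration, not the dose levels or prior from Figure 37.3: the skeleton values, prior standard deviation, and grid settings are all assumed, with the target toxicity rate set to 0.40 as in the text.

```python
import math

# Assumed skeleton: prior guesses of DLT probability at each dose level.
SKELETON = [0.05, 0.15, 0.30, 0.40, 0.55]
TARGET = 0.40  # desired toxicity rate

def posterior_dose(data, grid_width=3.0, grid_points=201, prior_sd=1.34):
    """Recommend the next dose index given data = [(dose_index, had_dlt), ...].
    Model: P(DLT at dose i) = SKELETON[i] ** exp(a), with a ~ N(0, prior_sd^2)."""
    step = 2 * grid_width / (grid_points - 1)
    grid = [-grid_width + i * step for i in range(grid_points)]
    # Unnormalized posterior over a: normal prior density times likelihood.
    weights = []
    for a in grid:
        logw = -0.5 * (a / prior_sd) ** 2
        for dose, dlt in data:
            p = SKELETON[dose] ** math.exp(a)
            logw += math.log(p) if dlt else math.log(1 - p)
        weights.append(math.exp(logw))
    total = sum(weights)
    # Posterior-mean toxicity at each dose; pick the dose closest to TARGET.
    post_tox = [
        sum(w * (s ** math.exp(a)) for a, w in zip(grid, weights)) / total
        for s in SKELETON
    ]
    return min(range(len(SKELETON)), key=lambda i: abs(post_tox[i] - TARGET))

# Hypothetical data: three patients without a DLT vs. two patients with DLTs,
# all treated at dose index 3 (prior DLT guess 0.40).
no_dlt = posterior_dose([(3, False), (3, False), (3, False)])
with_dlt = posterior_dose([(3, True), (3, True)])
```

Observed DLTs shift the estimated curve toward higher toxicity and push the recommendation down; DLT-free patients push it up, mirroring the left/right shifts of the dotted curves in Figure 37.3.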
In Figure 37.3, the dotted lines represent the two ways the curve could have been updated: (1) the first patient could have experienced a DLT, in which case the updated DTC would fall slightly to the left of the solid a priori curve; or (2) the first patient could have not had a DLT, in which case the updated DTC falls slightly to the right of the a priori curve. If the former had occurred, the next patient would be treated at a dose of 3.8, and if the latter occurred, at a dose of 4.2. The trial continues in this fashion: after each patient is treated, the DTC is reestimated and the next patient is treated at the new best estimate of the optimal dose. Patients continue to be treated until some level of certainty is achieved (stopping rules will be discussed in more detail in the next section). This is the general paradigm of O’Quigley et al.’s CRM.

Modified CRMs

Since 1990, a number of modifications and extensions of the CRM have been proposed. The original CRM
ONCOLOGY CLINICAL TRIALS
FIGURE 37.4 Example of CRM trial. Patient data is shown with open circles. The recommended dose based on the data observed thus far is shown with a vertical solid line, at the point of intersection of the updated DTC with the horizontal line at 0.30. 90% confidence intervals for the fitted dose-toxicity model are shown with dashed lines.
was criticized on safety grounds, the most concerning issue being the possibility of large dose escalations based on scant data. Goodman et al. (36), Moller (37), and Faries (38) all proposed designs to address these safety concerns, and the resulting modifications yield similar implementations. In the interest of space, we describe just Goodman et al.’s modifications. Unlike O’Quigley’s CRM, which proposed treating one patient per dose level, Goodman’s modified CRM treats more than one patient per dose level (three are recommended), does not allow large dose increments (by predefining a series of dose increments), and proceeds according to a standard 3+3 design until the first DLT is observed. This last characteristic allows the modified CRM to be non-Bayesian by not requiring a prior distribution. Instead, once the first DLT is observed, a maximum likelihood estimation approach is used to estimate the dose for the next cohort of patients. Piantadosi et al. (39) proposed a practical version of the CRM in which instead of a formal
Bayesian approach, pseudo-data is used to define what the investigators believe the DTC to be a priori. The pseudo-data is used to determine the first dose given to patients, and then continues to be used, in combination with the data observed on patients in the trial, in choosing the dose for future cohorts. This approach allows the pseudo-data to be downweighted so that it carries less importance in dose selection than the toxicity outcomes actually observed on trial patients. Additional adaptive dose-finding designs have been proposed that allow escalation on both toxicity and efficacy end points, where efficacy is generally measured by a biologic (not clinical) end point (40, 41, 42). Other CRM proposals detail criteria for stopping rules (43, 44) and monitoring to avoid allocating patients to overly toxic doses (43). See Garrett-Mayer (45) for more discussion of the CRM and its implementation.

Escalation with Overdose Control (EWOC) is another Bayesian dose escalation design (46). It is similar to the CRM, but has the additional component of
37 ADAPTIVE CLINICAL TRIAL DESIGN IN ONCOLOGY
incorporation of a loss function, which constrains the dose assignment so that the predicted proportion of patients who receive an overdose cannot exceed a specified value. Unlike the CRM, it treats giving an overdose as a more serious error than giving an underdose. Additionally, it has the attractive property of being optimal within the class of feasible designs.

Example: Dose-Finding Study of 153Sm-EDTMP in Patients with Poor-Prognosis Osteosarcoma Using the CRM

A dose-finding trial was implemented in a pediatric population of patients with high-risk osteosarcoma to determine the maximum tolerated dose of 153Sm-EDTMP (samarium) (47). In this trial, the target DLT rate was 30% and Goodman’s modified CRM was used with cohorts of size two and a one-parameter dose-toxicity model. The first dose was to be 1.0 mCi/kg, increasing by 40% at each dose level up to a maximum dose of 4.0 mCi/kg. That is, regardless of the dose proposed by the CRM, the dose increment could not exceed 40% of the last dose tried. Figure 37.4 shows the results from the trial. Neither of the first two patients (treated at 1.0 mCi/kg) had a DLT. According to the design, the next two patients were treated at a 40% increment and received 1.4 mCi/kg of study drug. Of those two patients, one had a DLT. The data for the first two cohorts (i.e., the first four patients) is shown in panel A. The best-fitting model is shown as a solid black line, with 90% confidence intervals displayed as dashed lines. To find the dose for the 5th and 6th patients, we find the dose at which the DTC has a DLT rate of 0.30. This turns out to be 1.4 mCi/kg, which, coincidentally, was the dose tried in the previous cohort. In the third cohort (panel B), one patient has a DLT and one does not, so that at 1.4 mCi/kg we now have two DLTs and two patients without DLTs.
The dose for the 4th cohort (patients 7 and 8) is found by fitting the DTC to the first six patients; it is 1.20 mCi/kg, as seen by the vertical line in panel B. With one DLT and one non-DLT at 1.20 mCi/kg, the next dose is chosen to be 1.065 mCi/kg, and neither the 9th nor the 10th patient experiences a DLT at this dose (panel D). The last dose tried was 1.24 mCi/kg, where three patients were treated: one had a DLT and the other two did not. The 13th patient was the last enrolled on this trial; based on the accumulated information, the model was refit a final time, and the estimated dose for the next cohort would be 1.21 mCi/kg, at which the 90% confidence interval for the true DLT rate is (15%, 54%).
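The 40% escalation cap used in this trial can be expressed as a simple rule. The helper below is hypothetical (our own sketch, not code from the study): the dose actually given is the model’s recommendation, truncated so that it never exceeds 140% of the last dose tried or the 4.0 mCi/kg maximum, while de-escalation is left unrestricted.

```python
def next_dose(model_dose, last_dose, max_increment=0.40, dose_cap=4.0):
    """Dose for the next cohort: the CRM-recommended dose, but escalating by
    at most 40% over the last dose tried and never above 4.0 mCi/kg."""
    return min(model_dose, last_dose * (1 + max_increment), dose_cap)

# Even if the model recommends 2.0 mCi/kg after a clean first cohort at
# 1.0 mCi/kg, the cap limits the second cohort to 1.4 mCi/kg.
dose_cohort_2 = next_dose(2.0, 1.0)
```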
ADAPTIVE SAMPLE SIZE AND STOPPING

Adaptive sample size and stopping designs encompass a wide range of designs and are used in most oncology trials. This may seem surprising, but any study that has a formal design-oriented approach for early stopping, whether it be for safety, futility, or efficacy, is considered adaptive for sample size. Sequential designs have been in frequent use for decades (48, 49). The classical approach to early stopping is the group sequential design, of which there are many variants (5) and which can be used in single- and multiarm clinical trials. These designs are commonly seen, especially in phase III oncology studies, and are described in detail in Chapter 19 (Interim Analysis of Phase III Trials). The remainder of this section will focus on other types of adaptive sample size and stopping designs.

Single-Arm Adaptive Designs

The most common phase II single-arm study design used in oncology research is the Simon two-stage design (4). It is based on a binary outcome, such as response, and divides the study into two stages, allowing one early look for futility. That is, if (early in the study) there is sufficient evidence that the null hypothesis is unlikely to be rejected, you can terminate the study early. For example, if you are expecting a response rate of 0.40, but in the first 15 patients you see no responses, you would want to stop early. In stage 1, N1 patients are enrolled: if r or fewer respond, then the study is stopped. If greater than r patients respond, an additional N2 patients are enrolled. If the study proceeds to stage 2, there is a decision rule for determining whether or not the null hypothesis should be rejected. Due to the early look, p-values and confidence intervals need to be adjusted (50). It is a frequentist, hypothesis testing-based approach, so type I and type II errors are selected and null and alternative hypotheses are chosen.
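The operating characteristics of a given two-stage design follow directly from binomial probabilities. The sketch below (function names are our own) computes them for design 1 of Table 37.1; the stage-two rejection cutoff, more than 6 responses in 35 patients, is not stated in the table and is taken from Simon’s published tables for the p0 = 0.10, p1 = 0.30 scenario:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def simon_ocs(n1, r1, n, r, p):
    """Operating characteristics of a Simon two-stage design at true response
    rate p: stop early if <= r1 responses among the first n1 patients;
    otherwise enroll to n total and reject H0 if more than r total responses."""
    pet = binom_cdf(r1, n1, p)                 # probability of early termination
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):           # stage-1 outcomes that continue
        p_x1 = comb(n1, x1) * p**x1 * (1 - p)**(n1 - x1)
        if x1 > r:
            reject += p_x1                     # already past the final cutoff
        else:                                  # need > r - x1 stage-2 responses
            reject += p_x1 * (1 - binom_cdf(r - x1, n - n1, p))
    expected_n = n1 + (1 - pet) * (n - n1)
    return pet, expected_n, reject

# Design 1 (minimum expected N under H0): N1 = 18, r = 2, total N = 35.
pet0, en0, alpha = simon_ocs(18, 2, 35, 6, 0.10)   # behavior under H0
pet1, _, power = simon_ocs(18, 2, 35, 6, 0.30)     # behavior under H1
```

Running this reproduces the table’s entries for design 1: an early-termination probability of about 0.73 under H0 (but only 0.06 under H1) and an expected sample size of about 22.5 under H0.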
In a single-stage design with a binary outcome, this information would be sufficient to identify the optimal design. However, with the Simon two-stage design, there are numerous designs (how many depends on the design parameters) that satisfy the stated criteria. As a result, Simon suggested several criteria for choosing among the alternatives. One option is to choose the design that minimizes the expected sample size if the null hypothesis is true. Another is to choose the design that has the smallest maximum sample size (i.e., the smallest sample size at the end of stage 2). Table 37.1 presents the Simon two-stage designs chosen by these criteria where the null hypothesized response rate is 0.10, the alternative is 0.30, alpha is 0.05, and power is 90%. Design 1 minimizes the expected sample size under the null, which is 22.5 compared to the
TABLE 37.1
Two Simon Two-Stage Design Options for H0: p = 0.10; H1: p = 0.30. Designs Assume Alpha of 0.05 and Power of 0.90.

DESIGN                            N1    r    N1 + N2    EXPECTED N      PROBABILITY OF STOPPING    PROBABILITY OF STOPPING
                                                        IF H0 IS TRUE   EARLY IF H0 IS TRUE        EARLY IF H1 IS TRUE
1. Minimum expected N under H0    18    2    35         22.5            0.73                       0.06
2. Minimum N1 + N2                22    2    33         26.2            0.62                       0.02
expected sample size under the null in design 2, which is 26.2. However, notice that the total sample size if stage 2 is reached is 35 in design 1, but only 33 in design 2. Other important characteristics to consider when choosing a design are the early stopping probabilities under the two hypotheses. For design 1, there is a higher chance of stopping early if the null hypothesis is true (0.73 vs. 0.62), but there is also a higher chance of stopping early if the alternative is true (0.06 vs. 0.02). More recently, Lee and Liu (51) proposed a Bayesian predictive probability design. The goal was to develop an efficient and flexible design that possesses desirable statistical properties, among them the ability to quantify the type I and type II errors, even though it is a Bayesian design. Unlike the Simon two-stage design, there are more opportunities for stopping; but because the design is not rooted in the frequentist paradigm, its operating characteristics do not suffer. The general idea is to monitor the trial after every k patients (where k can take the value of 1 or greater, and can even vary throughout the trial) and estimate the predictive probability (PP) that the treatment will be considered efficacious by the end of the study. If PP is quite large or small (e.g., >0.95 or <0.05), even early in the trial, the study can be stopped. In practical implementations, stopping is considered only for futility, so only small values of PP would cause the trial to stop early: in a single-arm study, if the treatment shows promise, we would like to continue to enroll patients, and there is no ethical dilemma of randomizing patients to a less effective arm.

Multiarm Adaptive Stopping and Sample Size Designs

There are a number of ways that early stopping and sample size adaptation can be implemented in multiarm, comparative studies. In adaptive stopping and
sample size studies, there are one or more interim evaluations of the trial’s progress to determine whether (a) the trial should proceed, (b) the trial should stop, or (c) the sample size of the trial should be modified to better achieve the objectives of the study. As mentioned above, stopping rules are quite common in oncology treatment trials, whether single-arm or multiarm, and a number of these are among the classic frequentist group sequential designs discussed in Chapter 19 (Interim Analysis of Phase III Trials). In this section, we will focus on (a) Bayesian stopping, (b) sample size adjustment, and (c) phase II/III studies. In the interest of space, each will be described briefly.

Bayesian Stopping Rules in Phase III Studies

Muss et al. (52) implemented a Bayesian adaptive design that allowed for early discontinuation: the initial design specified between 600 and 1,800 patients, with interim analyses (after N = 600, 900, 1,200, and 1,500 enrollments) performed to determine the strength of evidence in order to allow continuation or stopping (53). The trial compared capecitabine versus standard chemotherapy in women with stage I, II, IIIA, or IIIB breast cancer in the adjuvant setting. The primary endpoint was disease recurrence or death, compared between groups using a hazard ratio (HR). Unlike standard interim analyses, in which stopping boundaries are predefined and results are announced when a boundary is crossed, the decision to discontinue enrollment was based on the likelihood that future enrollments would provide a meaningful result. After 600 enrollments, the data were analyzed for predicted futility and the estimated HR was 0.53, favoring standard chemotherapy. More importantly in terms of the stopping rules, there was significant evidence that continued follow-up would be futile in terms of showing that capecitabine
was not inferior to standard chemotherapy. As a result, enrollment was terminated.

Sample Size Adjustment in Phase III Studies

Sample size adjustment is incorporated in studies when, at the outset, investigators acknowledge that planning a trial rests on many assumptions, some of which may be quite precarious. Allowing sample size adjustment provides the opportunity to reevaluate the appropriateness of the initial assumptions and to update them, decreasing or increasing the number of patients studied. Sequential designs have made significant progress in the area of sample size reestimation, as discussed by Chuang-Stein et al. (54). Distinctions have been made between blinded and unblinded sample size reestimation approaches. As is critical in frequentist approaches that include interim evaluations of the data, control of the type I error needs to be considered. Combination tests are commonly implemented, although alternative methods have been proposed that apply inferiority and superiority rules only at the first stage and then determine the rules for future stages based on those at the earlier stages. Optimized designs have also been proposed that adaptively try to minimize a mixture of expected sample sizes over a range of treatment effect sizes (55, 56, 57). The ADWG recommends minimizing the use of interim looks for sample size modification. This partly reflects their evaluation of frequentist designs, for which the type I error rate increases with each look at the data. Bayesian designs, which do not pay the same type of penalty, are more generous in allowing frequent looks at the data (58).

Seamless Phase II/III Studies

A seamless phase II/III study is designed to achieve the goals of both the phase II and phase III studies in one design (59).
Some obvious benefits of a seamless design are the ability to include patient information from both phases in the overall inferences, the lack of lead time between the learning phase (phase II) and the confirmatory phase (phase III), and the simplicity of implementing just one protocol instead of two (saving additional time and effort in the review by the institutional review board (IRB) and protocol review committees). This is an efficient approach when the investigational treatment is either clearly ineffective (i.e., would stop in the early phase) or has a strong beneficial effect and should clearly move quickly into a phase III design. The design challenges of the phase II/III studies include, in the frequentist setting, controlling the overall type I error rate and the power (59) and include
approaches such as combining p-values. Additional challenges include endpoint selection: response rate is a common endpoint in oncology phase II studies, but survival outcomes (most commonly overall survival [OS]) are the standard endpoints in phase III studies. The seamless phase II/III approach requires either combining these endpoints or choosing a single endpoint that can appropriately be used in both phases. Bayesian approaches have been proposed (60, 61); reference 60 provides an example in patients with unresectable stage III or stage IV lung cancer. All patients were to receive chemotherapy and radiation, and patients were randomized to the addition of an adjuvant adenovirus, Ad-p53, which carries a gene thought to restore apoptosis while sensitizing cancer cells to chemoradiation, versus control (i.e., no adenovirus added to chemoradiation) (60). The proposed Bayesian design defined a phase II stage (up to 8 to 12 months of enrollment) and a phase III stage (12 to 72 months). Stopping rules allowed stopping at any time during the trial based on superiority of one of the arms with respect to patient survival. During the 8- to 12-month period, rules were introduced for deciding whether or not to proceed to phase III. The options included: (a) continuing the phase II portion, (b) organizing the phase III stage, and (c) stopping and concluding that the adenovirus arm is inferior to the control arm. Although Bayesian and frequentist designs differ in their modeling approaches, this approach is standard: “go, no go” decisions should be made on prespecified criteria that dictate whether (and when) the phase III portion of the study should be undertaken.
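The predictive-probability machinery that underlies this kind of Bayesian futility monitoring (and Lee and Liu’s single-arm design described earlier) can be sketched for a binary endpoint with a beta-binomial model. This is an illustrative sketch of the general idea, not the model used in any of the cited trials, and all function names are our own:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_exceeds(a, b, p0, grid=4000):
    """P(p > p0) for p ~ Beta(a, b), by midpoint numerical integration."""
    dp = 1.0 / grid
    total = sum(dp * ((i + 0.5) * dp) ** (a - 1) * (1 - (i + 0.5) * dp) ** (b - 1)
                for i in range(grid) if (i + 0.5) * dp > p0)
    return total / exp(log_beta(a, b))

def predictive_prob(x, n, n_max, p0, theta, a0=1.0, b0=1.0):
    """Predictive probability that, once all n_max patients are observed, the
    posterior will satisfy P(p > p0) > theta, given x responses in n so far.
    Future response counts are averaged over the beta-binomial predictive."""
    a, b = a0 + x, b0 + (n - x)        # current posterior Beta(a, b)
    m = n_max - n                      # patients still to be enrolled
    pp = 0.0
    for y in range(m + 1):             # each possible number of future responses
        pred = comb(m, y) * exp(log_beta(a + y, b + m - y) - log_beta(a, b))
        if prob_exceeds(a + y, b + m - y, p0) > theta:
            pp += pred                 # this future leads to a "successful" trial
    return pp

# Hoping for a 40% response rate but seeing 0 responses in the first 15 of a
# planned 40 patients leaves essentially no chance of eventual success,
# supporting early stopping for futility.
pp_low = predictive_prob(0, 15, 40, 0.40, 0.90)
```

Small values of PP trigger a futility stop, exactly as in the designs above; a trial tracking well (say, 10 responses in 15 patients) yields a PP well above any futility boundary.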
Other Adaptive Approaches

Adaptive approaches in oncology clinical trials encompass a broad range of designs. This chapter only scratches the surface: the approaches discussed here merit more in-depth treatment, and there are additional approaches that could not be introduced due to space. The reader is referred to Jennison and Turnbull (5), Chow and Chang (62), and Berry (2) for further discussion.
References

1. Perlmutter J. Understanding Clinical Trials Design: A Tutorial for Patient Advocates. http://www.researchadvocacy.org/publications/pdf/UnderstandingClinicalTrialDesignTutorial.pdf. 2007. 2. Berry DA. Bayesian statistics and the efficiency and ethics of clinical trials. Statist Sci. 2004;19(1):175–187.
3. Dragalin V. Adaptive designs: terminology and classification. Drug Infor J. 2006;40:425–435. 4. Simon R. Optimal two-stage designs for phase II clinical trials. Cont Clin Trials. 1989 Mar;10(1):1–10. 5. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. New York: CRC Press; 2000. 6. Efron B. Forcing a sequential experiment to be balanced. Biometrika. 1971;58:403–417. 7. Anscombe FJ. Sequential medical trials. J Am Stat Assoc. 1963;58:365–383. 8. Colton, T. A model for selecting one of two medical treatments. J Am Stat Assoc. 1963;58:388–400. 9. Simon R. Adaptive treatment assignment methods and clinical trials. Biometrics. 1977 Dec;33(4):743–749. 10. Zelen M. Play-the-winner rule and controlled clinical trials. J Am Stat Assoc. 1969;64:134–146. 11. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics. 1990 Mar;46(1):33–48. 12. Gallo P, Krams M. PhRMA working group on adaptive designs: introduction to the full white paper. Drug Infor J. 2006;40: 421–423. 13. Gallo P, Chuang-Stein C, Dragalin V, Gaydos B, Krams M, Pinheiro J. Executive summary of the PhRMA working group on adaptive designs in clinical drug development. J Biopharm Stat. 2006;16:275–283. 14. Chow SC (Ed.), J Biopharm Stat. 2007;17(6). 15. Chen YP. Biased coin design with imbalance tolerance. Communs Statist Stochast Mod. 1999;15:953–975. 16. Soares J F, Wu CFJ. Some restricted randomization rules in sequential designs. Communs Statist Theory Meth. 1983;12: 2017–2034. 17. Wei LJ. The adaptive biased coin design for sequential experiments. Ann Statist. 1978;6:92–100. 18. Atkinson AC, Donev A. Optimum Experimental Designs. Oxford: Clarendon; 1992. 19. Pocock SJ, Simon R. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics. 1975 Mar;31(1):103–115. 20. Atkinson AC. 
Optimum biased coin designs for sequential clinical trials with prognostic factors. Biometrika. 1982;69:61–67. 21. Atkinson AC. Optimum biased-coin designs for sequential treatment allocation with covariate information. Stat Med. 1999 Jul 30;18(14):1741–1752; discussion 1753–1755. 22. Royall RM, Bartlett RH, Cornell RG, et al. Ethics and statistics in randomized clinical trials. Stat Sci. 1991;6(1):52–88. 23. Bartlett RH, Roloff DW, Cornell RG, Andrews AF, Dillon PW, Zwischenberger JB. Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics. 1985 Oct;76(4):479–487. 24. O’Rourke PP, Stolar CJ, Zwischenberger JB, Snedecor SM, Bartlett RH. Extracorporeal membrane oxygenation: support for overwhelming pulmonary failure in the pediatric population. Collective experience from the extracorporeal life support organization. J Pediatr Surg. 1993 Apr;28(4):523–528; discussion 528–529. 25. Andersen D, Faries J, Tamura R. A randomized play-the-winner design for multiarm clinical trials. Comm Stat Theory Methods. 1994;23:309–323. 26. Wei LJ, Durham S. The randomized play-the-winner rule in medical trials. J Am Stat Assoc. 1978;73:840–843. 27. Biswas A. Generalized delayed response in randomized play-the-winner rule. Comm Stat Simu Compu. 2003;32(1):259–274. 28. Karrison TG, Huo D, Chappell R. A group sequential, response-adaptive design for randomized clinical trials. Contr Clin Trials. 2003;24:506–522. 29. Zelen M. A new design for randomized clinical trials. N Engl J Med. 1979;300:1242–1246. 30. Hu F, Rosenberger WF. Response Adaptive Randomization. Hoboken, NJ: Wiley & Sons; 2006.
31. Thall P, Wathen JK. Practical Bayesian adaptive randomisation in clinical trials. Eur J Cancer. 2007;43(5):859–866. 32. Berry DA, Eick SG. Adaptive assignment versus balanced randomization in clinical trials: a decision analysis. Stat Med. 1995;14:231–246. 33. Maki RG, Wathen JK, Patel SR, et al. Randomized phase II study of gemcitabine and docetaxel compared with gemcitabine alone in patients with metastatic soft tissue sarcomas: results of sarcoma alliance for research through collaboration study 002 [corrected]. J Clin Oncol. 2007 Jul;25(19):2755–2763. 34. Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 1933;25(3/4):285–294. 35. Cancer Therapy Evaluation Program. Common Terminology Criteria for Adverse Events, Version 3.0. DCTD, NCI, NIH, DHHS; March 31, 2003 (http://ctep.cancer.gov). 36. Goodman SN, Zahurak M, Piantadosi S. Some practical improvements in the continual reassessment method for phase I studies. Stat Med. 1995;14:1149–1161. 37. Moller S. An extension of the continual reassessment methods using a preliminary up-and-down design in a dose finding study in cancer patients, in order to investigate a greater range of doses. Stat Med. 1995;14:911–922. 38. Faries D. Practical modifications of the continual reassessment method for phase I cancer clinical trials. J Biopharm Stat. 1994;4:147–164. 39. Piantadosi S, Fisher JD, Grossman S. Practical implementation of a modified continual reassessment method for dose-finding trials. Cancer Chemother Pharmacol. 1998;41:429–436. 40. Mandrekar SJ, Cui Y, Sargent DJ. An adaptive phase I design for identifying a biologically optimal dose for dual agent drug combinations. Stat Med. 2007 May 20;26(11):2317–2330. 41. Braun T.
The bivariate continual reassessment method: extending the CRM to phase I trials of two competing outcomes. Cont Clin Trials. 2002;23:240–256. 42. Thall PF, Cook JD. Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004 Sep;60(3):684–693. 43. Ishizuka N, Ohashi Y. The continual reassessment method and its applications: a Bayesian methodology for phase I cancer clinical trials. Stat Med. 2001;20:2661–2681. 44. Zohar S, Chevret S. Phase I (or phase II) dose-ranging clinical trials: proposal of a two-stage Bayesian design. J Biopharm Stat. 2003;13(1):87–101. 45. Garrett-Mayer E. The continual reassessment method for dosefinding studies: a tutorial. Clin Trials. 2006;3(1):57–71. 46. Babb J, Rogatko A, Zacks S. Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998 May 30;17(10):1103–1120. 47. Loeb DM, Garrett-Mayer E, Hobbs RF, et al. Dose-finding study of 153Sm-EDTMP in patients with poor-prognosis osteosarcoma. Cancer. 2009 Jun 1;115(11):2514–2522. 48. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556. 49. Pocock SJ. Interim analyses for randomized clinical trials: the group sequential approach. Biometrics. 1982;38(1);153–162. 50. Koyama T, Chen H. Proper inference from Simon’s two-stage designs. Stat Med. 2009;27(16):3145–3154. 51. Lee JJ, Liu DD. A predictive probability design for phase II cancer clinical trials. Clin Trials. 2008;5:93–106. 52. Muss HB, Berry DA, Cirrincione CT, et al. Adjuvant chemotherapy in older women with early-stage breast cancer. N Engl J Med. 2009 May 14;360(20):2055–2065. 53. Berry DA. Bayesian clinical trials. Nat Rev Drug Discov. 2006 Jan;5(1):27–36. 54. Chuang-Stein C, Anderson K, Gallo P, Collins S. Sample size reestimation: a review and recommendations. Drug Infor J. 2006;40:475–484.
55. Jennison C, Turnbull BW. Adaptive and nonadaptive group sequential tests. Biometrika. 2006;93:1–21. 56. Jennison C, Turnbull BW. Efficient group sequential designs when there are several effect sizes under consideration. Stat Med. 2006;25:917–932. 57. Banerjee A, Tsiatis AA. Adaptive two-stage designs in phase II clinical trials. Stat Med. 2006;25(19):3382–3395. 58. Wang MD. Sample size reestimation by Bayesian prediction. Biom J. 2007 Jun;49(3):365–377. 59. Maca J, Bhattacharya S, Dragalin V, Gallo P, Krams M. Adaptive seamless phase II/III designs: background, operational aspects, and examples. Drug Infor J. 2006;40:463–473. 60. Inoue LY, Thall PF, Berry DA. Seamlessly expanding a randomized phase II trial to phase III. Biometrics. 2002 Dec;58(4):823–831. 61. Berry DA, Müller P, Grieve AP, et al. Adaptive Bayesian designs for dose-ranging drug trials. In: Gatsonis C, Kass RE, Carlin B, et al. (eds.), Case Studies in Bayesian Statistics V. New York: Springer-Verlag; 2001:99–181. 62. Chow SC, Chang M. Adaptive Design Methods in Clinical Trials. Boca Raton, FL: Chapman and Hall; 2007.
38
Where Do We Need to Go with Clinical Trials in Oncology?
Andrea L. Harzstark Eric J. Small
The cost of developing a new drug reached over $800 million in 2000 (1), with over half of this cost accounted for by clinical trial costs. Clinical trial costs have grown more than five times as quickly as the costs for preclinical research and development (2), with increasing costs for subject recruitment and increasing clinical trial sizes. Phase III studies are by far the most expensive. In addition, only one out of five drugs that begin clinical trials in humans is ultimately approved by the FDA and marketed. In the face of these rising costs, and with diminishing budget resources becoming more of a limiting factor, it is becoming increasingly important to design more informative and definitive clinical trials. In addition, a more thorough understanding of the complexities inherent in drug design and evaluation and the factors necessary for drug efficacy necessitates a rational approach to drug evaluation that avoids empiricism. This chapter will discuss the need to focus on target identification and validation, pharmacokinetics (PK), mechanisms of resistance to therapy, patient-related characteristics such as pharmacogenomics, risk assessment, and the integration of biomarkers and imaging as important components of future clinical trials.
TARGET DEFINITION AND VALIDATION Rather than focus on screening large libraries of compounds in tissue culture, rational design requires the identification and characterization of targets and pathways
that drive cancer growth and designing drugs that target these abnormalities. This involves a rigorous requirement to rule out targets that represent epiphenomena and for evidence of causation in drug design, with a focus on evaluating drugs with clear mechanistic data. For example, patients with von Hippel-Lindau (VHL) syndrome have long been known to have a propensity to develop renal cell carcinoma, and the VHL gene was identified using genetic linkage analysis in 1993 (3). The discovery that this tumor suppressor gene is mutated in most sporadic clear cell renal cell carcinomas but not in nonclear cell renal cell carcinomas led to an understanding of the VHL gene pathway and its role in the pathogenesis of renal cell carcinoma. The VHL protein, in combination with other proteins, targets the alpha subunit of hypoxia-inducible factor (HIF) for ubiquitin-mediated degradation. HIF regulates the production of downstream genes, such as vascular endothelial growth factor (VEGF), which regulate renal cell carcinoma growth. The importance of this pathway was reinforced by the observation that mutations in the VHL gene are associated with increased VEGF promoter activity, which led to the knowledge that VEGF levels are increased in most patients with clear cell renal cell carcinoma (4, 5). These observations suggested that VEGF abnormalities are not only present in patients with renal cell carcinoma, but also likely play a pathogenic role and correlate with prognosis (6). This made VEGF an obvious target for renal cell carcinoma therapy. Therefore, it is no surprise that therapies targeting HIF and VEGF have made great
strides and replaced cytokine-based therapy, which was not based on strong mechanistic data but rather on the empiric observation of sporadic responses in patients with renal cell carcinoma (7). The development of this knowledge has led to the use of multiple drugs in clear cell carcinoma to address these defects. There is now phase III evidence to support the use of the anti-VEGF antibody bevacizumab, which has been shown, in combination with interferon-α, to be more effective in prolonging progression-free survival (PFS) than interferon alone (8, 9). In addition, temsirolimus and everolimus, drugs that inhibit mTOR, a target upstream of HIF, have been demonstrated to prolong survival in the first-line setting and PFS in the second-line setting, respectively (10). The tyrosine kinase inhibitors sunitinib and sorafenib, which target the VEGF receptors VEGFR-1 and VEGFR-2, have been demonstrated to prolong overall survival (OS) and PFS in the first- and second-line settings, respectively (11, 12, 13). The clinical evaluation of these drugs has been accompanied by correlative work to demonstrate that the intended target is actually hit by the drug. For example, evaluation of everolimus has revealed that it inactivates the phosphorylation of ribosomal protein S6 kinase 1 (S6K1) and that this inactivation of S6K1 is associated with antitumor activity in everolimus-treated rats. The efficacy of intermittent treatment schedules has been associated with prolonged inactivation of S6K1 in tumors as well as surrogate tissues, suggesting that monitoring of S6K1 activity in peripheral blood mononuclear cells can be used to determine everolimus treatment schedules (14). As the example of clear cell renal carcinoma demonstrates, a key component of rational design is the identification of a target and validation of that target as an important component of tumor pathogenesis.
Validated targets are often proteins that play a direct role in malignant transformation or normal receptors and signaling proteins in pathways regulating apoptosis or the cell cycle, which are overexpressed or dysregulated by a mutation, resulting in the loss of protein function. For example, HER-2/neu (c-erbB-2), a member of the epidermal growth factor receptor (EGFR) family, is overexpressed because of gene amplification in 25% of breast cancers. It is associated with an elevated mitotic rate and correlates both with poor prognosis and with poor clinical response to some chemotherapeutic and antihormonal drugs. RhuMAb HER2 (or trastuzumab) is a humanized antibody, which was derived from 4D5, a murine monoclonal antibody that recognizes HER2/neu and inhibits activity of the epidermal growth factor. Overexpression of HER-2/neu is not only prognostic, however; it is also predictive of a response to trastuzumab with an improvement in OS (15).
Recognition of the appropriate target also allows the focus of treatment to be on patients who are likely to respond. Identification of HER-2/neu overexpression is a requirement for treatment with trastuzumab, eliminating patients who are unlikely to respond. Use of targeted agents involves altering the design of phase I studies (16). The goal of a phase I study for a cytotoxic agent is typically to determine the maximum tolerated dose (MTD): the highest dose at which toxicity remains below a prespecified limit. This involves the presumption that giving as much drug as can be tolerated will lead to the greatest clinical effect. This is likely to be the case with standard cytotoxic agents, which generally exploit the differences in mitotic rate or functional cell cycle machinery between cancer cells and noncancer cells. In contrast, the focus for a study evaluating a targeted agent is entirely different. Safety is, of course, always an important focus, but the primary end point of such a study is frequently to determine the dose at which the target is inhibited. The goal becomes determination of the optimal biologic dose, which is not necessarily the MTD. PK, pharmacodynamic (PD), and toxicity evaluations all play an important role in such studies, making them more complex to design and execute than standard studies of cytotoxic agents. Performance of such studies also invites the question of who the participants should be. Phase I studies of cytotoxic agents are frequently undertaken in patients with all or varying cancer diagnoses and no other therapeutic options. Instead, the focus with a targeted agent should ideally be on patients with target-bearing tumors, with an understanding of the molecular pathways involved.
For example, although KRAS (Kirsten rat sarcoma viral oncogene homolog) mutations are not associated with EGFR (epidermal growth factor receptor) abnormalities, the presence of KRAS mutations confers resistance to therapies targeting EGFR in lung and colorectal cancers (17, 18). Therefore, in designing a clinical trial for an agent targeting EGFR in colorectal cancer, the optimal patient for inclusion would not harbor a KRAS mutation, because such patients are less likely to respond to the therapy. This design requires not only an understanding of the pathways involved but also a practical and cost-effective way to assess the presence of the target and changes in the target to evaluate the optimal biologic dose.
MECHANISMS OF RESISTANCE
An understanding of mechanisms of resistance to current treatments is necessary if new therapies are to improve outcomes. This mandates that the focus of clinical trials be on addressing these mechanisms of resistance in a
stepwise fashion. For example, ketoconazole, an adrenal androgen inhibitor that works by inhibiting cytochrome P450 14α-demethylase, a catalyst of the conversion of lanosterol to cholesterol, plays a role in the management of castration-resistant prostate cancer (CRPC) as a secondary hormonal manipulation (19). It has been demonstrated that at the time of progression on ketoconazole, adrenal androgen levels rise, suggesting that progression occurs not because of resistance to the low-testosterone state, but rather because of an inability of the drug to continue to suppress adrenal androgen production. Therefore, it was hypothesized that a drug targeting a different component of the adrenal androgen production pathway could be effective in treating patients with progression on ketoconazole. Abiraterone acetate, a 17α-hydroxylase/C17,20-lyase inhibitor, has demonstrated activity in patients with ketoconazole resistance (20). These patients have increases in their adrenal androgen levels that can be suppressed with abiraterone. Thus, an understanding of the mechanism allows both the design and the clinical testing of an agent in a more focused and efficient setting.
PK
In addition to incorporating mechanisms of activity and resistance into study design, it is critical to incorporate PK end points in the design of studies as well. For example, in evaluating VEGF inhibitors, solubility and stability were determined to be vital factors affecting drug availability. Indolin-2-one structural analogs were determined to have optimal solubility, and sunitinib was chosen as a lead clinical candidate among these because it was 10- to 30-fold more potent against VEGFR2 and PDGFR-β in biochemical and cell-based assays (21, 22). The agent also demonstrated greater solubility and stability in serum compared with other structurally related analogs. Based on this information, dosing studies were then performed in sensitive in vivo models to assess the minimal concentrations required to inhibit these receptor tyrosine kinases in tumor-bearing mice, taking into account the extensive plasma protein binding of sunitinib (approximately 95% in humans and mice). The dose required to achieve the target plasma concentration necessary to induce sustained inhibition of the targets was determined. Phase I studies in humans utilized these data and included a goal of achieving these in vivo concentrations, with confirmation that inhibition of the planned targets was achieved. This understanding formed the basis of dosing schedules in humans, demonstrating that a rational stepwise strategy is more likely to result in clinical success than an empiric approach.
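The protein-binding arithmetic above is worth making explicit: with roughly 95% of drug bound to plasma proteins, only about 5% of a measured total concentration is free to inhibit its targets. A minimal sketch of that calculation (function names, units, and the threshold logic are illustrative, not drawn from any PK software or from sunitinib's actual parameters):

```python
def free_concentration(total_conc: float, bound_fraction: float) -> float:
    """Unbound (free) drug concentration, given the total plasma
    concentration and the fraction bound to plasma proteins."""
    if not 0.0 <= bound_fraction < 1.0:
        raise ValueError("bound_fraction must be in [0, 1)")
    return total_conc * (1.0 - bound_fraction)


def achieves_free_target(total_conc: float, bound_fraction: float,
                         free_target: float) -> bool:
    """Check whether a measured total concentration delivers at least the
    free-drug exposure needed for sustained target inhibition."""
    return free_concentration(total_conc, bound_fraction) >= free_target


# With ~95% protein binding, a total level of 100 (arbitrary units)
# corresponds to roughly 5 units of free drug.
```

This is why extensive protein binding had to be folded into the in vivo dosing studies: the target plasma concentration must be set in terms of free drug, not total drug.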
PATIENT/PHARMACOGENOMIC FACTORS
As medicine becomes more personalized, patient-specific factors, such as pharmacogenomics and risk assessment according to nomograms, are likely to become a more important component of treatment decisions. Germline genetic variations may determine how hormones or drugs are metabolized. For example, germline genetic variation in the androgen axis has been evaluated in a cohort of 529 men with advanced prostate cancer on androgen deprivation therapy (23). One hundred twenty-nine DNA polymorphisms were evaluated in 20 different genes involved in androgen metabolism, and three (CYP19A1, HSD3B1, and HSD17B4) were significantly associated with time to progression (P < 0.01) during androgen deprivation therapy. This has not yet impacted how patients are treated in a standardized way, but it has begun to make its way into patient selection for clinical trials. It may ultimately be used to identify patients likely to have early progression on androgen deprivation therapy who may be candidates for more aggressive treatment with other therapies. As another example, tamoxifen is a pro-drug that requires metabolism by CYP2D6 to its active form. Polymorphisms in CYP2D6 have been correlated with outcome in tamoxifen-treated breast cancer patients. In a retrospective cohort of 206 tamoxifen-treated and 280 tamoxifen-untreated breast cancer patients, patients with alleles associated with decreased enzyme function had more recurrences, a shorter relapse-free time, and worse event-free survival than patients who carried functional alleles. These differences were not present in patients who were not treated with tamoxifen (24). CYP2D6 testing is not yet routinely performed in patients for whom tamoxifen use is intended, but it is becoming more prevalent. When testing is performed, if polymorphisms that predict a poorer outcome with tamoxifen use are identified, an alternative agent is chosen.
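CYP2D6 genotype is usually translated into a metabolizer phenotype by summing per-allele activity scores across the two inherited alleles. The sketch below is illustrative only: the allele scores and phenotype cutoffs mimic the general shape of published activity-score systems but are simplified and should not be taken as the authoritative assignments, which belong to current pharmacogenomic guidelines.

```python
# Illustrative per-allele activity scores (simplified; not authoritative).
ALLELE_ACTIVITY = {
    "*1": 1.0,    # functional
    "*2": 1.0,    # functional
    "*4": 0.0,    # nonfunctional
    "*5": 0.0,    # gene deletion
    "*10": 0.25,  # reduced function
    "*41": 0.5,   # reduced function
}


def cyp2d6_phenotype(allele_a: str, allele_b: str) -> str:
    """Sum the two allele activity scores and map the total to a
    metabolizer phenotype (cutoffs illustrative)."""
    score = ALLELE_ACTIVITY[allele_a] + ALLELE_ACTIVITY[allele_b]
    if score == 0.0:
        return "poor"
    if score < 1.25:
        return "intermediate"
    if score <= 2.25:
        return "normal"
    return "ultrarapid"
```

A patient flagged as a poor or intermediate metabolizer by logic of this kind is the one for whom, as described above, an alternative to tamoxifen might be chosen.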
This is the goal: a pharmacogenomic factor that is correlated with outcome and can be used to guide treatment decisions. Risk assessment models may also be useful in risk stratifying patients for choice of therapies. Halabi and colleagues developed a risk assessment model for patients with metastatic CRPC and demonstrated that performance status, hemoglobin, lactate dehydrogenase (LDH), alkaline phosphatase, Gleason score, prostate-specific antigen (PSA), and the presence of visceral disease predict patient outcome (25). This model was then used to develop a nomogram, which is useful for multiple reasons: it allows the rough comparison of phase II trial results to predicted results/historical controls in obtaining a preliminary estimate of efficacy for new
therapies, it allows patients to be risk stratified for study entry, and it allows therapies to be targeted to patients who may be most likely to benefit. For example, CALGB 90401, a study of docetaxel and prednisone with or without bevacizumab for metastatic castration-resistant prostate cancer, prospectively stratifies patients using the Halabi model to ensure that baseline prognosis is similar in the two groups. In another risk assessment model, patients with metastatic prostate cancer and pain have been shown in a retrospective study to have a worse outcome, with shorter survival (26). This information has led to targeting of these patients for more aggressive therapies. For example, two clinical trials involving the prostate cancer vaccine GVAX were performed. For patients with metastatic prostate cancer and no pain, a randomized phase III study of docetaxel versus GVAX was planned, whereas for patients with pain, a randomized phase III study of docetaxel plus GVAX versus docetaxel alone was planned (27, 28). Differences in clinical outcome based on clinical factors need to be prospectively validated, but their incorporation into clinical trial design will allow therapies to be targeted to the appropriate populations.
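Mechanically, risk models of this kind reduce to a Cox-style linear predictor: a weighted sum of baseline covariates, with prespecified cutpoints defining the strata used at registration. A sketch follows; the covariate names track the factors in the Halabi model, but the weights and cutpoints are made up for illustration and are NOT the published coefficients.

```python
# Hypothetical weights for illustration only -- not the published Halabi
# model coefficients. Continuous labs are entered on the log scale.
HYPOTHETICAL_WEIGHTS = {
    "performance_status": 0.40,  # higher (worse) ECOG PS -> higher risk
    "hemoglobin": -0.20,         # higher hemoglobin -> lower risk
    "log_ldh": 0.55,
    "log_alk_phos": 0.30,
    "gleason_8_plus": 0.35,      # 1 if Gleason >= 8, else 0
    "log_psa": 0.25,
    "visceral_disease": 0.50,    # 1 if present, else 0
}


def prognostic_index(patient: dict) -> float:
    """Linear predictor: weighted sum of the patient's covariates."""
    return sum(w * patient[name] for name, w in HYPOTHETICAL_WEIGHTS.items())


def risk_stratum(index: float, low_cut: float, high_cut: float) -> str:
    """Map the index to a stratum using prespecified cutpoints, as a trial
    might do at registration to balance baseline prognosis across arms."""
    if index < low_cut:
        return "low"
    return "intermediate" if index < high_cut else "high"
```

Stratifying at registration, as CALGB 90401 does with the Halabi model, is what ensures the randomized arms are comparable on baseline prognosis.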
INTEGRATION OF BIOMARKERS AND IMAGING
In addition to rational design with the use of target identification and validation, the future of clinical trial design will require determination and integration of biomarkers. A biomarker must be an effective marker of outcome, meaning a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. For a biomarker to be effective, the effect of an intervention on the biomarker must reliably predict the overall effect on the clinical outcome (and not merely correlate with outcome) (29). This is frequently the requirement that fails in clinical testing. Failure can occur because the disease process affects the clinical outcome through pathways not mediated by the marker. Alternatively, the intervention might affect the clinical outcome through mechanisms of action that are independent of its effect on the marker. The ideal biomarker is easily measured. Tumor factors that may predict response to a therapy but require fresh tissue from a biopsy are difficult both to evaluate and to use therapeutically. Factors that can be measured in the serum or urine are preferable because they can be obtained repeatedly and may be
useful in monitoring a response to therapy. For example, circulating tumor cells appear to correlate with prognosis and may reflect a response to therapy in multiple malignancies. They are likely to be an effective biomarker in metastatic breast cancer, in which elevations have been shown to correlate with poor prognosis. In addition, changes in circulating tumor cells with therapy correlate with PFS and OS in patients with metastatic disease (30). However, additional prospective validation is needed before modifications in circulating tumor cells can be used to routinely monitor therapy or serve as the end point for clinical trials. In addition, the findings demonstrated in breast cancer have not been as well validated in other malignancies. There is increasing interest in imaging biomarkers as the use of targeted agents has expanded. These agents are often thought to produce cytostasis rather than cytotoxicity, resulting in tumor stability rather than shrinkage, making routine use of computed tomography (CT) or magnetic resonance imaging (MRI) to monitor response to therapy less useful. Imaging biomarkers that are able to detect antitumor activity for agents with different mechanisms of action are needed. There is additional interest in positron emission tomography (PET) and MRI techniques, such as dynamic contrast-enhanced MRI (DCE-MRI), that are able to evaluate microvascular changes (31). Currently, there are no FDA-approved imaging biomarkers, but multiple modalities are undergoing evaluation, including measurement of change in Hounsfield units on CT, which is used to evaluate the ability of the tumor to accumulate contrast medium; measurement of the enhancing fraction on contrast CT and MRI, which is used to measure the status of tumor vasculature; and PET with 18-fluorodeoxyglucose, which is used to measure glucose transport and metabolism in tumors.
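For context, the size-based response assessment that these imaging biomarkers aim to supplement can be sketched from RECIST 1.1-style thresholds for target lesions. This is a simplified rendering: the full criteria also handle non-target lesions, new lesions, and response confirmation.

```python
def recist_response(baseline_sum_mm: float, current_sum_mm: float,
                    nadir_sum_mm: float) -> str:
    """Classify target-lesion response with RECIST 1.1-style thresholds:
    CR = disappearance of all target lesions (sum of diameters == 0);
    PR = >= 30% decrease in the sum of diameters from baseline;
    PD = >= 20% increase over the nadir AND >= 5 mm absolute increase;
    otherwise SD (stable disease)."""
    if current_sum_mm == 0:
        return "CR"
    if current_sum_mm <= 0.7 * baseline_sum_mm:
        return "PR"
    if (current_sum_mm >= 1.2 * nadir_sum_mm
            and current_sum_mm - nadir_sum_mm >= 5.0):
        return "PD"
    return "SD"
```

Under rules like these, a cytostatic agent that merely holds lesions steady registers only SD, which is exactly why functional imaging end points such as DCE-MRI are attractive for such drugs.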
The use of radiolabeled dihydrotestosterone as a PET tracer that labels cells containing the androgen receptor is also being evaluated in prostate cancer, a disease in which traditional fluorodeoxyglucose (FDG) PET scans are not generally useful. Additional validation of these modalities will be needed before they can be used to evaluate new therapies. Use of survival end points in prostate cancer, a disease with a relatively long natural history, means that follow-up takes multiple years, delaying progress and adding expense to clinical trials. It is vital to determine a way to evaluate response more expeditiously without losing the ability to accurately predict outcome. The current lack of acceptable biomarkers is a considerable hindrance to drug development in cancer. The use of appropriate intermediate markers of outcome would significantly shorten the time required to evaluate new drugs. For example, in studies of prostate cancer, PSA
nadir on androgen deprivation therapy is known to be a predictor of survival (32). However, some agents clearly modulate PSA without modulating disease, and changes in PSA do not necessarily correlate with survival in every disease state (33). Therefore, PSA is not a universal biomarker, and modulations in PSA are not an approvable end point for new drugs. PSA doubling time (PSADT) has demonstrated the ability to predict survival in specific settings, but changes in PSADT have not been correlated with survival; therefore, it is also not an acceptable intermediate marker of outcome (34, 35). The search for biomarkers must become a goal of clinical trials rather than merely an exploratory end point. This prevents the problem of not knowing whether a biomarker is unsuccessful because it is not useful or because it was evaluated in conjunction with a therapy that was not effective. In addition, it increases the likelihood that a useful biomarker will be identified. For example, the I-SPY study uses serial breast biopsies and MRI to measure response to neoadjuvant chemotherapy in a multicenter study of 237 women with locally advanced breast cancer (http://www.bcrfcure.org/action_0708grantees_esserman.html). Patients received anthracycline-based chemotherapy and had serial MRI and core biopsies performed prior to chemotherapy, after one cycle, during treatment, and before surgery. The primary end point of the study is to identify surrogate markers of response that are predictive of pathologic complete response and survival in women with stage II and stage III breast cancer, with the ultimate goal of identifying nonresponders early in order to design more effective therapies for them. Determining molecular predictors of response and validating them offers the possibility of optimizing therapies and improving responses more rapidly.
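The PSADT discussed above is conventionally estimated from the least-squares slope of the natural log of PSA over time, with doubling time = ln 2 / slope. A minimal sketch of that calculation (illustrative; clinical calculators may differ in which PSA values they include and how they handle assay noise):

```python
import math


def psa_doubling_time(times_months, psa_values):
    """Estimate PSA doubling time (in months) from serial PSA measurements,
    using the least-squares slope of ln(PSA) versus time:
    PSADT = ln(2) / slope."""
    if len(times_months) != len(psa_values) or len(times_months) < 2:
        raise ValueError("need >= 2 paired (time, PSA) measurements")
    logs = [math.log(p) for p in psa_values]
    n = len(times_months)
    t_bar = sum(times_months) / n
    y_bar = sum(logs) / n
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times_months, logs))
    sxx = sum((t - t_bar) ** 2 for t in times_months)
    slope = sxy / sxx
    if slope <= 0:
        raise ValueError("PSA not rising; doubling time undefined")
    return math.log(2) / slope

# A PSA that doubles every 6 months (4 -> 8 -> 16 ng/mL over 12 months)
# yields a PSADT of 6 months.
```

The same arithmetic underlies why PSADT can predict survival in specific settings yet fail as an intermediate end point: the quantity is well defined, but drug-induced changes in it have not been shown to track survival.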
Making the determination of biomarkers using a strong biologic rationale a focus of clinical studies, rather than an exploratory or correlative end point, is a vital first step.
WHERE WE'RE HEADED: HOW TO DO IT RIGHT
The movement to include more targeted agents for use in cancer therapy is emblematic of the direction that clinical trials in oncology need to take. A focus on rational drug design, with identification of targets that play a pathogenic role in disease rather than merely being present, increases the likelihood of a targeted therapy playing an important role. Early incorporation of PK measurements and target measurements will make drug failure secondary to an inappropriate schedule or
subtherapeutic levels less likely. Attention to patient-specific factors, such as pharmacogenomic parameters and information from nomograms, in the planning of studies, rather than as an afterthought, will increase the likelihood that the population most likely to benefit from a drug is targeted for evaluation. Finally, a focus on biomarker identification and evaluation is a vital way to evaluate and optimize agents quickly. These changes leverage the exciting advances in the understanding of the pathogenesis and treatment of disease so that new agents can be translated into the clinical setting with the best possible clinical outcomes.
References
1. DiMasi J, Paquette C. The economics of follow-on drug research and development: trends in entry rates and the timing of development. PharmacoEconomics. 2004;22:1–14.
2. DiMasi JA, Grabowski HG. Economics of new oncology drug development. J Clin Oncol. 2007;25(2):209–216.
3. Latif F, Tory K, Gnarra J, et al. Identification of the von Hippel-Lindau disease tumor suppressor gene. Science. 1993;260:1317–1320.
4. Nicol D, Hii SI, Walsh M, et al. Vascular endothelial growth factor expression is increased in renal cell carcinoma. J Urol. 1997;157(4):1482–1486.
5. Siemeister G, Weindel K, Mohrs K, Barleon B, Martiny-Baron G, Marmé D. Reversion of deregulated expression of vascular endothelial growth factor in human renal carcinoma cells by von Hippel-Lindau tumor suppressor protein. Cancer Res. 1996;56(10):2299–2301.
6. Jacobsen J, Rasmusen T, Grankvist K, Ljungberg B. Vascular endothelial growth factor as prognostic factor in renal cell carcinoma. J Urol. 2000;163(1):343–347.
7. Quesada J, Swanson DA, Trindale A, Gutterman JU. Renal cell carcinoma: antitumor effects of leukocyte interferon. Cancer Res. 1983;43(2):940–947.
8. Escudier B, Pluzanska A, Koralewski P, et al. Bevacizumab plus interferon alfa-2a for treatment of metastatic renal cell carcinoma: a randomised, double-blind phase III trial. Lancet. 2007;370(9605):2103–2111.
9. Rini BI, Halabi S, Rosenberg JE, et al. Bevacizumab plus interferon alfa compared with interferon alfa monotherapy in patients with metastatic renal cell carcinoma: CALGB 90206. J Clin Oncol. 2008;26(33):5422–5428.
10. Motzer R, Escudier B, Oudard S, et al. Efficacy of everolimus in advanced renal cell carcinoma: a double-blind, randomised, placebo-controlled phase III trial. Lancet. 2008;372:449–456.
11. Motzer R, Bukowski RM, Figlin RA, Hutson TE, Michaelson MD, et al. Prognostic nomogram for sunitinib in patients with metastatic renal cell carcinoma. Cancer. 2008;113(7):1552–1558.
12. Motzer R, Hutson TE, Tomczak P, et al. Sunitinib versus interferon alfa in metastatic renal cell carcinoma. N Engl J Med. 2007;356(2):115–124.
13. Escudier B, Eisen T, Stadler WM, et al. Sorafenib in advanced clear cell renal cell carcinoma. N Engl J Med. 2007;356(2):125–134.
14. Boulay A, Zumstein-Mecker S, Stephan C, et al. Antitumor efficacy of intermittent treatment schedules with the rapamycin derivative RAD001 correlates with prolonged inactivation of ribosomal protein S6 kinase 1 in peripheral blood mononuclear cells. Cancer Res. 2004;64(1):252–261.
15. Romond EH, Perez EA, Bryant J, et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med. 2005;353(16):1673–1684.
16. Fox E, Curt GA, Balis FM. Clinical trial design for target-based therapy. Oncologist. 2002;7(5):401–409.
17. Tsao A, Tang XM, Sabloff B, et al. Clinicopathologic characteristics of the EGFR gene mutation in non-small cell lung cancer. J Thorac Oncol. 2006;1(3):231–239.
18. DiFiore M, Charbonnier F, Lefebure B, et al. Clinical interest of KRAS mutation detection in blood for anti-EGFR therapies in metastatic colorectal cancer. Br J Cancer. 2008;99(3):551–552.
19. Small E, Halabi S, Dawson NA, et al. Antiandrogen withdrawal alone or in combination with ketoconazole in androgen-independent prostate cancer patients: a phase III trial (CALGB 9583). J Clin Oncol. 2004;22:1025–1033.
20. Ryan C, Smith MR, Rosenberg JE, et al. Impact of prior ketoconazole therapy on response proportion to abiraterone acetate, a 17-alpha hydroxylase C17,20-lyase inhibitor in castration resistant prostate cancer (CRPC). J Clin Oncol. 2008;26(May 20 suppl):abstract 5018.
21. Chow L, Eckhardt SG. Sunitinib: from rational design to clinical efficacy. J Clin Oncol. 2007;25(7):884–896.
22. Mendel D, Laird AD, Xin X, et al. In vivo antitumor activity of SU11248, a novel tyrosine kinase inhibitor targeting vascular endothelial growth factor and platelet-derived growth factor receptors: determination of a pharmacokinetic/pharmacodynamic relationship. Clin Cancer Res. 2003;9(1):327–337.
23. Ross R, Oh WK, Xie W, et al. Inherited variation in the androgen pathway is associated with the efficacy of androgen-deprivation therapy in men with prostate cancer. J Clin Oncol. 2008;26(6):842–847.
24. Schroth W, Antoniadou L, Fritz P, et al. Breast cancer treatment outcome with adjuvant tamoxifen relative to patient CYP2D6 and CYP2C19 genotypes. J Clin Oncol. 2007;25(33):5187–5193.
25. Halabi S, Small EJ, Kantoff PW, et al. Prognostic model for predicting survival in men with hormone-refractory metastatic prostate cancer. J Clin Oncol. 2003;21(7):1232–1237.
26. Halabi S, Vogelzang NJ, Kornblith AB, et al. Pain predicts overall survival in men with metastatic castration-refractory prostate cancer. J Clin Oncol. 2008;26(15):2544–2549.
27. Higano C, Saad F, Somer B, et al. A phase III trial of GVAX immunotherapy for prostate cancer versus docetaxel plus prednisone in asymptomatic, castration-resistant prostate cancer (CRPC). In: ASCO Genitourinary Cancers Symposium. Orlando, FL; 2009.
28. Small E, Demkow T, Gerritsen WR, et al. A phase III trial of GVAX immunotherapy for prostate cancer in combination with docetaxel versus docetaxel plus prednisone in symptomatic, castration-resistant prostate cancer (CRPC). In: ASCO Genitourinary Cancers Symposium. Orlando, FL; 2009.
29. Prentice R. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989;8(4):431–440.
30. Budd G, Cristofanilli M, Ellis MJ, et al. Circulating tumor cells versus imaging—predicting overall survival in metastatic breast cancer. Clin Cancer Res. 2006;12(21):6403–6409.
31. O'Connor J, Jackson A, Asselin MC, Buckley DL, Parker GJ, Jayson GC. Quantitative imaging biomarkers in the clinical development of targeted therapeutics: current and future perspectives. Lancet Oncol. 2008;9(8):766–776.
32. Hussain M, Tangen CM, Higano C, et al. Absolute prostate-specific antigen value after androgen deprivation is a strong independent predictor of survival in new metastatic prostate cancer: data from Southwest Oncology Group Trial 9346 (INT-0162). J Clin Oncol. 2006;24(24):3984–3990.
33. Armstrong A, Garrett-Mayer E, Ou Yang YC, et al. Prostate-specific antigen and pain surrogacy analysis in metastatic hormone-refractory prostate cancer. J Clin Oncol. 2007;25(25):3965–3970.
34. Semeniuk R, Venner PM, North S. Prostate-specific antigen doubling time is associated with survival in men with hormone-refractory prostate cancer. Urology. 2006;68(3):565–569.
35. Valicenti R, DeSilvio M, Hanks GE, et al. Post-treatment prostate-specific antigen doubling time as a surrogate endpoint for prostate cancer-specific survival: an analysis of Radiation Therapy Oncology Group Protocol 92-02. Int J Radiat Oncol Biol Phys. 2006;66(4):1064–1071.
Index
Note: Page references followed by “f ” refer to figures and those in “t” refer to tables. Abbreviations, 121 Abnormal pulmonary function, clinical trials in, 274 Absorption, distribution, metabolism, and excretion (ADME), 251, 257 Abstracts, 180, 287–288 Accelerated approval of drugs, 312–313 Activities of daily living (ADL), 267–268 Acute lymphoblastic leukemia (ALL), 9 Adaptive allocation covariate-adaptive designs, 344–346 response-adaptive designs, 346–348 Adaptive approaches in clinical trials, 353 Adaptive design for biomarkers. See Biomarker-adaptive designs Adaptive Designs Working Group (ADWG), 344, 353 Adaptive dose escalation, 349 Adaptive randomization, 79–80, 115 minimizing imbalance, 80–81 treatment assignment, 79–80 Adaptive sample size and stopping designs, 351–353 multi-arm designs, 352–353 single-arm designs, 351–352 AdEERS reports, 146–148 Adjuvant chemotherapy anthracycline-based versus non-anthracycline-based, 291t integration of biomarkers and imaging in, 361 adjuvantonline.com, 267 ADL. See Activities of daily living (ADL) ADME. See Absorption, distribution, metabolism, and excretion (ADME)
Administration method, 181 Adverse drug reactions (ADRs), 153t, 154 Adverse Event Expedited Reporting System (AdEERS), 134, 143 Adverse events (AEs). See also Serious adverse events (SAEs) data analysis and reporting, 185 definitions of, 153t routine reporting of, 133 selection in data and safety monitoring plans, 126 Adverse events (AEs), collection of, 156 patients on clinical trials, 157–159 practical considerations and suggestions for clinical investigators, 158–160 Adverse events (AEs), reporting of assessment of attributions, 148 background of, 141–142 basic principles of, 142–143 database, 148 FDA guidelines, 143 narratives and supplemental documents, 148 NCI/CTEP guidelines, 143 selection of terms and grades, 147–148 trials under CTEP IND, 144–145t, 149t ADWG. See Adaptive Designs Working Group (ADWG) AEs. See Adverse events (AEs) Agent Specific AE List (ASAEL), 146 Aggregate-level meta-analysis, 292 Agreements, 317–318 Alkylating agents, 9 ALL. See Acute lymphoblastic leukemia (ALL)
Alternative designs better efficiency, 203–205 better prediction, 201–202 Altman, D. G., 194 American College of Radiology Imaging Network (ACRIN), 336 American Joint Committee on Cancer (AJCC), 180 American Society of Clinical Oncology (ASCO), 176, 236 Analysis of trials HRQOL, 262–265 noninferiority trials, 106–107 Analysis of trials, pitfalls in censoring when data are missing, 211–213, 212f competing risks, 209–210, 210f outcome-by-outcome analysis, 210–211 subsets, 208–209 testing after randomization, 205–207, 206f vanishing denominator, 207–208 Analysis of variance (ANOVA), 47 Angiogenesis, 240–241 Animal models, preclinical drug assessment in, 25 Antiangiogenesis effects, 47 Antifolates, 9 Appendices, 121, 129–130 Arsenic, 8 Arterial input function (AIF), 248 ASCO. See American Society of Clinical Oncology (ASCO) Aspirational benefit, defined, 17 As treated analysis, 76 Audits, 339. See also Site audits Authorship, 180, 317 Autologous transplantation, 198 Autumn crocus, 8 Background adverse event reporting, 141–142 investigational study, 122 terminology development, 153–154 Background knowledge, importance of, 30 Baseline tumor measurement forms, 136 Batimastat, 25 Bayesian Adaptive Randomization (BAR), 347–349, 350f Bayesian designs, 44–45, 69 advantages of, 110 examples, 114–117 problems of, 116–117 recommendations, 117 requirements, 110–114 types of, 109–110 Becquerel, Henri, 7 Begg, C. B., 77 Belmont Report, 12 beneficence in, 14 justice in, 15 Beneficence, in clinical research, 14–15 Benefits, 17. See also Risk/benefit analysis Benefit-to-risk ratio, 163, 165
Benivieni, Antonio, 6 “Be ready for use” predictive biomarkers, 220 Bernard, Claude, 7 Berry, D. A., 109, 112 Beta distribution, 45f, 69, 112 Bias definition of, 73 experimental bias, 75 publication bias, 295–296 selection bias, 75 sources of, 74, 316 Biased coin randomization, 78 Bichat, Marie François Xavier, 6 Biologically optimal dose (BOD), 36 Biologic License Application (BLA), 143 accelerated approval, 312–313 approval of oncology drugs, 313 legal basis behind drug approval requirements, 312 Bioluminescence, 25–26 Biomarker-adaptive designs, 223–224 Biomarker-based follow-up design, 222f, 223 Biomarker-based strategy design, 222f, 223 Biomarkers definitions, 215, 241 designs, 221–224, 222f as determinants of efficacy, 252–253 expression in phase I/II dose-finding trial, 115 identification and validation of, 219–221, 220t integration of, 360–361 preventing toxicity, 255–258t prognostic versus predictive, 216f, 217 relationships between treatment and clinical end point, 216f relationships with PKs, 253 selection criteria in clinical trials, 252, 255–258t selection of therapy, 257t statistical criteria for, 215–218 trial designs, 221–224, 222f uses of, 218–219 Biostatistical input, in trial design, 32 Biostatisticians, role of, 2, 32 Bivariate designs, 69–70 BLA. See Biologic License Application (BLA) Bleomycin, 274 Blinded control, 199–200 Blocked randomization, 78–79, 78t, 79t Bonferroni procedure, 96 Bootstrapping, 195–196 Boundary criteria, determining, 113 BRB-ArrayTools, 232 British Medical Research Council, 8 Budget, development of, 318 Burden of research, 15 Byar, D. P., 2 caAERS, 135, 138 CALGB. See Cancer and Leukemia Group B (CALGB) CALGB/SWOG trial, 94
CALGB 9481 trial multivariable analyses, 186–187 poor accrual, 185–186 stratified analysis, 186 CALGB-89803 trial, 168 CALGB 90203 trial, 189 CALGB 90206 trial, 191 CALGB 90401 trial, 190 Calibration, 113–114, 195 Camptothecins, 9 Cancer and Leukemia Group B (CALGB), 135, 138, 181 classification of hepatic and renal impairment, 275t Data and Safety Monitoring Board (DSMB), 185 Pharmacology and Experimental Therapeutics Committee, 272 Cancer Biomedical Informatics Grid (caBIG) of NCI, 131 Cancer Diagnosis Program (DCP) of NCI, 338 Cancer Imaging Program of NCI, 245 Cancer Therapy Evaluation Branch of NCI, 133 Cancer Therapy Evaluation Program (CTEP) of NCI, 153 Cancer Treatment Evaluation Program (CTEP) of NCI, 2, 17, 119, 123, 143, 338 Cancer Trials Support Unit (CTSU), 338 Carcinoembryonic antigen (CEA), 218 Case histories, managing, 127 Case report forms (CRFs), 131 data elements of, 146 events collected in, 146 reporting adverse events, 142–143 types of, 132 Cause specific proportional hazards models, 210 CEA. See Cost-effectiveness analysis (CEA) Cell theory, 6 Censored patients, 182 Censoring definition of, 192–193 pitfalls of, 211–213, 212f Center for Biologics Evaluation and Research (CBER), 307 Center for Devices and Radiological Health (CDRH), 307 Center for Drug Evaluation and Research (CDER), 307 Centers for Medicare and Medicaid Services (CMS), 269 Central Institute Review Board (CIRB), 19. See also Institutional Review Boards (IRBs) CEO Roundtable on Cancer, 318 CFR. See U. S. Code of Federal Regulations (CFR) CGA. See Comprehensive geriatric assessment (CGA) Challenges international collaboration of trials, 342t multiple arm trials, 98 Chemicals, use of, 9 Chemistry, manufacturing, and control (CMC), 310 Chemotherapy bias in, 74 origin of, 8–9 patients with organ impairment, 270 pregnancy and, 256 terminology and concept of, 9 timing of randomization, 76
Chemotherapy agents demonstrating link between PKs and PDs, 256f food effect and, 257 monitoring renal insufficiency, 273t Cheson Criteria for lymphomas, 127 Children, respect for, 13 Children’s Cancer Group, 76 Childs-Pugh system, 272, 275t Chi-square method, 181 Chronic kidney disease, stages of, 273t Chuang-Stein, C., 353 Citrus fruits, 7–8 Classification trees, 193–194 Cleanup phase, in data collection, 138 Clinical benefit (CB), 30 Clinical Community Oncology Program (CCOP), 338 Clinical equipoise, 75–76 Clinical investigators, role of, 158–160 Clinical pharmacologists, role of, 311 Clinical research beneficence, 14–15 conflict of interest, 18 ethical issues with phase I oncology trials, 16–18 ethical issues with tissue and data banking, 18 ethical principles, 11–12 future issues, 19 historical perspectives, 11 IRB approval issues, 16 justice, 15–16 respect for persons, 12–13 Clinical research associates (CRAs), role of, 134, 325 Clinical research nurse (CRN), role of, 324–325, 326t Clinical research organization (CRO) personnel, 156 Clinical Review Division of FDA, 309 Clinical reviewers, role of, 311 Clinical trials conducting consent discussion, 13t definition of, 1, 119 designing, 32 enrollment in, 12t ethics, 11–19 history, 5–9 objectives of, 1 phases, 1–2 resources, 3–4, 3t scope, 2–3 clinicaltrials.gov, 1, 128, 287 Clinical Trials Monitoring Branch (CTMB), 133 Clinical Trials Support Unit (CTSU), 137, 138 Closed-ended questions, 136 Clowes, George, 9 Cockcroft-Gault formula, 270–271, 272t Codes of conduct, 11 Coinvestigators, role of, 323–324 Colchicine, 8 Collaboration with industry, 315 budget, development of, 318 contract negotiation, recognizing, 317–318 investigators’ role in, 316
Collateral benefit, 17 Commercial INDs, 300 Common Data Elements (CDE), 131 Common Rule, 12, 327, 332 Common Terminology Criteria (CTC), 141 Common Terminology Criteria for Adverse Events (CTCAE), 153, 349 advantages of, 155 disadvantage of, 155 grading guidelines, 134 grading scale examples, 155t MedDRA versus, 156 of NCI, 141, 142 Common Toxicity Criteria (CTC) of NCI, 134 Communication, clinical investigators and statisticians, 114, 117 Comorbidity, 267–268 Comparative genomic hybridization, 22 COMPARE computer algorithm, 24 Competing risks, pitfalls of, 209–210, 210f Completeness of follow-up, 183–184, 184f, 184t Complete platelet recovery (CR), 46 Complete remission rate, 45 Complete remission without complete platelet recovery (CRp), 46 Complete response (CR), 30–31, 37, 244, 246t Composite end point, 209 Comprehensive Adverse Event and Potential Risk List (CAEPR), 146 Comprehensive Adverse Event and Potential Risks (CAEPR) list, 134 Comprehensive geriatric assessment (CGA), 268 Computed tomography (CT), 239–240, 242f, 360 surrogate biomarkers imaging, 241, 243f target lesion identification, 242–243 Computer-aided response assessment, 248 Computer programs, and randomization, 77 Concept sheet, protocol writing, 120 Concluding noninferiority, 107–108 Concordance index (c-index), 195 Confidence intervals, 51, 53, 107 Conflict of interest, 18 Consent documents. See Informed consents Consolidated Standards of Reporting Clinical Trials (CONSORT) patient flow diagram, 182, 183f statement, 179, 182 Contact information, 121 “Content and Format of Investigational New Drug Applications (INDs) for Phase I Studies of Drugs, including Well-Characterized, Therapeutic, Biotechnology-Derived Products” (FDA), 302–303 Continual reassessment method (CRM), 64 example of, 351, 352f modifications of, 349–350 Continuous complete remission (CCR) rates, 173 Contract negotiators, role of, 316 Contract review, 318 Contracts, 316
INDEX
Control groups, pitfalls of historical controls, 198–199 randomized controls, 199–200 Controlled Trials, 3 Controls concurrent, 75 historical, 75 Control versus experimental drugs, 94, 94t “Cooperative Group Guidelines for the Development, Conduct and Analysis of Clinical Trials with International Collaborating Institutions,” 342 Cooperative groups, 335, 339t brief history, 336–337 clinical trial contributions, 343t collaboration with NCI, 337–338 cost of trials, 339–340 development of concept and protocol, 339 development of research strategy, 338–339 intergroup collaboration, 338 international collaboration, 340–342, 342t organizational structure of, 338 Coronary Drug Project Research Group, 207 Correlation (r), 47 Correlative studies, 126–127 Cost-effectiveness analysis (CEA) critical inputs, estimating, 280–281, 282 critical questions, 278 examples interventions, 277–278 identifying relevant stakeholders, 278–279, 282 sensitivity analysis, 281, 283t, 283 summarization of outcomes, 280–281, 282–283, 284t time frame, 279, 282 trade-offs, 278, 282 Costs cooperative group trials, 339–340 interim monitoring, 172–175 Covariate-adaptive designs, 344–346 Cox model, 193, 211 Credible region, 45, 45t CRFs. See Case report forms (CRFs) Critical inputs, in CEA model, 279–280, 282 CRM. See Continual reassessment method (CRM) CRN. See Clinical research nurse (CRN) Crossing hazards, 172–173 Cross-sectional data analysis, 263 Cross-validation, 195, 232 CT. See Computed tomography (CT) Curie, Marie, 7 Curie, Pierre, 7 Cytostatic agents, 30 Dahut, W. L., 30 Data and safety monitoring boards (DSMBs), 14–15, 339 Data and safety monitoring plans (DSMPs), 14, 126 Databases collection, 138 meta-analysis, 287 reporting adverse events, 148
Data collection, 136–138 Data extraction, 288–289 Data integrity, importance of, 325 Data management errors in, 116 process of, 127–128 quality control in, 136–138 Data managers (DMs), role of, 325, 327t Data monitoring committee (DMC), role of, 14, 171–172 Data safety and monitoring committee, 105, 338 DCE-MRI (dynamic contrast enhanced MRI), 240–241, 360 measuring efficacy of antiangiogenics and vascular disrupting agents, 243 response assessment criteria, 247 surrogate biomarkers imaging, 242 Death as event, 182 Death times (tj), calculation of, 52–53, 52t Decision-making process, in clinical trials, 17 Decision theory, in clinical trial design, 112 Declaration of Helsinki, 11, 327–328 Deficiencies, major versus minor, 133 Definitions, as component of investigational study, 121 “De Humani Corporis Fabrica” (On the Fabric of the Human Body: Vesalius), 6 Delegation Log, 322 Department of Health, Scotland, 75 “De Sedibus et Causis Morborum per Anatomen Indagatis” (The Seats and Causes of Diseases Investigated by Anatomy: Morgagni), 6 Designing a sound scientific and ethical research study, 15t Designs, pitfalls of alternative designs, 201–205 choice of control group, 197 choice of experimental arm, 200 historical controls, 197–199 randomized controls, 199–200, 199f structural issues, 200–201, 201f Design, 3 + 3, 58–60 Desu, M. M., 88, 90 Developmental Therapeutics Program (DTP) of NCI, 23 Device regulations, 303–305, 306t DFS. See Disease-free survival (DFS) Diagonal linear discriminant analysis, 231 Dictionaries, 154–156 Dihydrotestosterone, 360 Dinosaurs, cancer in, 5 Dioscorides, 8 Direct benefit, 17 Disclosure, 18 Discrimination, in model validation, 195 Discussion, in data analysis and reporting of results, 187 Disease-free survival (DFS), 39–40, 209 Disease stabilization (SD), 30 DLT. See Dose limiting toxicity (DLT) DLT/dose reduction score (SCORE), 62 DLT rate (θ), 63 DNA methyltransferase I inhibitors (DNMTi), 32 Doll, Richard, 8
Dollars per life-year, 280 Dollars per QALYs, 280–281 Dose escalation schema, 125, 125t Dose-escalation trial, 36 Dose limiting toxicity (DLT), 57–58, 125, 349 Dose-toxicity curve (DTC), 350f, 349 Dosing delays, 125–126 Dosing modification, 125–126 Double placebos, 200 Drug administration schedule, 124t Drug approval process INDs, 151–152 NDAs, 152 Drug discovery and development advances in, 22–23 new approaches, 26 preclinical models in, 23–24 stages of, 22f Drug-drug interactions, 257–258 Drug labels, 314 Drummond, M. F., 277 DSMBs. See Data and safety monitoring boards (DSMBs) Dunnett-type procedure, 86 Duplicate screening, 288 Early Breast Cancer Trialists’ Cooperative Group (EBCTCG), 286 Eastern Cooperative Oncology Group (ECOG), 129 EBCTCG trial 2000, 209 Ebers, George, 5 ECOG’s E-3200 trial, 170 Efficacy cost estimation of, 280 data analysis and reporting, 185–187 Efficacy, monitoring, 164–166 Efron’s biased coin randomization, 345–346 EGFRs. See Epidermal growth factor receptors (EGFRs) Ehrlich, Paul, 9 Elderly population, 267 Elicitation of experts, 113 EMBASE, 287 Emergency use INDs, 300, 312 Endpoints correlation with biomarker, 217–218, 218f definition of, 35 overall survival as, 40–41 phase I trials, 36–37 phase II and III trials, 36–40 primary versus secondary, 35 progression- and disease-free survival as, 39–40 quality of life as, 41 question and objectives, 33 response rate as, 37–39 surrogate, 35–36 Enrichment design, 233–235, 235f Environmental factors, 190 EORTC. See European Organization for Research and Treatment of Cancer (EORTC)
EORTC-08923 trial, 170 Epidemiology, birth of, 7 Epidermal growth factor receptors (EGFRs), 243 Errors, data management, 116 Escalation with Overdose Control (EWOC), 350–351 Ethical issues phase I oncology trials, 16–18 tissue and data banking, 18 European Organization for Research and Treatment of Cancer (EORTC), 221, 265, 340 QLQ-C30 scoring manual, 261, 263 response criteria, 245, 249t Ewing, James, 7 Executive Committee of the Association of American Medical Colleges (AAMC), 317 Expectedness, 146 Expected sample size (EN), 46 Expedited reporting, 146 Experimental arm, choice of, 200 Experimental bias, 75 Experts, elicitation of, 113 Exploratory statistical methods, 195 External validation, prognostic model, 195 Extracorporeal membrane oxygenation (ECMO) trial, 80, 116, 346 5-fluorouracil, 9 Factorial design 2 × 2, analysis, 86–87 2 × 2, pitfalls of, 204–205 interim analysis, 169–170 multiple arm trials, 97–98 phase III clinical trials, 86–87 Factorial designs, pitfalls of, 204–205 Factorial trials, 94–95, 95t Failed pivotal trial, 233 False-positive rate. See Type I error rates Farber, Sidney, 9, 336 FDA. See U. S. Food and Drug Administration (FDA) FDA Form 1572, 323 Federal Express, 302 Federal Food, Drug, and Cosmetic Act (1938), 312 Federal Food, Drug, and Cosmetic Act (1962), 308 Federal regulations informed consents, 332 vulnerability, 15 Fellows, roles of, 325–326 18F-FDG PET, 240 integration with biomarkers, 360 response assessment criteria, 245–247 18F-FLT PET (3′-deoxy-3′-[18F]fluorothymidine, a thymidine analogue), 240 Fibonacci escalation scheme, 58–60, 59t Fibonacci sequence, 58 “Financial Disclosure by Clinical Investigators” (FDA), 300 Fixed effects pooling, 291 Fixed margin method, in noninferiority trials, 102–103
Fleming’s K-stage designs, 67–68 Folic acid, 9 Follicular Lymphoma International Prognostic Index (FLIPI), 180 Follow-up forms, 135, 136 Follow-up, measures of, 184, 184t Follow-up time, estimating, 182 Food and Drug Administration Amendments Act (FDAAA) of 2007, 312 Food and Drug Administration Modernization Act (FDAMA) of 1997, 312 Food effect, 257 Forest plot, 291 Fowler’s solution, 8 Frequentist properties, evaluation of, 113–114 Fully sequential designs, 69 Funnel plot, 295–296, 298f 70-gene signature, 220, 236–237 Galen, 5–6 GCP. See Good Clinical Practice (GCP) Gehan’s two-stage designs, 66–67 Gene expression profile, 224 Gene knockouts, 23 George, S. L., 88, 90 GEPARDUO trial, 171 Geriatric population, monitoring, 256–257 Germ-line testing, genetic, 129 GFRs. See Growth factor receptors (GFRs) Ghostwriting, 317 Gilman, Alfred, 9 Gleason sum, 193 Glioblastoma multiforme (GBM), 264–265 Glomerular filtration rate (GFR), 257 Glucose metabolism, 242 GOG-0165 trial, 169 Golub’s weighted voting method, 231 Good Clinical Practice (GCP), 305–306, 339 Goodman, Louis S., 9 Goodman’s modified CRM, 350 Graham, Evarts, 8 Graunt, John, 7 Green, S. B., 2 Greenwood formula, 51, 53 Group C INDs, 312 Growth factor receptors (GFRs), 271 “Guidance for Industry—E6 Good Clinical Practice: Consolidated Guidance” (ICH), 306 “Guidance for Industry: IND Exemptions for Studies of Lawfully Marketed Drug or Biological Products for the Treatments of Cancer” (FDA), 182, 309 H0, 44, 45 H1, 44, 45 Halabi nomogram, 180 Harrell, F., Jr., 195 Harvey, William, 7 Haseman, J. K., 88
Haybittle-Peto monitoring boundary, 164 Hazard function, 193 Hazard ratios, 88–89, 191t multiple arm trials, 96 noninferiority trials, 102–103 Headings, in informed consents, 329–330 Health-related quality of life (HRQOL) analysis of trials, 262–265 challenging trials, 264 designing trials, 261–262, 264t examples of successful trials, 264–265 reporting and evaluating published studies, 265 Hepatic arterial infusion (HAI), 185 Hepatic impairment clinical trials in populations with, 272–274 drugs, 276t pharmacokinetic and pharmacodynamic monitoring, 254–255 Herson’s method, 69 Heterogeneity, 294 High-quality studies, in meta-analysis, 295 Hill, Austin Bradford, 8 Hill, John, 8 Hippocrates, 5, 7 Histone deacetylase inhibitors (HDACi), 32 Historical controls, pitfalls of historical information from the medical literature, 197–198 specific historical control groups, 198–199 Hochberg method, 96 Hodgkin, Thomas, 6 Hollow fiber technology, 24 Hormone replacement therapy (HRT), 180 Host-related factors, prognostic, 190 Hounsfield units, 360 HRQOL. See Health-related quality of life (HRQOL) HRQOL forms, 263 HRQOL scores, 264 Human subjects, safety of, 14 Human tumor xenografts, 24–26 Hybrid designs, 103–104 Hypothesis testing framework, 65–66 IDE exemption, 305 IHP. See International Harmonization Project (IHP) Imaging technology, advances in, 25–26 Immunoassays, 22–23 “Include it all” approach, 128 Incomplete response. See Stable disease Incremental cost-effectiveness ratios, 281 Independent data monitoring committees, role of, 171–172 Independently prognostic factors, 236–237 Indirect benefit, 17 Individual level association, biomarker and clinical end point, 217 INDs. See Investigational new drug applications (INDs) Industry collaborators, role of, 316
Industry-sponsored trials, 316 Information fraction, 164 Informed consents, 129 elements of, 14t exceptions, 332 information to be included in, 332t NCI checklist, 331t NCI template, 331 ongoing nature of consent, and obligation to share new information, 13t purpose and rationale, 327–328 respect for persons and, 12–13 wordings, 333t written form of, 328–331 Initial area under the contrast agent concentration-time curve (IAUC), 242 Institutional Review Boards (IRBs), 126, 181, 315 approval issues, 16, 16t future issues, 19 Instrumental ADL (IADL), 268 Intention-to-treat (ITT) analysis, 76, 104, 106, 182 Interim analysis, 110 data monitoring committees, role of, 171–172 lack of benefit or futility, 166–169, 167t, 168f, 169f multi-arm and factorial designs, 169–170 noninferiority trials, 171 potential costs of, 172–175 specifying interim analysis plan, 163–164 superiority or efficacy monitoring, 164–166, 164t, 165f, 166f Internal validation, 195, 232 International Committee of Medical Journal Editors (ICMJE), 317 International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), 102, 141, 153, 179 International Conference on Harmonization Tripartite (ICH-GCP), 328 International Harmonization Project (IHP), 245, 248t International Uniform Criteria for Multiple Myeloma, 127 International Working Group (IWG), 245 Interpretation, noninferiority trial, 107–108 Interquartile range (IQR), 346 Interstate Commerce Act, 309 Interstitial lung disease (ILD), 158 Intertumor heterogeneity, 227 Introduction, 180 Investigational agents, 126, 311–312 Investigational new drug applications (INDs), 21, 143, 151–152 content and corresponding notes, 305t exemption, 300–302 FDA’s responsibilities, 309–310 review of, 310–311 Safety Report, 143 types of, 300
Investigational study, components of abbreviations, 121 background, 122 definitions, 121 protocol schema or synopsis, 121 study objectives, 122 title page, 121 Investigator-initiated-grants (R01s), 339 Investigator-initiated INDs, 300, 316 conducting multisite study, 129 preparing and filing, 302–303 Investigators, role of AE reporting, 143 reviewing follow-up forms, 136 reviewing INDs, 311, 316 In vitro studies, drug-drug interaction, 257 IRBs. See Institutional review boards (IRBs) Irinotecan, 181 ISPOR standards, cost effectiveness analysis, 277 IWG. See International Working Group (IWG) Joint modeling toxicity, 115 Journal of Biopharmaceutical Statistics, 344 Journal of Clinical Oncology (JCO), 265 Justice, 15–16 Kadane, J. B., 112 Kalish, L. A., 77 Kaplan-Meier curves, 51, 201f Kaplan-Meier method, 39, 53, 54t, 55f, 182 Karkinoma, meaning of, 5 Kefauver-Harris Amendments to the Food Drug and Cosmetic Act (1962), 312 Key data elements, 131 Krailo, M. D., 77 KTrans, 242, 247 L Lachin-Foulkes adjustment, 104 Lack of benefit, interim analysis, 166–168, 167t, 168f, 169f Lakatos, E., 89 Languages, in contracts, 317–318 LARs. See Legally authorized representatives (LARs) Lasker, Mary, 336 Lead time, 353 Legally authorized representatives (LARs), 13, 332 Lenalidomide, 105 Leukemia, origin of term, 6 Life expectancy, 279, 281t Lind, James, 8 Lindskog, Gustav, 9 Linear discriminants, defined, 231 Lissauer, Heinrich, 8 Local tumor registries, 132, 135 Logistic regression model, 192 Log [-log S(t)] versus time model, 193 Log-rank tests, 51, 89, 103, 206
Longitudinal modeling, 263 Loss function, 351 Lost to follow-up, 135 Lung cancer, and tobacco consumption, 8 Lymphoma, origin of term, 6 6-Mercaptopurine (6-MP), 9 Magnetic resonance imaging (MRI), 240, 360 Major deficiencies, 133 Manual for the Completion of the NCI/DCTD/CTEP Clinical Trials Monitoring Service, 127 Manual of operations, 137–138 Manual of procedures, 129 Manuscripts, structure and content of abstract, 180 authorship, 180 discussion, 187 introduction, 180 methods, 180–182 results, 182–187 study references, 187 title, 180 Maximum tolerated dose (MTD), 31, 36, 57, 58, 125, 203, 358 M. D. Anderson Cancer Center, 109, 114 MDCT. See Multidetector CT (MDCT) Mean difference, 48 Measurement of effect, 127 Mechanisms of resistance, 358–359 MedDRA advantage of, 154 CTCAE versus, 156 disadvantages of, 154–155 Median survival time, 96 Medical Dictionary for Regulatory Activities (MedDRA), 141–142, 154–155 Medical journals, 317 Medical research. See Clinical trials MEDLINE, 287 MedWatch Form, 143 Meetings, importance of, 114 Melan chole (black bile), 6 Meta-analysis, 285 aggregate data, 292 assessing for publication bias, 295–296 assessing heterogeneity, 294 assessing quality of constituent studies, 295 determining which studies to include, 287–288 extracting relevant data, 288–289, 291t, 292t flow diagram, 290f identifying studies, 286–287 patient-level data, 292–294, 295f pooling data, 289–292, 293t posing research question, 286 shared limitations, 294 Metabonomics/metabolomics, 23 Meta-regression, 294–295 Methodologies, 16
Methods, data analysis and reporting statistical methods, 181–182 study conduct, 181 study population, 180–181 treatment and administration, 181 Methotrexate, 9 Microarrays, 22 MINDACT trial (Microarray in Node-negative Disease May Avoid Chemotherapy Trial), 221, 236–237 Minimal clinical important difference (MCID), 262 Minimax design. See Simon’s two-stage design Minimization randomization, 80–81, 81f Minimum acceptable response rate, 69 Minor deficiencies, 133 Model validation, prognostic factors, 195–196 Modification of Diet in Renal Disease (aMDRD) formula, 270–271, 272t Modified biomarker-based strategy design, 222f, 223 Molecularly targeted therapy, 226–227 Monotonic relationship, in multiple arm trial, 93, 94t Morgagni, Giovanni Battista, 6 MRI. See Magnetic resonance imaging (MRI) MTD. See Maximum tolerated dose (MTD) Multi-arm adaptive designs Bayesian stopping rules in phase III studies, 352–353 sample size adjustment in phase III studies, 353 seamless phase II/III studies, 353 Multi-arm designs, 169–170 Multicenter guidelines, 127 Multidetector CT (MDCT), 240 Multimodality treatment, 9 Multiphase imaging protocols, 240 Multiple arm trials challenges in, 98 design considerations, 95–98 sample size determination, 96–98 selection and screening design, 95 type I error rate, 95–96 types of, 93–95 Multiple hypothesis testing, 185 Multiple training-test partitions, 232 Multistage designs, phase II trials Fleming’s K-stage designs, 67–68 Gehan’s two-stage designs, 66–67 Simon’s optimal two-stage designs, 68–69 Multivariable analysis, 185, 186–187 Multivariate analysis, 236 Multivariate-gene expression-based classifiers, 231–233 Myelosuppression, 257 Narrative description, in AE reports, 148 Narrowness, defined, 66 Narrow therapeutic range (NTR) drugs, 258 National Cancer Institute of Canada Clinical Trials Group trial MA-17, 175
National Cancer Institute of Canada (NCIC), 264 National Cancer Institute of Canada Clinical Trials Group (NCIC-CTG), 340 National Cancer Research Institute, Genoa, Italy, 286 “National Commission for Protection of Human Subjects of Biomedical and Behavioral Research,” 12 National Institutes of Health (NIH), 119 DSMBs, 15 ethical issues with tissue and data banking, 18 registration of clinical trials in, 19 National Research Act, U.S., 12 National Surgical Adjuvant Breast and Bowel Project trial NSABP B-14, 173 National Surgical Adjuvant Breast and Bowel Project trial NSABP B-31, 165 NCEs. See New chemical entities (NCEs) NCI/CTEP AE reporting guidelines, 143 NDAs. See New drug applications (NDAs) Negotiation, contracts, 318 Neoadjuvant chemotherapy, 384 Neoplastic Diseases (Ewing), 7 New chemical entities (NCEs), 26 New drug applications (NDAs), 143 accelerated approval, 312–313 approval of oncology drugs, 313 drug approval process, 152 legal basis behind drug approval requirements, 312 Newton-Raphson procedure, 96 NHL. See Non-Hodgkin’s lymphoma (NHL) NIH. See National Institutes of Health (NIH) Nitrogen mustard, cancer caused by, 9 Nitrosoureas, 274 Noncomparative trials, 201 Non-English speaking individuals, 13 Non-Hodgkin’s lymphoma (NHL), 9 Noninferiority margin (Δ), 101, 102f Noninferiority trials, 171 analysis, 106–107 interpretation, 107–108 monitoring, 105–106 statistical design, 101–105 Non-NTR drugs, 258 Non-overall-survival treatment effects, 175 Non-small-cell lung cancer (NSCLC), 158, 265 Nonstandard-of-care tests, 318 Nonsteroidal anti-inflammatory drugs (NSAIDs), 154 Nontarget lesions, defined, 37 North Central Cancer Treatment Group (NCCTG) trials, 165–166, 181 Northrop Grumman Corporation, 154 Notice of Claimed Investigational Exemption for a New Drug, defined, 300 NSAIDs. See Nonsteroidal anti-inflammatory drugs (NSAIDs) Number at risk, 53 Number of deaths in treatment, 53 Nuremberg Code, 11 Nursing, development in, 7
Objective response rate (ORR), 243 Objectives, clinical trials avoiding ambiguity, 30–31 checklist for, 30 choosing feasible, 31 choosing pharmaceutical company and FDA, 31–32 primary versus secondary, 31 O’Brien-Fleming boundary, 164 Office for Human Research Protections (OHRP), 18 Office of New Drugs (OND), 307 Office of Oncology Drug Products (OODP), 307 Office of Surveillance and Epidemiology (OSE), 307 Offices of Drug Evaluation (ODEs), 307 Off-label use, drugs, 151 Older patients, designing clinical trials barriers to trials participation, 269–270 eligibility criteria, 268 major issues to consider in, 270t statistical considerations, 269 treatment intent, 268 Oncologic Drugs Advisory Committee (ODAC), 40, 313 Oncology advances in pathology, 6 modern era, 6–7 origins of, 5–6 Oncology drugs, approval of, 313 Oncotype DX recurrence index, 236 One-stage design, 46 On-study form, screening and baseline patient information, 132 Open-ended questions, 137 Operating characteristics, 113–114 Optimal designs, 45–46. See also Simon’s two-stage design O’Quigley’s CRM, 349 Oral chemotherapy agents, 257 Organ impairment, clinical trials in hepatic impairment, 272–274 renal impairment, 270–272 ORR. See Objective response rate (ORR) Orthotopic models, 25 Osler, Sir William, 6 Outcome-by-outcome analysis, 210–211 Outcomes. See also Endpoints measurement of, 127, 135–136 summarization of, 280–281, 282–283, 284t Overall survival (OS), 40–41 Overfitting/overlearning, in model validation, 195 Overly minimalist approach, 128 Package insert, 152 Paget, Stephen, 6 Partial response (PR), 30, 37, 244, 246t Participants/volunteers, 12 Patient characteristics, 132 data analysis and reporting, 182–185, 183f, 184f, 184t Patient demographics, 132 Patient flow diagram, 182 Patient-level meta-analysis, 292–294, 295f
Patient/pharmacogenomic factors, 359–360 Patient population, choosing, 32–33 Patient progression-free survival (PFS), 37 Patient-reported outcome (PRO) methodology, 157 questionnaires, 134 Patient selection, 123 PDs. See Pharmacodynamics (PDs) Pediatric Oncology Group trial POG-9006, 174, 174f Pediatric population, monitoring, 256 Pediatric Research Equity Act of 2003, 256 Percent retention method, in noninferiority trials, 103 Permuted block design, 78–79, 78t, 79t Permutt, T., 207 Per-protocol (PP) population, 106 PET. See Positron emission tomography (PET) Peto, R., 173 p53 trial, 223 Pharmaceutical companies. See also Sponsors choosing objectives, 31–32 monitoring toxicity, 151–152 Pharmaceutical information, 126–127 Pharmaceutical Management Branch (PMB), NCI, 137 Pharmaceutical Research and Manufacturers of America (PhRMA), 344 Pharmacodynamics (PDs). See also PK-PD relationships defined, 251 phase II trials, 252–253 Pharmacoeconomics, 277 Pharmacokinetic analysis, 37 Pharmacokinetic parameters, 251, 255t Pharmacokinetics (PKs), 359. See also PK-PD relationships biomarkers and, 253 defined, 251 phase I trials, 251–252 Phase I trials AE collection methodology for patients on a clinical trial, 157–158, 157f, 158f, 158t endpoints for, 36–37 ethical issues with, 16–18 molecularly targeted therapy, 228 oncology dose-finding study, 111 pharmacokinetic monitoring, 251–252 pitfalls, 203–204 primary objective, 31 secondary objectives, 31 Phase I trials, design of, 57–58 designs that target toxicity rate other than 0.2 or 0.25, 60–61 traditional or standard or 3 + 3 design, 58–60 trials with long follow-up, 62–64 trials with ordinal toxicity outcome, 61–62, 61t Phase I/II design, pitfalls of, 203–204 Phase II adaptive randomization trials, 111 Phase II trials, 158–159 endpoints, 36–40 molecularly targeted therapy, studies of, 228–229 pharmacodynamic monitoring in, 252–253
pitfalls, 203–204 primary and secondary objectives, 31 Phase II trials, design of bivariate designs, 69–70 Fleming’s K-stage designs, 67–68 fully sequential designs, 69 Gehan’s two-stage designs, 66–67 multistage designs, 66–69 phase II design using time-to-event endpoints, 70 randomized designs, 70–71 Simon’s optimal two-stage designs, 68–69 single-stage designs, 65–66 Phase II/III design, pitfalls of, 204 Phase III ECOG trial E1A06, 105 Phase III trials, 1–2 endpoints, 36–40 pharmacokinetic monitoring in, 253–254 primary and secondary objectives, 31 Phase III trials, design of comparing success rates, 87–88 comparing survival distributions, 88–91 factorial design, 86–87, 87t predictive classifiers in, 233–236, 235f testing equality among k treatment arms, 85–86, 86t testing hypotheses, 83–84 two or more experimental arms and a control arm, 86 unequal sample sizes, 85 unknown variances, 84–85 Phase IV trials, pharmacokinetic and pharmacodynamic monitoring in, 254–258 Physicians, 148, 269. See also Principal investigators (PIs) PIs. See Principal investigators (PIs) PK-PD relationships, 251, 254f PKs. See Pharmacokinetics (PKs) Pocock boundary, 166 Pocock, S. J., 80 Pooling data, 289–292, 293t Poor accrual, 185–186 Population PKs, 253–254 Populations per-protocol, 106 targeted patients, 32–33 Positron emission tomography (PET), 240, 360 Posterior distribution, 347 Postmarketing safety review, 314 Potential pitfalls, 114 Power, 47, 47t, 49, 49t Power prior, 112 Pr[DLT], 62 Pre and post microvessel density data, 48, 48t Preclinical efficacy testing, 23 Preclinical evaluation process, 23 Predicted futility, 352 Predictive accuracy model validation, 195 statistical measures of, 220 Predictive biomarkers. See also Predictive classifiers definition, 215 distinctions with prognostic biomarkers, 216f, 217
identification and validation of, 220–221, 220t molecularly targeted therapy, 229–230 statistical criteria for, 216–217 trial designs, 223 uses of, 219 Predictive classifiers, 230 multivariate-gene expression-based, 231–233 prospective phase III clinical trials, 233–236 Predictive probability (PP), 69, 352 Pregnancy, pharmacokinetic and pharmacodynamic monitoring of, 256 Prentice, R. L., 217 Principal investigators (PIs), role of, 121, 322–323, 324t, 325t, 339 data collection, 131 toxicity assessment, 134–135 “Principles for Protecting Integrity in the Conduct and Reporting of Clinical Trials” (AAMC), 317 Prior historical information, 112–113 Probability density function, 88 Probability of response, 347 Probability of terminating the trial early (PET), 46 Product description, 126 Prognostic biomarkers definition, 215 identification and validation of, 220–221, 220t molecularly targeted therapy, 236–237 predictive biomarkers versus, 216f, 217 statistical criteria for, 216–217 trial designs, 221–223 uses of, 219 Prognostic factors common problems with modeling, 194–195 identification of, 192–195 importance of, 189–190 model validation, 195–196 relationship between, 190f study design, 191–192 types of, 190–191 Prognostic models, 189 problems with, 194–195 validating, 195–196 Program projects (P01s), 339 Progression-free survival (PFS), 39–40, 70 Progressive disease (PD), 37, 244, 246t Proportional hazards model, 89, 192–193, 194f, 206. See also Log-rank tests Proportional hazards assumption, 193 Prostate Cancer Working Group 2 recommendations, 33 Prostate-specific antigens (PSAs), 132, 218 Protection of Human Subjects, 12, 129 Protections, and justice, 15 Proteomics, 22 Protocol coordinator, 120 Protocols development of, 119–120 keeping track of the versions of, 121 requirements, 120t Protocol schema or synopsis, 121, 122t
PSA assessment, 136 PSA doubling time (PSADT), 361 PSAs. See Prostate-specific antigens (PSAs) Pseudo-data, 350 Publication bias, 295–296 Public Health Service (PHS) Act, 312 Pure Food and Drugs Act (1906), 312 p-value, 47, 164 grade 4 and 5 hematologic toxicity, 50 pre and post microvessel density data, 48 statistical validation of prognostic biomarker, 220 toxicity rates, 49 Quality assessment, meta-analysis, 295 Quality Assurance Review Center (QARC), 339 Quality control, 136–138 Quality of life (QOL), 37, 41 Questions closed-ended, 136 fundamental components of formulating, 29–30 open-ended, 137 Radiation Therapy Oncology Group, 156, 165 Radioactivity, discovery of, 7 Radiologic Physics Center (RPC), 339 Random effects pooling, 291 Randomization adaptive, 79–80, 115 balanced versus unbalanced, 77 Bayesian designs and, 110 bias, 73–74 biased coin, 78 deciding whether to randomize, 75–76 goals of, 74–75, 74t intention-to-treat analysis, 76 minimization, 80–81, 81f permuted block design, 78–79, 78t, 79t pitfalls in testing, 206–207, 206f replacement, 78 simple, 77–78, 77t stratified, 80 testing, 205–206 timing of, 76 Randomized clinical trials (RCTs), 8, 261 Randomized controls blinded control, 199–200 unblinded control, 199, 199f Randomized trials cost-effectiveness analysis in, 278, 280t phase II trials, design of, 70–71 phase II trials, pitfalls of, 201–202, 202t RCTs. See Randomized clinical trials (RCTs) Real-time updating, 110–111 Reasonable person, definition of, 13 Receiver operating characteristic (ROC) curves, 195, 220 RECIST 1.0, 136 RECIST 1.1, 136, 138 RECIST Criteria, 37, 127, 244–245, 246f Recruitment of subjects, 16
Recurrence Score (RS), 104 Recursive partitioning, 192–193 References, 121, 129 Region of interest (ROI), 247–248 Registration form, screening and baseline patient information, 132 Registration procedures, centralized institutional, 123, 123t Regression proportional hazards (PH) model, 193 Regulatory definitions, 299–300 Regulatory framework, 299 REMARK guidelines, 191, 221, 236 Renal impairment clinical trials in populations with, 270 pharmacokinetic and pharmacodynamic monitoring, 254 Renal Insufficiency and Anticancer Medications (IRMA) study, 270 Replacement randomization, 78 Replication, defined, 73 Report of the National Commission for the Protection of Human Subjects of Research, 12 Required duration of study, 89–91, 90t Required number of events, 88–89 Researcher-subject relationship, 12 Research groups, role of, 143 Research pharmacists, role of, 325, 328t Research study team, role of, 321–322 Residents, role of, 325–326 Response-adaptive designs, 346–349 Response assessment criteria, 243–244 DCE-MRI Response Criteria, 247 18F-FDG PET Response Criteria, 245–247 IWG and IHP Criteria, 245 RECIST Criteria, 244–245 WHO Criteria, 244 Response evaluable patients, 207 Response Evaluation Criteria in Solid Tumors (RECIST), 30, 244 Response rate, 37–39, 38t Resubstitution, defined, 232 Retrospective correlation, 237 Reverse transcription-polymerase chain reaction, 22 Risk assessment models, 359–360 Risk/benefit analysis, 14, 17 Risk evaluation and mitigation strategy (REMS), 314 RNA interference technology, 23 Roles and responsibilities coinvestigators, 323–324 CRAs, 325 CRNs, 324–325, 326t DMs, 325, 327t fellows, 325–326 PIs, 322–323, 324t, 325t research pharmacists, 325, 328t research study team, 322–323 Röntgen, Wilhelm Conrad, 7 Rothwell, P., 208 Route of administration, 126 R scripts, 111
SAEs. See Serious adverse events (SAEs)
Safety review. See Postmarketing safety review
Sample size, determination of, 47, 47t, 50t
  adjustment, 353
  comparison of survival curves, 96–97
  factorial design, 97–98
SAS macros, 111
Schoenfeld, D. A., 89, 193
Screening
  baseline patient information, 132–133
  selection of designs, 95
sctweb.org, 3
Scurvy, treatment of, 7–8
Seamless phase II/III studies, 353
Secondary end points, 174–175
Secretary of Health and Human Services (HHS), 307
Selection bias, 75
“Select one” approach, 136–137
Sensitivity analysis, 281, 283, 283t
Sentinel Initiative (2008) of FDA, 314
Serious adverse events (SAEs)
  definitions of, 143, 153t
  reporting of, 133–134, 147
Serum creatinine measurement, 270, 272t
Shared limitations, meta-analysis, 294
Shrunken centroid classification, 231
Simon, R., 79, 80
Simon’s two-stage design, 32, 45–46
  multiple-arm, 68–69
  single-arm, 351–352, 354t
Simple randomization, 77–78, 77t
Simple stratification, 80
Single-arm adaptive designs, 351–352
Single-stage designs, phase II trials
  confidence interval or precision framework, 66
  hypothesis testing framework, 65–66
Single-treatment phase II studies, 111–112
Site audits, 138
Sloan Kettering Cancer Center, 156
Smith, Edwin, 5
Smoking, and lung cancer, 8
Social Security Death Index, 132, 135
Society of Clinical Trials, 3
Software/tools
  BRB-ArrayTools, 232
  computerized collaborative clinical trial writing systems, 120
  electronic data capture and clinical data management, 138
  real-time updating, 110–111
Solution preparation, 126
Somers’ D rank correlation, 195
Source documentation, 131, 138
Southwest Oncology Group
  SWOG-8203 trial, 170
  SWOG-8412 trial, 171, 172f
  SWOG-9701 trial, 174, 174f
Specialized Programs of Research Excellence (SPOREs), 339
Special populations, clinical trials in, 274–275
Special research studies, 126–127
Spending function approach, 164
Split-sample validation, 195, 232
Sponsor-investigators, responsibilities of, 305, 307t
Sponsors
  definition of, 300
  responsibilities of INDs, 143, 311
Sposto, R., 77
Stable disease, 37, 244, 246t
Staff, and form designs, 137
Stakeholders, 278–279, 282
Standard deviation (s), 47
Standard error (SE), 67
Standardized effect size, 84
Standard uptake values (SUVs), 242
Statistical and Data Management Center (SDMC), 338
Statistical considerations, writing clinical trials, 129
Statistical design, noninferiority trials, 101–102
Statistical methods, 181–182
Statisticians, role of, 172, 311
Statistics, use of, 7
Step-up procedures, 96
Storage requirements, 126
Stratification, 80–81, 81f, 82f
Stratified analysis, 186
Stratified randomized design, 222f, 223
Streptomycin, 8
Structural issues, in trial designs, 200–201
Study calendar, 127, 128t
Study conduct, 181
Study designs, examples of, 104–105
Study drugs, 316
Study eligibility checklist, 132
Study objectives, 122
Study population, 180–181
Study references, 187
Sub-distribution approach, 209
Subgroup analysis, 294
Subinvestigators, role of, 323–324
Subset analysis, pitfalls of, 208–209
Success rates, comparing, 87–88, 88t
Summary statistics, in data analysis, 263
Superiority, monitoring, 164–166
Superiority trials, 107
Support vector machines (SVMs), 231
Supreme Court, U.S., 313
Surrogate biomarkers, 30
  definition, 215
  identification and validation of, 220t, 221
  imaging modalities, 241–242
  statistical criteria for, 217–218
  uses of, 219
Surrogate endpoints, 32, 35–36
Surveillance, Epidemiology, and End Results (SEER) database, 269
Survival distributions, comparing, 88–91
Survival event, 182
Survival function, 53
Survival time, 184f
Survivor function, 193
SUVs. See Standard uptake values (SUVs)
SVMs. See Support vector machines (SVMs)
Systematic errors. See Bias
2 × 2 factorial design, 86–87, 204–205
3 + 3 design, 58–60
Table of contents, 121
TAILORx phase III trial, 104, 222–223
Target definition and validation, 357–358
Target lesions, identification of, 242–243
Target level (Γ), 58
TARGET trial, 175
Taves method, 80–81
Taxanes, 9
Templates, 119
Temporal relationship, 146
Test criterion, 47
Testing after randomization, pitfalls in, 205–207, 206f
Testing hypotheses, design of, 83–84
Test negative/positive patients, trial designs, 235–236, 235f
Therapeutic index (T.I.), 151
Therapeutic optimists, 17
Time frame models, 279, 282
Time-to-event endpoints, 39, 70, 192–193
Time-to-progression (TTP), 243
Title pages, 121
Titles, 180, 288
Tobacco, and cancer, 8
Toxicity
  assessment of, 133–135
  definition of, 152–153
  joint modeling, 115
  monitoring, 152t
  primary endpoint, 36
Toxicity rates, 47
Toxicology evaluation, 26
Toxicology reviewers, role of, 310–311
Toxins, cancer caused by, 9
Trade-offs, 278, 282
Translational research, 227
Treatment-by-cycle forms, 133
Treatment compliance, 133
Treatment effect modifier, 216
Treatment INDs, 300, 312
Treatment method, 181
Treatment plan, 123–125
  dose escalation schema for phase I studies, 125t
  typical drug administration schedule, 124t
Treatment Protocols, 312
Trial-level associations, 217–218
Trials of War Criminals before the Nuremberg Military Tribunals, 11
Trim and fill techniques, 296
T-test method, 181
TTP. See Time-to-progression (TTP)
Tumor cells, classification of, 7
Tumor marker utility grading system (TMUGS), 236
Tumor-related factors, prognostic, 190
Tumor response rate, 37
Tumor shrinkage, 39
Type I error rates, 44
  multiple arm trials, 95–96, 96t
  noninferiority trials, 103
  phase III trials, 84
Type II error rates, 44
  noninferiority trials, 103
  phase III trials, 84
Unblinded control, 199, 199f
Unexpected adverse drug reaction, defined, 153t
“Uniform Requirements for Manuscripts Submitted to Biomedical Journals,” 319
Unknown variances, 84–85
U.S. Code of Federal Regulations (CFR), 299, 308
U.S. Food and Drug Administration (FDA), 1, 40, 251
  approved agents, 126
  barriers to clinical trials participation, 269–270
  basis for regulatory decisions, 310t
  choosing objectives, 31–32
  decision making, 308–309
  definition of clinical trial, 119
  drug approval process, 33, 151–152
  Good Clinical Practice Program, 299
  Guidance for Industry document, 313
  guidelines for clinical trials in patients with organ impairment, 274t
  noninferiority trials and, 102
  organizational structure, 307
  premarket and postmarket safety review, 313–314
  regulations for informed consents, 327
  reporting of adverse events, 143
  responsibilities regarding INDs, 309–310
  review process for INDs, 310
U.S. National Cancer Institute (NCI), 17
  Adverse Event Expedited Reporting System (AdEERS), 134
  barriers to clinical trials participation, 269
  CAEPR (Comprehensive Adverse Event and Potential Risks) list, 134
  Cancer Biomedical Informatics Grid (caBIG), 131
  Cancer Imaging Program, 245
  Cancer Therapy Evaluation Program (CTEP), 119
  Central Institute Review Board (CIRB) of, 19
  clinical trials treatment sites (2000), 338f
  collaboration with cooperative groups, 337–338
  Common Data Elements (CDE), 131
  Common Terminology Criteria (CTC), 141
  Common Terminology Criteria for Adverse Events (CTCAE), 57
  consent documents, 331
  Cooperative Groups, 172
  Developmental Therapeutics Program (DTP) of, 23
  Group C INDs, 312
  Guidelines for Treatment Regimen Expression and Nomenclature, 123–124
  hollow fiber technology, 24
  human tumor cell line (60-Cell) screen, 23–24
  Pharmaceutical Management Branch (PMB), 137
  RECIST Criteria, 244
  recognizing contract negotiation, 318
  Toxicology and Pharmacology Branch, 26
U.S. State Department, 9
Validation
  biomarkers, 219–221
  prognostic model (see Model validation, prognostic factors)
Valid biomarkers, defined, 252, 255–258t
Value in Health, 277
Vanishing denominator, pitfalls of, 207–208
Variable selection, 195
Variations, tumor measurement
  CT scans, 247
  DCE-MRI, 248
  18F-FDG PET scans, 247
Vascular endothelial growth factor (VEGF) pathway, 243
Version number, 121
Version of protocol, 121
Vesalius, Andreas, 6
Vinca alkaloids, 9
Virchow, Rudolf, 6
Voluntariness, fundamental concept of, 11
Vulnerability, concept of, 15
Waldeyer-Hartz, Wilhelm von, 6
WHO (World Health Organization) Criteria, response assessment, 244, 246t
Whole-genome transcript-expression profiling, 232
Wilcoxon method, 181
Working Group of the National Kidney Foundation, 271
World Medical Association, 11, 327
Writing clinical studies, 120–121
Writing investigator-initiated trials, 119–120
  appendices, 129–130
  conducting investigator-initiated multisite study, 129
  critical components of an investigational study, 121–122
  data and safety monitoring plan, 126
  data management, 127–128
  dosing delays and dosing modification, 125–126
  informed consent, 129
  measurement of effect, 127
  patient selection, 123
  pharmaceutical information, 126–127
  preparing to write a clinical study, 120–121
  references, 129
  section on protection of human subjects, 129
  statistical considerations, 129
  study calendar, 127
  treatment plan, 123–125
Writing protocols, 3
Written consent document, 328–331
Wynder, Ernst, 8
X-rays, discovery of, 7
Zelen, M., 75, 79
Zubrod, C. Gordon, 9
Z-value, 164