ISBN: 0-8247-9025-1

This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright 2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA
To my mother, who has been an inspiration to me for 55 years, and to others for 91.
The idea for this book came to me from Graham Garrett of Marcel Dekker, Inc., who, tragically, passed away while the book was in progress. I hope that the result lives up to his expectations. All royalties for the editor will go to Cancer Research And Biostatistics, a nonprofit corporation whose mission is to help conquer cancer through the application of biostatistical principles and data management methods.
Preface
This book is a compendium of statistical approaches to the problems facing those trying to make progress against cancer. As such, the focus is on cancer clinical trials, although several of the contributions also apply to observational studies, and many of the chapters generalize beyond cancer research. This field is approximately 50 years old, and it has been at least 15 years since such a summary appeared; because much progress has been made in recent decades, the time is propitious for this book. The intended audience is primarily but not exclusively statisticians working in cancer research; it is hoped that oncologists might benefit as well from reading this book. The book has six sections:

1. Phase I Trials. This area has moved from art to science in the last decade, thanks largely to the contributors to this book.

2. Phase II Trials. Recent advances beyond the widely accepted two-stage design based on tumor response include designs based on toxicity and response, as well as selection designs meant to guide decisions about which of many treatments to move to Phase III trials.

3. Phase III Trials. A comprehensive treatment is provided of sample size, along with discussions of multiarm trials, equivalence trials, and early stopping.

4. Complementary Outcomes. Quality of life and cost of treatment have become increasingly important, but they pose challenging analytical problems, as the chapters in this section describe.

5. Prognostic Factors and Exploratory Analysis. The statistical field of survival analysis has had its main impetus from cancer research, and the chapters in this section demonstrate the breadth and depth of activity in this field today.

6. Interpreting Clinical Trials. This section provides lessons, never outdated and seemingly always in need of repeating, on what can and cannot be concluded from single or multiple clinical trials.
I would like to thank all the contributors to this volume.

John Crowley
Contents

Preface
Contributors

PHASE I TRIALS

1. Overview of Phase I Trials (Lutz Edler)
2. Dose-Finding Designs Using Continual Reassessment Method (John O'Quigley)
3. Choosing a Phase I Design (Barry E. Storer)

PHASE II TRIALS

4. Overview of Phase II Clinical Trials (Stephanie Green)
5. Designs Based on Toxicity and Response (Gina R. Petroni and Mark R. Conaway)
6. Phase II Selection Designs (P. Y. Liu)

PHASE III TRIALS

7. Power and Sample Size for Phase III Clinical Trials of Survival (Jonathan J. Shuster)
8. Multiple Treatment Trials (Stephen L. George)
9. Factorial Designs with Time-to-Event End Points (Stephanie Green)
10. Therapeutic Equivalence Trials (Richard Simon)
11. Early Stopping of Cancer Clinical Trials (James J. Dignam, John Bryant, and H. Samuel Wieand)
12. Use of the Triangular Test in Sequential Clinical Trials (John Whitehead)

COMPLEMENTARY OUTCOMES

13. Design and Analysis Considerations for Complementary Outcomes (Bernard F. Cole)
14. Health-Related Quality-of-Life Outcomes (Benny C. Zee and David Osoba)
15. Statistical Analysis of Quality of Life (Andrea B. Troxel and Carol McMillen Moinpour)
16. Economic Analysis of Cancer Clinical Trials (Gary H. Lyman)

PROGNOSTIC FACTORS AND EXPLORATORY ANALYSIS

17. Prognostic Factor Studies (Martin Schumacher, Norbert Holländer, Guido Schwarzer, and Willi Sauerbrei)
18. Statistical Methods to Identify Prognostic Factors (Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger)
19. Explained Variation in Proportional Hazards Regression (John O'Quigley and Ronghui Xu)
20. Graphical Methods for Evaluating Covariate Effects in the Cox Model (Peter F. Thall and Elihu H. Estey)
21. Graphical Approaches to Exploring the Effects of Prognostic Factors on Survival (Peter D. Sasieni and Angela Winnett)
22. Tree-Based Methods for Prognostic Stratification (Michael LeBlanc)

INTERPRETING CLINICAL TRIALS

23. Problems in Interpreting Clinical Trials (Lillian L. Siu and Ian F. Tannock)
24. Commonly Misused Approaches in the Analysis of Cancer Clinical Trials (James R. Anderson)
25. Dose-Intensity Analysis (Joseph L. Pater)
26. Why Kaplan-Meier Fails and Cumulative Incidence Succeeds When Estimating Failure Probabilities in the Presence of Competing Risks (Ted A. Gooley, Wendy Leisenring, John Crowley, and Barry E. Storer)
27. Meta-Analysis (Luc Duchateau and Richard Sylvester)

Index
Contributors
James R. Anderson, Ph.D. Department of Preventive and Societal Medicine, University of Nebraska Medical Center, Omaha, Nebraska

Ursula Berger, Dipl.Stat. Institute for Medical Statistics and Epidemiology, Technical University of Munich, Munich, Germany

John Bryant, Ph.D. National Surgical Adjuvant Breast and Bowel Project, and Biostatistical Center, University of Pittsburgh, Pittsburgh, Pennsylvania

Bernard F. Cole, Ph.D. Department of Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire

Mark R. Conaway, Ph.D. Division of Biostatistics and Epidemiology, Department of Health Evaluation Sciences, University of Virginia, Charlottesville, Virginia

John Crowley, Ph.D. Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington

James J. Dignam, Ph.D. National Surgical Adjuvant Breast and Bowel Project and Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania, and Department of Health Sciences, University of Chicago, Chicago, Illinois

Luc Duchateau, Ph.D. EORTC Data Center, European Organization for Research and Treatment of Cancer, Brussels, Belgium
Lutz Edler, Ph.D. Biostatistics Unit, German Cancer Research Center, Heidelberg, Germany

Elihu H. Estey, M.D. Department of Leukemia, University of Texas M.D. Anderson Cancer Center, Houston, Texas

Stephen L. George, Ph.D. Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina

Pia Gerein, Dipl.Stat. Institute for Medical Statistics and Epidemiology, Technical University of Munich, Munich, Germany

Ted A. Gooley, Ph.D. Department of Clinical Statistics, Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington

Stephanie Green, Ph.D. Program in Biostatistics, Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington

Norbert Holländer, M.Sc. Department of Medical Biometry and Statistics, Institute of Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany

Michael LeBlanc, Ph.D. Program in Biostatistics, Fred Hutchinson Cancer Research Center, Seattle, Washington

Wendy Leisenring, Sc.D. Departments of Clinical Statistics and Biostatistics, Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington

P. Y. Liu, Ph.D. Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington

Gary H. Lyman, M.D., M.P.H., F.R.C.P.(Edin) Department of Medicine, Albany Medical College, and Department of Biometry and Statistics, State University of New York at Albany School of Public Health, Albany, New York

Carol McMillen Moinpour, Ph.D. Division of Public Health Sciences, Southwest Oncology Group Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington
Hjalmar Nekarda, Dr. Institute for Medical Statistics and Epidemiology, Technical University of Munich, Munich, Germany

John O'Quigley, Ph.D. Department of Mathematics, University of California–San Diego, La Jolla, California

David Osoba, B.Sc., M.D.(Alta), F.R.C.P.C. Quality of Life Consulting, West Vancouver, British Columbia, Canada

Joseph L. Pater, M.D., M.Sc., F.R.C.P.(C) NCIC Clinical Trials Group, Queen's University, Kingston, Ontario, Canada

Gina R. Petroni, Ph.D. Division of Biostatistics and Epidemiology, Department of Health Evaluation Sciences, University of Virginia, Charlottesville, Virginia

Peter D. Sasieni, Ph.D. Department of Mathematics, Statistics, and Epidemiology, Imperial Cancer Research Fund, London, England

Willi Sauerbrei, Ph.D. Department of Medical Biometry and Statistics, Institute of Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany

Martin Schumacher, Ph.D. Department of Medical Biometry and Statistics, Institute of Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany

Guido Schwarzer, M.Sc. Department of Medical Biometry and Statistics, Institute of Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany

Jonathan J. Shuster, Ph.D. Department of Statistics, University of Florida, Gainesville, Florida

Richard Simon Biometric Research Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland

Lillian L. Siu, M.D., F.R.C.P.(C) Department of Medical Oncology and Hematology, Princess Margaret Hospital, Toronto, Ontario, Canada

Barry E. Storer, Ph.D. Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
Richard Sylvester, Sc.D. EORTC Data Center, European Organization for Research and Treatment of Cancer, Brussels, Belgium

Ian F. Tannock, M.D., Ph.D., F.R.C.P.(C) Department of Medical Oncology and Hematology, Princess Margaret Hospital, Toronto, Ontario, Canada

Peter F. Thall, Ph.D. Department of Biostatistics, University of Texas M.D. Anderson Cancer Center, Houston, Texas

Andrea B. Troxel, Sc.D. Division of Biostatistics, Joseph L. Mailman School of Public Health, Columbia University, New York, New York

Kurt Ulm, Ph.D. Institute for Medical Statistics and Epidemiology, Technical University of Munich, Munich, Germany

John Whitehead, Ph.D. Medical and Pharmaceutical Statistics Research Unit, The University of Reading, Reading, England

H. Samuel Wieand, Ph.D. National Surgical Adjuvant Breast and Bowel Project and Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania

Angela Winnett, Ph.D.* Department of Epidemiology and Public Health, Imperial Cancer Research Fund, London, England

Ronghui Xu, Ph.D. Department of Biostatistics, Harvard School of Public Health and Dana-Farber Cancer Institute, Boston, Massachusetts

Benny C. Zee, Ph.D. Clinical Trials Group, National Cancer Institute of Canada, Kingston, Ontario, Canada
* Current affiliation: Imperial College School of Medicine, London, England.
1 Overview of Phase I Trials

Lutz Edler
German Cancer Research Center, Heidelberg, Germany
I. INTRODUCTION
The phase I clinical trial constitutes the research methodology for finding and establishing new and better treatments for human diseases. It is the first of the three phases (phase I-III trials) that became a "gold standard" of medical research during the second half of the 20th century (1). The goal of the phase I trial is to define and characterize the new treatment in humans and thereby set the basis for later investigations of efficacy and superiority. Therefore, the safety and feasibility of the treatment are at the center of interest. A positive risk-benefit judgment should be expected, such that the possible harm of the treatment is outweighed by the possible gain in cure, in suppression of the disease and its symptoms, and in improved quality of life and survival. The phase I trial should define a standardized treatment schedule that can be safely applied to humans and is worth being investigated further for efficacy. For non-life-threatening diseases, phase I trials are usually conducted on healthy volunteers, at least as long as the expected toxicity is mild and can be controlled without harm. In life-threatening diseases such as cancer and AIDS, phase I studies are conducted with patients, because of the aggressiveness and possible harmfulness of cytostatic treatments, because of possible systemic treatment effects, and because of the high interest in the new drug's efficacy in those patients directly. After failure of standard treatments, or in the absence of a curative treatment for seriously chronically ill patients, the new drug may represent the last remaining chance of treatment.
The methodology presented below is restricted to the treatment of cancer patients and covers the biostatistical methods for planning and analyzing phase I oncological clinical trials. The phase I trial is the first instance in which patients are treated experimentally with a new drug. Therefore, it has to be conducted unconditionally under the regulations of the Declaration of Helsinki (2), to preserve the patient's rights in an extreme experimental situation and to render the study ethically acceptable. The next section provides an outline of the task, including the definition of the maximum tolerated dose (MTD), which is crucial for the design and analysis of a phase I trial. Basic assumptions underlying the conduct of the trial and basic definitions for the statistical task are given. The presentation of phase I designs in Section III distinguishes between the determination of the dose levels (action space) and the choice of the dose escalation scheme (decision options); this constitutes the core of this chapter. Phase I designs proposed during the past 10 years are introduced there, and the sample size per dose level is discussed separately. Validations of phase I trials rely mostly on simulation studies, because the designs cannot be compared competitively in practice. Practical aspects of the conduct of a phase I trial, including the choice of a starting dose, are presented in Section IV; individual dose adjustment and dose titration studies are also addressed. Section V exhibits standard methods of analyzing phase I data, and basic pharmacokinetic methods are outlined. Regulatory aspects and guidelines are dealt with in Section VI. It will become clear that the methodology for phase I trials is at present far from optimal. Therefore, Section VII addresses practical needs, problems occurring during the conduct of a phase I trial, and future research topics.
II. TASKS, ASSUMPTIONS, AND DEFINITIONS

A. Clinical Issues and Statistical Tasks
Clinical phase I studies in oncology are of pivotal importance for the development of new anticancer drugs and anticancer treatment regimens (3,4). If a new agent has successfully passed preclinical investigations (5) and is judged as being ready for application in patients, then the first application to humans should occur within the framework of a phase I clinical trial (6–9). At this early stage, an efficacious and safe dosing is unknown and information is available at best from preclinical in vitro and in vivo studies (10). Beginning treatment at a low dose very likely to be safe (starting dose), small cohorts of patients are treated at progressively higher doses (dose escalation) until drug-related toxicity reaches a predetermined level (dose limiting toxicity [DLT]). The objective is to determine the MTD (8) of a drug for a specified mode of administration and to characterize the DLT. The goals in phase I trials are according to Von Hoff et al. (11):
1. Establishment of an MTD,
2. Determination of qualitative and quantitative toxicity and of the toxicity profile,
3. Characterization of DLT,
4. Identification of antitumor activity,
5. Investigation of basic clinical pharmacology,
6. Recommendation of a dose for phase II studies.

The primary goal is the determination of a maximum safe dose for a specified mode of treatment as the basis for phase II trials. Activity against tumors is examined and assessed, but tumor response is not a primary end point. Inherent in a phase I trial is the ethical issue that anticancer treatment is potentially both harmful and beneficial, to a degree that depends on dosage. This dose dependency is, at that stage of research, not known for humans (12). To be on the safe side, treatment starts at low doses that are probably not high enough for the drug to be sufficiently active to elicit a beneficial effect. Even worse, the experimental drug may in the end, after having passed the clinical drug development program, prove inefficacious and may have been of harm only. Seen retrospectively, patients in such phase I trials hardly had any benefit from the medical treatment. All of this is unknown at the start of the trial. The dilemma of probably unknowingly underdosing patients in the early stages of a phase I trial has been of concern and has challenged the search for the best possible methodology for the design and conduct of a phase I trial. The goal is to obtain the most information on toxicity in the shortest possible time with the fewest patients (13). Important clinical issues in phase I trials are patient selection and the identification of factors that determine toxicity, drug schedules, and the determination and assessment of target toxicity (6,11). Important statistical issues are the design parameters (starting dose, dose levels, dose escalation) and the estimation of the MTD.

B. Assumptions

Most designs for dose finding in phase I trials assume a monotone dose-toxicity relationship and a monotone dose-(tumor) response relationship (14). This idealized relationship is

biologically inactive dose < biologically active dose < highly toxic dose.

Methods considered below apply to adult cancer patients with a confirmed diagnosis of cancer not amenable to established treatment. Usually excluded are leukemias and tumors in children (9). Phase I studies in radiotherapy may require further consideration because of long-delayed toxicity. The conduct of a phase I trial requires an excellently equipped oncological center with high-quality
means for diagnosis and experimental treatment, for detection of toxicity, and for fast and adequate reaction in the case of serious adverse events. Furthermore, easy access to a pharmacological laboratory is needed for timely pharmacokinetic analyses. These requirements indicate the advisability of restricting a phase I trial to one or very few centers.

C. Definitions
Throughout this article we denote the set of dose levels at which patients are treated by D = {x_i, i = 1, . . .}, assuming x_i < x_{i+1}. The dose unit is usually mg/m² body surface area (15), but this choice has no impact on the methods, nor does the route of application. It is assumed that patients enter the study one after the other, numbered j = 1, 2, . . . , and that treatment starts immediately after entry (informed consent assumed). Denote by x(j) the dose level of patient j. The toxic response of patient j is assumed to be described by the dichotomous random variable Y_j, where Y_j = 1 indicates the occurrence of a DLT and Y_j = 0 its nonoccurrence. To comply with most articles, we denote the dose-toxicity function by ψ(x, a), with a parameter (vector) a:

P(Y = 1 | Dose = x) = ψ(x, a)    (1)

ψ(x, a) is assumed to be a continuous monotone nondecreasing function of the dose x, defined on the real line 0 ≤ x < ∞, with ψ(0, a) ≥ 0 and ψ(∞, a) ≤ 1. Small cohorts of patients of size n_k, 1 ≤ n_k ≤ n_max, are treated on a timely consecutive sequence of doses x[k] ∈ D, where n_max is a theoretical limit on the number of patients treated per dose level (e.g., n_max = 8) and where k = 1, 2, . . . counts the time periods of treatment with dose x[k]; i.e., x[k] ≠ x[h] for k ≠ h is not assumed. Notice that a dose x_i ∈ D may be visited more than once, with some time delay between visits. If the treatment at each of these dose levels x[k], k = 1, 2, . . . , lasts a fixed time length Δt (e.g., 2 months), the duration of the phase I trial is then equal to Δt times the number of cohorts entering the trial, independent of the number n_k of patients per cohort at level x[k].

D. Maximum Tolerated Dose
The MTD is defined unambiguously in terms of the observed toxicity data of the treated patients, using the notion of DLT under valid toxicity criteria (8). Drug toxicity is considered tolerable if the toxicity is acceptable, manageable, and reversible. Drug safety assessment has recently been standardized for oncological studies by the establishment of the common toxicity criteria (CTC) of the U.S. National Cancer Institute (NCI) (16). This is a large list of adverse events (AEs)
subdivided into organ/symptom categories that can be related to the anticancer treatment. Each AE has been categorized into five classes:

1. CTC grade 0: no AE or normal;
2. CTC grade 1: mildly (elevated/reduced);
3. CTC grade 2: moderate;
4. CTC grade 3: serious/severe;
5. CTC grade 4: very serious or life threatening.
The CTC grade 5, fatal, sometimes applied, is not used in the sequel, because death is usually taken as a very serious adverse event preceded by a CTC grade 4 toxicity. Of course, a death related to treatment has to be counted as DLT. The list of CTC criteria has replaced the list of the World Health Organization (17), which was based on an equivalent 0-4 scale. Investigators planning a phase I trial have to identify in the CTC list a subset of candidate toxicities for dose limitation, and they have to fix the grade at which each such toxicity is considered dose limiting, such that treatment has to be either stopped or the dose reduced. Usually, a toxicity of grade 3 or 4 is considered dose limiting. The identified subset of toxicities from the CTC list, together with the grading limits, defines the DLTs for the investigational drug. Sometimes the list of DLTs is open, such that any treatment-related AE from the CTC catalogue of grade 3 or higher is considered a DLT. During cancer therapy, patients may show symptoms from the candidate list of DLTs caused not by the treatment but by the cancer itself or by concomitant treatment. Therefore, the occurrence of any toxicity is judged by the clinician or study nurse for its relation to the investigational treatment. A commonly used assessment scale is as follows:

1. Unclear/no judgment possible,
2. Not related,
3. Possibly,
4. Probably,
5. Definitively

related to treatment (18). Often, a judgment of "possibly" or stronger (i.e., probably or definitively) is considered drug-related toxicity, called an adverse drug reaction (ADR). Therefore, one may define the occurrence of DLT for a patient more strictly: at least one toxicity from the candidate subset of the CTC criteria of grade 3 or higher has occurred and was judged as at least possibly treatment related. Obviously, this definition carries subjectivity: the choice of the candidate list of CTCs for DLT, the assessment of the grade of toxicity, and the assessment of the relation to treatment (19). Uncertainty in the assessment of toxicity has been investigated (e.g., in 20, 21). When anticancer treatment is organized in treatment cycles, mostly of 3-4 weeks, DLT is usually assessed retrospectively before the start of a new treatment cycle. In phase I studies, often two cycles are awaited before a final assessment of the DLT is made. If at least one cycle exhibits at least one DLT, the patient is classified as having reached DLT. An unambiguous definition of the assessment rules for individual DLT is mandatory for the study protocol. For the statistical analysis, each patient should be assessable at his or her dose level either as having experienced a DLT (Y = 1) or not (Y = 0). With the above definition of DLT, one can theoretically assign to each patient an individual MTD (I-MTD) as the highest dose that can be administered safely to that patient: the I-MTD is the highest dose x that can be given to a patient before a DLT occurs. No within-patient variability is considered at this point (i.e., the I-MTD is nonrandom). Because a patient can be examined at only one dose, it is not possible to observe the I-MTD; one observes only whether the given dose x exceeded the I-MTD or not. It is implicitly assumed that all patients entering a phase I trial react, in a statistical sense, identically and independently of each other. A population of patients thus gives rise to a statistical distribution, and one postulates the existence of a population-based random MTD (realized in I-MTDs) to describe the distribution of this MTD. The probability that x exceeds the random MTD is

P(x > MTD) = F(x)    (2)
and describes the proportion of the population showing a DLT when treated at dose x. This then becomes the adequate statistical model for describing the MTD. This probabilistic approach, known as the tolerance distribution model for a quantal dose-response relationship (22), allows any reasonable cumulative distribution function F on the right side of Eq. (2). F is a nondecreasing function with values between 0 and 1. In practice one should allow F(0) > 0 as a "baseline toxicity" and also F(1) < 1 for saturation of toxicity. Classes of well-known tolerance distributions are the probit, logit, and Weibull models (22). On this probabilistic basis, a practicable definition of the MTD of a phase I trial is obtained as a percentile of the statistical distribution of the (random population) MTD, as follows. Determine an acceptable proportion 0 < θ < 1 of tolerable toxicity in the patient population before accepting the new anticancer treatment. Define the MTD as that dose for which the proportion of patients exceeding the DLT is at least as large as θ: F(MTD) = θ, or MTD = F⁻¹(θ) (Fig. 1). Obviously, there is a direct correspondence between F in Eq. (2) and ψ in Eq. (1):

ψ(MTD) = P(Y = 1 | Dose = MTD) = F(MTD) = θ    (3)

If ψ(x) is monotone nondecreasing and continuous, the MTD for θ, denoted MTD_θ, is the θ percentile:

MTD_θ = ψ⁻¹(θ)    (4)
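Eq. (4) can be made concrete with a short numerical sketch: for any monotone dose-toxicity curve ψ, MTD_θ can be located by bisection. This is a minimal illustration only; the logistic curve and its parameter values below are hypothetical assumptions, not values from the chapter.

```python
import math

def mtd_from_psi(psi, theta, lo=0.0, hi=1000.0, tol=1e-8):
    """Invert a monotone nondecreasing dose-toxicity curve psi by bisection
    to obtain MTD_theta = psi^{-1}(theta), as in Eq. (4)."""
    if not (psi(lo) <= theta <= psi(hi)):
        raise ValueError("theta is not bracketed by psi on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid) < theta:
            lo = mid          # MTD lies above mid
        else:
            hi = mid          # MTD lies at or below mid
    return 0.5 * (lo + hi)

# Hypothetical logistic dose-toxicity curve psi(x) = 1/(1 + exp(-(a0 + a1*x)))
# with assumed parameters a0 = -4.0 and a1 = 0.04 (dose x in mg/m^2):
psi = lambda x: 1.0 / (1.0 + math.exp(-(-4.0 + 0.04 * x)))

mtd = mtd_from_psi(psi, theta=1.0 / 3.0)   # about 82.7 mg/m^2 here
```

By construction ψ(MTD_θ) = θ, so under the assumed parameters and θ = 1/3 the returned dose satisfies ψ(mtd) ≈ 1/3. The bisection works for any monotone ψ, not only the logistic form.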
Figure 1 Schematic dose-toxicity relationship ψ(x, a). The maximum tolerated dose MTD_θ is defined as the θ percentile of the monotone increasing function ψ(x, a) with model parameter a.
The choice of θ depends on the nature of the DLT and the type of the target tumor. For an aggressive tumor and a transient, non-life-threatening DLT, θ could be as high as 0.5. For persistent DLT and less aggressive tumors, it could be as low as 0.1 to 0.25. A commonly used value is θ = 1/3 ≈ 0.33.

E. Dose-Toxicity Modeling
The choice of an appropriate dose-toxicity model ψ(x) is important not only for the planning but also for the analysis of phase I data. Most applications use an extended logit model and apply logistic regression, because of its flexibility, the ease of accounting for patient covariates (e.g., pretreatment, disease staging, performance, etc.), and the availability of computing software. A general class of dose-toxicity models is the two-parameter family

ψ(x, a) = F(a_0 + a_1 h(x))    (5)
where F is a known cumulative distribution function, h a known dose metric, and a = (a_0, a_1) the unknown parameters. Monotone increasing functions F and h are sufficient for a monotone increasing ψ. If h(x) = x, the MTD is

MTD_θ = (F⁻¹(θ) − a_0) / a_1    (6)
Convenient functions F are the

PROBIT: F(x) = Φ(x)
LOGIT: F(x) = {1 + exp(−x)}⁻¹
HYPERBOLIC TANGENT: F(x) = {[tanh(x) + 1]/2}^(a_2)

with a further unknown parameter component a_2; see O'Quigley et al. (23).
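Under the two-parameter family (5), Eq. (6) gives MTD_θ in closed form once F can be inverted. A minimal sketch with the LOGIT choice of F; the parameter values a_0 = −4.0 and a_1 = 0.04 are illustrative assumptions only.

```python
import math

def logit_F(x):
    """LOGIT: F(x) = {1 + exp(-x)}^{-1}."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_F_inv(p):
    """Inverse of the logit cdf: F^{-1}(p) = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def mtd_theta(theta, a0, a1, F_inv=logit_F_inv):
    """Eq. (6): MTD_theta = (F^{-1}(theta) - a0) / a1
    for the model psi(x, a) = F(a0 + a1 * x)."""
    return (F_inv(theta) - a0) / a1

# Hypothetical parameters and the common target quantile theta = 1/3:
mtd = mtd_theta(1.0 / 3.0, a0=-4.0, a1=0.04)

# Sanity check: plugging the MTD back into psi recovers theta.
check = logit_F(-4.0 + 0.04 * mtd)
```

The same `mtd_theta` helper works for the probit choice by passing the normal quantile function as `F_inv`; only the link and its inverse change, not Eq. (6) itself.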
III. DESIGN

A phase I trial design has to determine which dose levels are applied to how many patients and in which sequence, given the underlying goal of estimating the MTD. This implies three tasks: determining the possible set of dose levels, choosing the dose levels sequentially, and determining the number of patients per dose level.

A. Choice of the Dose Levels—Action Space
From previous information, mostly preclinical results, a range D of possible doses (action space) is assumed. One may distinguish between a continuous set of doses D_C, a discrete finite action space D_K = {x_1 < x_2 < · · · < x_k} of an increasing sequence of doses, and an infinite ordered set D_∞ = {x_1 < x_2 < . . .}. Simple dose sets are the additive set

x_i = x_1 + (i − 1)Δx,  i = 1, 2, . . .    (7)

and the multiplicative set

x_i = x_1 · f^(i−1),  i = 1, 2, 3, . . .    (8)
where f denotes the factor by which the starting dose x_1 is increased. A pure multiplicative set cannot be recommended and is not used in phase I trials, because of its extreme danger of jumping from a nontoxic level directly to a highly toxic level. In use are modifications of the multiplicative scheme that start with a few large steps and slow down later. Such a modified action space could be the result of a mixture in which the first few low doses are obtained multiplicatively and the remaining ones additively. Another, smoother set is obtained when the factors decrease with higher doses, for example

x_i = f_{i−1} · x_{i−1},  i = 2, 3, . . .    (9)
where {f_i} is a nonincreasing sequence of factors, which may start with f_1 = 2 as a dose doubling from x_1 to x_2 and continue with 1 < f_i < 2 for i ≥ 2. The modified Fibonacci scheme, described next, is of this general type. It has been in use since the beginning of systematic phase I research.

1. Modified Fibonacci Dose Escalation

The most popular and most cited dose escalation scheme is the so-called modified Fibonacci dose escalation (MFDE) (Table 1). A review of the literature for its origin and justification as a dose-finding procedure is difficult. A number of authors (8,24) refer to a 1975 article of Goldsmith et al. (25), who present the MFDE as an "idealized modified Fibonacci search scheme" in multiples of the starting dose and as percent of increase. Two years earlier, Carter (3) had summarized the study design principles for early clinical trials. For methodology he referred in a general way to O. Selawry, Chief of the Medical Oncology Branch at the NCI in the early seventies, "who has elucidated many of the phase I study principles." Carter stated that "this scheme has been used successfully in two Phase I studies performed by the Medical Oncology Branch" in 1970, one by Hansen (26) and one by Muggia (27). Both studies are published in the Proceedings of the American Association of Cancer Research without a bibliography.
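The dose sets (7)-(9) are easy to generate programmatically. A small sketch follows; the starting dose of 10 mg/m² and the factor sequence are hypothetical values chosen only to illustrate the shapes of the three sets.

```python
def additive_set(x1, delta, n):
    """Eq. (7): x_i = x_1 + (i - 1) * delta."""
    return [x1 + (i - 1) * delta for i in range(1, n + 1)]

def multiplicative_set(x1, f, n):
    """Eq. (8): x_i = x_1 * f^(i - 1)."""
    return [x1 * f ** (i - 1) for i in range(1, n + 1)]

def factor_set(x1, factors):
    """Eq. (9): x_i = f_{i-1} * x_{i-1}, applied to a nonincreasing
    factor sequence such as a modified Fibonacci scheme."""
    doses = [x1]
    for f in factors:
        doses.append(doses[-1] * f)
    return doses

# Hypothetical starting dose of 10 mg/m^2:
add = additive_set(10.0, 5.0, 5)              # 10, 15, 20, 25, 30
mult = multiplicative_set(10.0, 2.0, 5)       # 10, 20, 40, 80, 160
mfde_like = factor_set(10.0, [2.0, 1.67, 1.5, 1.4, 1.33])
```

The printout of `mult` shows why a pure multiplicative set is dangerous: repeated doubling reaches high doses after very few steps, whereas the decreasing-factor set rises quickly at first and then flattens.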
Table 1  Evolution of the Modified Fibonacci Scheme from the Fibonacci Numbers f_n, Defined Recursively by f_{n+1} = f_n + f_{n-1}, n = 1, 2, . . . , with f_0 = 0, f_1 = 1

Fibonacci numbers f_n | Fibonacci multiples f_{n+1}/f_n | Modified Fibonacci f_{n+1}/f_n | Smoothed modified Fibonacci
 1 |  —   |  —   |  —
 2 | 2.0  | 2    | 2.0
 3 | 1.5  | 1.65 | 1.67
 5 | 1.67 | 1.52 | 1.50
 8 | 1.60 | 1.40 | 1.40
13 | 1.63 | 1.33 | 1.30–1.35
21 | 1.62 | 1.33 | 1.30–1.35
34 | 1.62 | 1.33 | 1.30–1.35
55 | 1.62 | 1.33 | 1.30–1.35
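As an illustration, the multiplicative construction of Eq. (9) can be turned into an explicit dose ladder. The starting dose and factor sequence below (following the smoothed modified Fibonacci column of Table 1) are illustrative values, not a recommendation:

```python
# Sketch: build a dose ladder from a starting dose and a sequence of
# escalation factors, as in Eq. (9).  Starting dose and factors are
# illustrative (smoothed modified Fibonacci column of Table 1).
def dose_ladder(start_dose, factors):
    doses = [start_dose]
    for f in factors:
        doses.append(round(doses[-1] * f, 2))
    return doses

mfde_factors = [2.0, 1.67, 1.50, 1.40, 1.33, 1.33, 1.33]
ladder = dose_ladder(100.0, mfde_factors)   # e.g. 100, 200, 334, 501, ...
```

Note how the relative step size shrinks from doubling toward roughly one-third increases, which is the defining property of the scheme.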
Edler
However, the relation to the Fibonacci numbers is not clarified in these early studies. In 1977, Carter (28) refers to Schneiderman (10) and "a dose escalation based on a numeral series described by the famed 13th Century Italian mathematician Leonardo Pisano, alias Fibonacci." He also reported on the use of the MFDE by Hansen et al. (29) when studying the antitumor effect of 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU) chemotherapy. Both refer to Schneiderman (10). The weakness of the foundation of the MFDE for phase I trials becomes even more evident when one looks deeper into the history. Fibonacci (Figlio Bonaccio, the son of Bonaccio), also known as Leonardo da Pisa or Leonardo Pisanus, lived from 1180 to 1250 as a mathematician at the court of Frederick II in Sicily, working in number theory and biological applications. The sequence of numbers named after Fibonacci is created by a simple recursive additive rule: each number is the sum of its two predecessors, f_{n+1} = f_n + f_{n-1}, starting from f_1 = 1 with f_0 = 0 (Table 1). Fibonacci related this sequence to the "breeding of rabbits" problem in 1202 and also to the distribution of leaves about a stem. The sequence f_n grows geometrically, approximately as φ^n/√5 with φ = (1 + √5)/2. The ratio of successive numbers f_n/f_{n-1} converges to (1 + √5)/2 = 1.618, the Golden Section, a famous principle of ancient and Renaissance architecture. The Fibonacci numbers were used in optimization and dynamic programming in the 1950s (30) for determining the maximum of a unimodal function. One application can be illustrated as follows: "How many meters long can a bent bridge be, such that one can always locate its maximum height in units of meters by measuring at most n times?" The solution is given by Bellman's theorem (30) as f_n meters. The result says nothing about the placement of the measurements, only about the number needed.
In his article on methods for early clinical trials research, Schneiderman (10) showed that he was familiar with this work on optimization and cites Bellman's result, though with the wrong page number (correct is Ref. 30, page 34, not 342). The optimization result is explained even better in Ref. 31 on page 152. Schneiderman (10) tried to transpose the Fibonacci search by fixing an initial dose x_1, a maximum possible dose x_K, and the number n of steps for moving upward from x_1 to x_K. "By taking a Fibonacci series of length n + 1, inverting the order, and spacing the doses in proportion to the n intervals in the series" (10), Schneiderman obtained an increasing sequence of doses. In contrast to the MFDE, which is based on a multiplicative set of doses, this approach is still essentially additive. But it leads to smaller and smaller steps toward higher doses, similar to the MFDE. However, the steps obtained by Schneiderman's inversion are very large at the beginning and very small later. This escalation is fixed to K doses and does not open to higher doses if the MTD was not reached at x_K. Schneiderman discussed restarting the scheme if no toxicity was seen at x_K and concluded that then "no guide seems to exist for the number of steps." And he concedes that he has "not seen it in any published
account of preliminary dose finding." The number of steps in this reversed Fibonacci scheme is strongly related to the escalation factor and so provides no guidance for dose escalation. In the same article, however, he cites De Vita et al. (32), in which a dose escalation with factors of 2, 1.5, 1.33, and 1.25 is used. A hint at the use of the inverse of a Fibonacci scheme, where the dose increments decrease with increasing numbers, is also given by Bodey and Legha (8), who refer to Ref. 25, "who examined the usefulness of the modified Fibonacci method as a guide to reaching the MTD." In summary, it seems that the idea of the so-called MFDE came up at the NCI in the sixties, when the early clinical trials programs started there, and was promoted by the scientists mentioned above. They searched for a dose escalation scheme that slows down from doubling the dose to smaller increases within a few steps. The MFDE (Table 1), slowing the increase from 65% to 33% within the first five steps, seemed reasonable enough to be used in many trials. The method has been successful to the extent that MTDs have been determined through its use. From empirical evidence and the simulation studies performed later, however, the MFDE now appears to be too conservative in too many cases.

2. Starting Dose

The initial dose given to the first patients in a phase I study should be low enough to avoid severe toxicity but also high enough to offer a chance of activity and potential efficacy in humans. Extrapolation from preclinical animal data has focused on the lethal dose 10% (LD10) of the mouse (the dose with 10% drug-induced deaths) converted into equivalents in units of mg/m2 of body surface area (33). The standard starting dose became 1/10 of the minimal effective dose level for 10% deaths (MELD10) of the mouse, after verification that no lethal and no life-threatening effects were seen in another species, for example, rats or dogs (7,11,34).
Earlier recommendations had used higher portions of the MELD10 (mouse) or other characteristic doses, for example, the lowest dose with toxicity (toxic dose low) in mammals (35).

B. Dose Escalation Schemes

If a clinical action space has been defined as a set of dose levels D, the next step in designing a phase I trial consists of establishing a rule by which the doses of D are assigned to patients. Proceeding from a starting dose x_1, the sequence of dosing has to be fixed in advance in a so-called dose escalation rule. This section starts with the traditional escalation rules (TER). Those rules are also known as "3 + 3" rules because it became usual to enter three patients at a new dose level and, when any toxicity was observed, to enter six patients in total at that dose level (11) before deciding to stop at that level or to increase the dose. Carter
(3) is also the first source I am aware of in which the so-called 3 + 3 rule (the traditional dose escalation scheme) is listed as a phase I study principle. Two versions of this 3 + 3 rule are described below as TER and strict TER (STER), respectively. Then we introduce the up-and-down rules (UaD) as fundamental but not directly applicable rules and turn from these to Bayesian rules and the intensively, sometimes controversially, discussed continual reassessment method (CRM). Methods for the determination of the MTD have to be addressed in this context as well.

1. Traditional Escalation Rule

A long-used standard phase I design has been the TER, where the dose escalates in D_K or D_∞ step by step from x_i to x_{i+1}, i = 1, 2, . . . , with three to six patients per dose level; see Table 2 for an example taken from Ref. 36. Using TER, patients are treated in cohorts of three, each receiving the same dose, say x_i. If none of the three patients shows a DLT at level x_i, the next cohort of three patients receives the next higher dose x_{i+1}. Otherwise, a second cohort of three is treated at the same level x_i again. If exactly one of the six patients treated at x_i exhibits DLT, the trial continues at the next higher level x_{i+1}. If two
Table 2  Example of a Phase I Study Performed According to the Standard Design (TER)

Dose level | Dosage | Escalation factor | N | DLT | Grade
 1 | 0.45 |  —   | 3 | 0 | 000
 2 | 0.9  | 2    | 6 | 1 | 111311
 3 | 1.5  | 1.67 | 6 | 1 | 041111
 4 | 3    | 2    | 3 | 0 | 012
 5 | 4.5  | 1.5  | 4 | 0 | 1122
 6 | 6.0  | 1.33 | 3 | 0 | 111
 7 | 7.5  | 1.25 | 3 | 1 | 113
 8 | 10.5 | 1.4  | 3 | 0 | 111
 9 | 13.5 | 1.29 | 4 | 1 | 0320
10 | 17.5 | 1.29 | 3 | 0 | 010
11 | 23.1 | 1.33 | 6 | 1 | 123122
12 | 30.0 | 1.3  | 5 | 4 | 33331
13 | 39.0 | 1.3  | 1 | 1 | 3

Each row shows the number of patients N at that dose level, the number of cases with DLT (defined as grade 3–4), and the actually observed toxicity grades of the N patients. (From Ref. 36.)
or more patients of the six exhibit DLT at the level x_i, the escalation stops at that level. When the escalation has stopped, various alternatives for treating a few more patients are in use:
1. Treat a small number of additional patients at the stopping level x_i, e.g., to a total of eight patients.
2. Treat another cohort of three patients at the next lower level x_{i-1} if six patients had not already been treated there.
3. Treat another cohort of three patients at all next lower levels x_{i-1}, x_{i-2}, . . . at which only three patients had been treated earlier, possibly going down as far as x_1.
4. Treat a limited number of patients at a level not previously included in D, located between x_{i-1} and x_i.

A slightly more conservative escalation is implemented in a modified TER, denoted here as STER. Using STER, patients are treated in cohorts of three, each receiving the same dose, say x_i. If none of the three patients shows a DLT at level x_i, the next cohort of three patients receives the next higher dose x_{i+1}. If one DLT is observed, three further patients are included at the same level x_i and the procedure continues as TER. If two or three DLTs are observed among the first three of that cohort, escalation stops at x_i and the dose is de-escalated to the next lower level x_{i-1}, where a prefixed small number of additional patients is treated according to one of the options 2–4. STER can be described formally as follows. Assume that j patients have been treated at lower dose levels x(1), . . . , x(j) < x_i before the turn to the level x_i, and that n_{i-1} patients have been treated at dose level x_{i-1}. Denote by S^m_{ji} the number of patients with a DLT among m patients at dose level x_i when j patients have been treated before. Then

x(j + 1) = x(j + 2) = x(j + 3) = x_i    (10)

and set

x(j + 4) = x(j + 5) = x(j + 6) =  x_{i+1}  if S^3_{ji} = 0 (continue)
                                  x_i      if S^3_{ji} = 1 (continue)
                                  x_{i-1}  if S^3_{ji} ≥ 2 (stop)       (11)

Then set next

x(j + 7) = x(j + 8) = x(j + 9) = x_{i+1}  if S^6_{ji} ≤ 1
x(j + 7) = x(j + 8) = x(j + 9) = x_{i-1}  if S^6_{ji} ≥ 2 and n_{i-1} < 6,
or stop                                   if S^6_{ji} ≥ 2 and n_{i-1} = 6    (12)
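The cohort logic of Eqs. (10)–(12) can be sketched as a small decision function. The function name, the return format, and the `strict` flag distinguishing STER from TER are illustrative choices, not from the source:

```python
# Sketch of the 3+3 cohort decision for TER and STER, following the
# escalation rules of Eqs. (10)-(12).  Returns (next level, stop flag).
def three_plus_three(i, dlt3, dlt6=None, strict=False, n_below=0):
    """Decide the next dose level after a cohort at level i.

    dlt3    -- DLTs among the first three patients at level i
    dlt6    -- DLTs among all six patients, once a second cohort was treated
    strict  -- STER: de-escalate immediately on >= 2 DLTs among the first three
    n_below -- patients already treated at level i - 1 (used by STER)
    """
    if dlt6 is None:                      # first cohort of three, cf. Eq. (11)
        if dlt3 == 0:
            return i + 1, False           # escalate
        if dlt3 == 1 or not strict:
            return i, False               # expand to six at the same level
        return i - 1, True                # STER: >= 2 DLTs, de-escalate and stop
    if dlt6 <= 1:                         # cf. Eq. (12)
        return i + 1, False               # at most one DLT in six: escalate
    if strict and n_below < 6:
        return i - 1, True                # treat more patients one level down
    return i, True                        # stop: MTD exceeded at level i

assert three_plus_three(3, 0) == (4, False)
assert three_plus_three(3, 2, strict=True) == (2, True)
```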
If the escalation stops at a dose level x_i, that level would be the first of unacceptable DLT, and one would conclude that the MTD is exceeded. The next lower level x_{i-1} is then considered as the MTD. It is common practice that at the dose level of the MTD at least six patients are treated. For this reason, options 2 and 3 above are often applied at the end of a phase I trial. Therefore, the MTD can be characterized as the highest dose level below the stopping dose level x_i at which at least six patients have been treated with no more than one case of DLT. If no such dose level can be identified, the starting dose level x_1 is taken as the MTD.

2. Random Walk (RW) Designs

A large class of escalation rules is based on the sequential assignment of doses to one patient after the other. Those rules have their origin in sequential statistical designs and in stochastic approximation theory. If the dose assignment to the current patient depends only on the result seen in the previous one, the assignment process becomes Markovian and performs a random walk on the action space D. Basically, a patient is assigned to the next higher, the same, or the next lower dose level with a probability that depends on the previous subject's response. RW designs operate mostly on the finite lattice of increasingly ordered dosages D_K = {x_1 < ⋅ ⋅ ⋅ < x_K}. A Markov chain representation of the random walk on D is given in Ref. 37. In principle, the MTD is estimated after each patient's treatment, and the next patient is then treated at that estimated level. All optimality results for RW designs require that the set D of doses remain unchanged during the trial. They are simple to implement, essentially nonparametric, and of known finite and asymptotic distributional behavior. RW designs have been applied to phase I studies (37,38). Early prototypes in statistical theory were the UaD (39), proposed originally for explosives testing, and the stochastic approximation method (SAM) (40).
SAM has never been considered seriously for phase I trials. One reason may be its use of a continuum of dose levels, leading to impracticable differentiation between doses; another could be the ambiguity of the adapting parameter sequence {a_j} (but see Ref. 41). The main reason was stated already in Ref. 10: "the up and down procedures and the usual overshooting are not ethically acceptable in an experiment on man." However, the UaD rules were more recently adapted for medical applications by considering grouped entry, biased coin randomization, and Bayesian methods (38). Thus, the elementary UaD has been reintroduced into phase I trials, cited as Storer's B design (37), as a tool to construct more appropriate combination designs.

Elementary UaD. Given that patient j has been treated at dose level x(j) = x_i, the next patient j + 1 is treated at the next lower level x_{i-1} if a DLT was observed in patient j, otherwise at the next higher level x_{i+1}. Formally,
x(j + 1) =  x_{i+1}  if x(j) = x_i and no DLT
            x_{i-1}  if x(j) = x_i and DLT        (13)
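The walk of Eq. (13) is easy to simulate; the dose–toxicity probabilities and all names below are assumptions for illustration only:

```python
import random

# Sketch of the elementary up-and-down rule, Eq. (13): one level down after
# a DLT, one level up otherwise, reflected at the ends of the dose lattice.
# The dose-toxicity probabilities are purely illustrative.
def uad_walk(n_patients, tox_prob, start=0, seed=1):
    random.seed(seed)                     # reproducible simulated trial
    level, visited = start, []
    for _ in range(n_patients):
        visited.append(level)
        dlt = random.random() < tox_prob[level]
        level = max(0, level - 1) if dlt else min(len(tox_prob) - 1, level + 1)
    return visited

path = uad_walk(20, [0.05, 0.10, 0.20, 0.35, 0.55])
```

In a typical run the walk drifts upward while toxicity is rare and then oscillates around the levels where the DLT probability approaches the targeted quantile.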
Two modifications of the elementary rule were proposed (37): "modified by two UaD," or Storer's C design (UaD-C), and "modified by three UaD," or Storer's D design (UaD-D), which is quite similar to the 3 + 3 rule:

UaD-C: Proceed as in UaD but escalate only if two consecutive patients are without DLT.
UaD-D: Three patients are treated at a new dose level. Escalate if no DLT occurs and de-escalate if more than one DLT occurs. If exactly one patient shows a DLT, another three patients are treated at the same level and the rule is repeated.

For an algorithm see O'Quigley and Chevret (42). The two designs were combined with the elementary UaD into Storer's BC and BD two-stage designs:

UaD-BC: Use the UaD until the first toxicity occurs and continue with the UaD-C at the next lower dose level.
UaD-BD: Use the UaD until the first toxicity occurs and continue with the UaD-D design.

Simulations revealed a superiority of the UaD-BD over the UaD-BC and the elementary UaD (37). Although the single-stage designs UaD, UaD-B, and UaD-C were not considered sufficient and only the two-stage combinations were proposed for use (37), proposals of new designs were unfortunately calibrated mostly against the one-stage designs instead of the more successful two-stage designs, so that recommendations for practice are hard to deduce from those investigations. A newer sequential RW (38) is the so-called biased coin design (BCD), applicable for an action space D_K = {x_1 < ⋅ ⋅ ⋅ < x_K}. Using BCD, given that patient j has been treated at dose level x(j) = x_i, the next patient j + 1 is treated at the next lower level x_{i-1} if a DLT was observed in patient j, otherwise at x_i with some probability p# not larger than 0.5 or at x_{i+1} with probability 1 − p# (not smaller than 0.5). On reaching the boundaries x_1 and x_K, the procedure must stay there.
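The BCD transition just described can be sketched in a few lines; the function name and index conventions are illustrative:

```python
import random

# Sketch of the biased coin design (BCD): de-escalate after a DLT; otherwise
# stay with probability p# and escalate with probability 1 - p#.  For the
# target quantile theta = 1/3, p# = theta / (1 - theta) = 0.5.
def bcd_next_level(level, dlt, p_sharp, n_levels):
    if dlt:
        return max(0, level - 1)          # one level down, reflected at x_1
    if random.random() < p_sharp:
        return level                      # stay at the same level
    return min(n_levels - 1, level + 1)   # one level up, reflected at x_K

theta = 1 / 3
p_sharp = theta / (1 - theta)             # 0.5: the "unbiased coin" case
```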
This design centers the dose allocation unimodally around the MTDθ for any θ, 0 < θ ≤ 0.5, if p# is chosen as θ/(1 − θ) and if the MTD lies in the interior of the dose space D_K (38). Interestingly, the 33rd percentile, MTD1/3, is obtained with an unbiased coin of probability p# = (1/3)/(2/3) = 0.5.

3. Continual Reassessment Method

To reduce the number of patients treated with possibly ineffective doses under the TER/STER or the RW type designs, a Bayesian-based dose escalation rule was
introduced by O'Quigley et al. (23). A starting dose is selected using a prior distribution of the MTD, and this distribution is updated with each patient's observation of the absence or presence of DLT. Each patient is treated at the dose level closest to the currently estimated MTD. We describe the CRM next and postpone a straightforward Bayesian design to the next subsection. Assume a finite action space D_K = {x_1 < ⋅ ⋅ ⋅ < x_K} of dose levels, a fixed sample size N, and a one-parameter dose–toxicity function ψ(x, a) as defined in Eq. (1), depending on the model parameter a. Estimation of the MTD is therefore equivalent to estimation of a. Assume a unique solution a_0 at the MTDθ:

ψ(MTD, a_0) = θ    (14)
Let Y_j denote the dichotomous response variable of the jth patient, j = 1, . . . , N, and summarize the sequentially collected dose–toxicity information up to the (j − 1)th patient by Ω_j = {y_1, . . . , y_{j-1}}. The information about the parameter a is described by its density function f(a, Ω_j), the posterior density of the parameter a given Y_i = y_i, i = 1, . . . , j − 1. In the following, a is assumed to be a scalar, a > 0, and f is normalized by

∫_0^∞ f(a, Ω_j) da = 1,   j ≥ 1.    (15)
Using CRM, given the previous information Ω_j = {y_1, . . . , y_{j-1}}, the next dose level is determined such that it is closest to the current estimate of the MTDθ. For this, the probability of a toxic response is calculated for each x_i ∈ D_K given Ω_j as

P(x_i, a) = ∫_0^∞ ψ(x_i, a) f(a, Ω_j) da = θ_ij    (16)
and the dose level x = x(j) for the jth patient is selected from D_K such that the distance of this probability from the target toxicity rate θ becomes minimal: x(j) = x_i such that |θ − θ_ij| is minimal. After observing the toxicity Y_j at dose level x(j), the posterior density of the parameter a is obtained from the prior density f(a, Ω_j) and the likelihood of the jth observation

L(y_j, x(j), a) = ψ(x(j), a)^{y_j} [1 − ψ(x(j), a)]^{1-y_j}    (17)
using Bayes' theorem as

f(a, Ω_{j+1}) = L(y_j, x(j), a) f(a, Ω_j) / ∫_0^∞ L(y_j, x(j), u) f(u, Ω_j) du    (18)
The CRM starts with an a priori density g(a). The MTD is then estimated as the dose x(N + 1) that would be assigned to the (N + 1)st patient. Consistency, in the sense that the recommended dose converges to the target level, was shown even under
model misspecification (43). There are cases where it converges to a close but not the closest level. The treatment of batches of patients per dose level has also been proposed (23).

4. Modifications of the CRM

The CRM was criticized mainly on three counts: the choice of the starting dose x(1) according to a prior g(a), which could result in a dose level in the middle rather than in the lower dose region; the allowance to jump over a larger part of the dose region, skipping intermediate levels; and the lengthening of the trial caused by allowing treatment and examination of toxicity for only one patient at a time. The ethical argument of risking too high toxicity and the practical argument of undue duration elicited modifications that try to reduce that risk while keeping the benefit of reaching the MTD with a smaller number of dose levels and treating more patients at effective doses. Modifications of the CRM were obtained (44–48) through restrictions on choosing x_1 as the starting dose, not skipping consecutive dose levels, and allowing groups of patients at one dose level. From these proposals evolved a design sketched in Ref. 49, relying on a suggestion of Faries (45). Using modified CRM, start with one patient at dose level x_1 and apply the CRM. Given that patient j − 1 has been treated at dose level x(j − 1) = x_i and the information Ω_j = {y_1, . . . , y_{j-1}} predicts the dose level x_CRM(j) for the next patient j, the next dose level x(j) is chosen as follows:

x(j) =  x_CRM(j)  if x_CRM(j) ≤ x(j − 1)
        x_{i+1}   if x_CRM(j) > x(j − 1) and y_{j-1} = 0 (no DLT)
        x_i       if x_CRM(j) > x(j − 1) and y_{j-1} = 1 (DLT)        (19)
The main restrictions in Eq. (19) are the start at x_1 and the nonskipping. Further modifications demonstrate the efforts to reduce anticonservatism in the CRM:
1. Modified CRM, but stay with x(j) always one dose level below x_CRM(j) (version 1 in Ref. 45).
2. Modified CRM, but x(j) is not allowed to exceed the MTD estimate based on Ω_j (version 2 in Ref. 45).
3. Modified CRM, but use the starting level of the CRM and enter three patients there (version 4 in Ref. 45).
4. CRM, but escalation (i.e., x_CRM(j) > x(j − 1)) is restricted within D_K = {x_1 < ⋅ ⋅ ⋅ < x_K} to one step only (restricted version in Ref. 46).
5. Modified CRM, but stop if the next level has already been visited by a predetermined number of patients (e.g., six) (44). Korn et al. (44) additionally introduce a dose level x_0 < x_1 as an estimate of the MTD if a formally proposed MTD equal to x_1 would exhibit unacceptable toxicity.
6. Modified CRM run in three variants of one, two, or three patients per cohort (48). This version is identical to modification 5, except that a dose level cannot be passed after a DLT was observed.
7. CRM, but allowing more than one patient at a time at a dose level and limiting escalation to one step (47). A simulation study showed a reduction of trial duration (50–70%) and of toxicity events (20–35%) compared with the CRM.
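The core restriction of Eq. (19) — follow the CRM downward freely, but climb at most one step and only after a nontoxic outcome — fits in a few lines; the index conventions and function name are illustrative:

```python
# Sketch of the non-skipping restriction in Eq. (19): the CRM's recommended
# level is followed only when it does not jump upward; upward moves are one
# step at most and require that the previous patient had no DLT.
def restricted_level(crm_level, prev_level, prev_dlt):
    if crm_level <= prev_level:
        return crm_level                  # de-escalation or staying: allowed
    # CRM recommends a higher dose:
    return prev_level if prev_dlt else prev_level + 1

assert restricted_level(4, 1, prev_dlt=False) == 2   # climb one step only
assert restricted_level(4, 1, prev_dlt=True) == 1    # stay after a DLT
assert restricted_level(0, 2, prev_dlt=True) == 0    # de-escalation allowed
```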
Another approach to modifying the CRM consists in starting with the UaD until the first toxicity occurs and then switching to the CRM using all information obtained so far (46). Implicitly, this extended CRM starts at x_1 and escalates by no more than one step until the first toxicity occurs, but it may cover more doses than the CRM. Further modifications of this type, with two and three patients per cohort, were investigated (46,48).

Modifications Using Toxicity of CTC Grade 2 (Secondary Grade Toxicity). A substantial modification of the previously discussed phase I designs was the suggestion to use toxicity information beyond that defined for the determination of the DLT (44,45,48). It was proposed to account for the observation of grade 2 toxicity (secondary toxicity), which does not normally contribute to the DLT criterion. The CRM was modified such that two patients were treated at the level where the first patient had exhibited secondary toxicity (version 3 in Ref. 45). Referring to this and to a previous proposal (44), Ahn (48) implemented a so-called secondary grade design similar to the UaD-BD design: the elementary UaD is used with two patients per cohort; if one patient of the two shows a DLT, or if there is a second case of grade 2 toxicity, dose escalation continues with the standard STER design.

5. Bayesian Designs

Bayesian methods are attractive for phase I designs because they can be applied even when few data are present but prior information is available. Furthermore, Bayesian methods are excellently adapted to decision making, and they tend to allocate patients to higher doses after nontoxic and to lower doses after toxic responses. The accumulation of information and the capacity for real-time decisions make Bayesian designs best suited when patients are treated one at a time and when toxicity information is available quickly.
A full Bayesian design (50) is set up by the data model ψ(x, a), the prior distribution g(a) of the model parameter a, and a continuous set of actions D. A gain function (negative loss) G is needed to characterize the gain of information when an action is taken, given the true (unknown) value of a or, equivalently, of the MTD. Whitehead and
Brunier (50) use the precision of the MTD estimate obtained from the next patient as the gain function. The number of patients is fixed in advance. Given the responses Y_j = y_j, the Bayes rule determines the posterior density g(a | y_j) of the parameter a as

g(a | y_j) = ψ(y_j, a) g(a) / ∫ ψ(y_j, a) g(a) da    (20)
A new dose level is then selected by maximizing the posterior expected gain

E[G(a) | y_j]    (21)
Given a, the likelihood of the data Ω_j for the first j − 1 patients is a product of the terms (17) already used for the CRM:

f_{j-1}(Ω_j, a) = ∏_{s=1}^{j-1} ψ(x(s), a)^{y_s} [1 − ψ(x(s), a)]^{1-y_s}    (22)
The next dose x(j) is then determined such that the posterior expected gain is maximized with respect to x(j). For details see Ref. 50. Gatsonis and Greenhouse (51) used a Bayesian method to estimate directly the dose–toxicity function and the MTDθ in place of the parameter a. The probability α = P(Dose > MTDθ) of overdosing a patient was used as the target parameter for Bayesian estimation in the so-called escalation with overdose control (EWOC) method (41). Using EWOC, given the previous results Ω_j of the first j − 1 patients, obtain the posterior cumulative distribution of the MTD as π_j(x) = P(MTD ≤ x | Ω_j), the conditional probability of overdosing the jth patient with the dose x, given the current data Ω_j. The criterion for determining the dose level for the jth patient is π_j(x(j)) ≤ α. The posterior density of {ψ(x_1, a), MTDθ} given Ω_j is obtained from the prior distribution of a, and from it the marginal posterior density π(MTD | Ω_j). The dose level x(j) is then calculated as x(j) = π_j^{-1}(α).

C. Sample Size per Dose Level

The number of patients to be treated per dose level was often implicitly determined by the previously described designs. Statistical methods for optimal sample sizes per dose level seem to be missing. Recommendations vary between one and eight patients per dose level; other suggestions were three to six patients per lower dose level (9) or a minimum of three per dose level and a minimum of five near the MTD (7). Calculations of sample sizes separately from the sequential design can be based on the binomial distribution and hypothesis testing of toxicity rates (52). This gives some quantitative aid in planning the sample size per selected
dose level. Using the two probabilities P_AT and P_UAT of acceptable toxicity (AT) and unacceptable toxicity (UAT), tables for a fixed low P_AT = 0.05 and a few P_UAT values were given by Rademaker (52), with a straightforward algorithm. Characteristically, the sample size and the error rates increase rapidly as the two probabilities approach each other. If P_AT = 0.05, a sample size of n ≤ 10 per dose level is achieved only if P_UAT ≥ 0.2. If n = 6, the probability of escalating is 0.91 if the toxicity rate equals P_AT = 0.05, and the probability of not escalating is 0.94 if the toxicity rate equals P_UAT = 0.4. The latter probability decreases from 0.94 to 0.90, 0.83, and 0.69 as n decreases from 6 to 5, 4, and 3, respectively. If P_UAT = 0.3, the probability of not escalating decreases from 0.85 to 0.77, 0.66, and 0.51 as n decreases from 6 to 5, 4, and 3. Given the toxicity rate p, one may determine a sample size by considering the probability POT(p) of overlooking this toxicity via the simple formula POT(p) = (1 − p)^n (53,54). Given p = 0.33, POT(p) takes the values 0.09, 0.06, 0.04, 0.03, 0.02, 0.012, and 0.008 for n = 6, 7, 8, 9, 10, 11, and 12.
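The quoted POT values follow directly from the formula; the function name is an illustrative choice:

```python
# POT(p) = (1 - p)**n: probability that a toxicity with true rate p is not
# observed in any of n patients at a dose level (cf. Refs. 53, 54).
def prob_overlook_toxicity(p, n):
    return (1 - p) ** n

# reproduces, up to rounding, the values quoted above for p = 0.33
for n in range(6, 13):
    print(n, round(prob_overlook_toxicity(0.33, n), 3))
```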
D. Validation and Comparison of Phase I Designs
Simulation studies were performed to compare new designs with the standard designs TER/STER or to compare the CRM with its modifications (23,37,38,42,44,45). Unfortunately, the results are based on different simulation designs and different criteria of assessment, so the comparisons are of limited value. The criteria of the comparisons were:
1. Distribution of the MTD estimates over the dose levels,
2. Distribution of the occurrence of the dose levels,
3. Average number of patients,
4. Average number of toxicities,
5. Toxicity probability (percentage treated above the MTD),
6. Percentage treated at the MTD or one level below,
7. Average number of steps (cohorts treated on successive dose levels),
8. Percentage of correct MTD estimation.
Roughly summarizing the findings of those studies without going into details, it appears that the TER is inferior to the UaD-BC and the UaD-BD (37) in terms of the fraction of successful trials. The UaD-BD is superior to the UaD-BC, at least in some situations. The BCD design from the RW class and the CRM performed very similarly (38). Judged by the distribution of the occurrence of toxic dose levels, the percentage of correct estimation, or the percentage treated at the MTD or one level below (23), the single-stage UaD designs were inferior to the CRM (23); the UaD recommended toxic levels much more often than the CRM. STER designs were inferior to some modified CRMs with respect to the percentage of correct estimation, the percentage treated at the MTD or one level below,
and the percentage of patients treated at very low levels (44,45). The CRM and some modified CRMs needed on average more steps than the STER, and they showed a tendency to treat more patients at levels higher than the MTD. On the other hand, the STER more often recommended dose levels lower than the true MTD. The STER provided no efficient estimate of the MTD and did not stop at the specified percentile of the dose–response function (55).
IV. CONDUCT OF THE TRIAL

A. Standard Requirements

Before a phase I trial is initiated, the following design characteristics should be checked and defined in the study protocol:
1. Starting dose x_1,
2. Dose levels D,
3. Prior information on the MTD,
4. Dose–toxicity model,
5. Escalation rule,
6. Sample size per dose level,
7. Stopping rule,
8. Rule for completion of the sample size when stopping.
A simulation study is recommended before the clinical start of the trial. Examples are found among the studies mentioned in Section II.D, and software has recently become available from Simon et al. (56).

B. Dose Titration Approach

Patients in a phase I trial in oncology are treated mostly with the therapeutic aim of palliation and relief; the goal of cure would be unrealistic in most cases, but achievement of a partial remission of the tumor remains a realistic motivation. The assessment of toxicity can therefore be based on only a few treatment cycles, mostly two. If treatment with the experimental drug is without complication and at least a status quo of the disease is retained, treatment normally continues. The study protocol usually makes provisions for individual dose reductions in case of non-DLT toxicity. By the same rationale by which the dose is decreased in the case of toxicity, it should be allowed to increase in the case of nontoxicity and good tolerance. Individual dose adjustment has been discussed repeatedly as a possible extension of the design of a phase I trial (6,11,12,56). Intraindividual dose escalation was proposed (6,11) provided that sufficient time has elapsed after the last treatment course, such that any existing or late-occurring toxicity could have been observed before an elevated dose is applied. It was recommended that patients escalated to a certain level be accompanied by "fresh" patients at that level to allow the assessment of cumulative toxic effects or accumulating tolerance. Therefore, all patients should be at least 3–4 weeks on their primarily scheduled dose before an intraindividual escalation is performed. When planning intraindividual dose escalation, one should weigh the advantage of dose increase and faster escalation in the patient population against the risks of cumulative toxicity in the individual patients. Further consideration should be given to the development of tolerance in some patients and the risk of then treating new patients at too high levels (12). The STER was modified in Ref. 56 by a dose titration design. The end point was defined as a categorical variable with four levels: acceptable toxicity (grade ≤ 1), conditionally acceptable toxicity (grade 2), DLT (grade 3), and unacceptable toxicity (grade 4). Two intraindividual strategies were considered: one uses no intraindividual dose escalation and only de-escalation by one level per course in case of DLT or unacceptable toxicity; the other uses escalation by one level per course as long as no DLT or unacceptable toxicity occurs, with de-escalation as in the first case. Three new designs were formulated by using one of these two options given an action space D or D_K:

Speed-up Design: Escalate the dose after each patient by one level as long as no DLT or unacceptable toxicity occurs in the first cycle and at most one patient shows grade 2 toxicity in the first cycle. If a DLT occurs in the first cycle or if grade 2 toxicity has occurred twice in the first cycle, switch to the STER with the second intraindividual strategy.

Accelerated Speed-up Design: Same as the Speed-up Design except that doubling dose escalation is used in the first stage before switching to STER, again using the second intraindividual strategy.
Modified Accelerated Speed-up Design: Same as the Accelerated Speed-up Design but with no restrictions on the escalation with respect to the cycle. If one DLT occurred in any cycle or if grade 2 toxicity occurred twice in any cycle, switch to the STER (second intraindividual strategy).

C. Toxicity–Response Approach
A design involving both dose finding and evaluation of safety and efficacy in one early phase I/II trial was proposed by Thall and Russell (57). The goal was to define a dose satisfying both safety and efficacy requirements, to stop early when it was likely that no such dose could be found, and to continue if there was enough chance to find one. The number of patients should be large enough to estimate reliably both the toxicity rate and the response rate at the selected dose. Two toxicity outcomes (absence or presence of toxicity) and three efficacy-related outcomes (no effect, desired effect, undesired effect) are considered. Toxicity and efficacy outcomes were combined into a comprehensive ternary outcome variable Y with the values Y = 0 if no effect and no toxicity, Y = 1 if desired effect and no toxicity, and Y = 2 if undesired effect and toxicity occurred. This results in three dose-dependent outcome probabilities ψj(d) = P(Y = j | Dose = d), j = 0, 1, 2, and in a two-dimensional dose–effect end point ψ(d) = (ψ1(d), ψ2(d)). The goal is to find a dose d such that the effect-and-no-toxicity probability is reasonably large, e.g., ψ1(d) ≥ θ1* (e.g., 0.50), and the undesired-effect-and-toxicity probability is limited, ψ2(d) ≤ θ2* (e.g., 0.33). The combined dose–response relationship is parameterized as γj(d) = P(Y ≥ j | Dose = d) for j = 0, 1, 2. As dose–response models for (γ1, γ2), the proportional odds regression model and the cumulative odds model are considered (58), and a strategy is developed to find in DK the dose d* that satisfies both criteria.
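The two acceptability criteria can be checked numerically once a model for (γ1, γ2) is assumed. The sketch below uses a proportional-odds form with invented parameters; the cutpoints `alpha` and slope `beta` are illustrative choices, not fitted values from Ref. 57:

```python
import math

def gammas(d, alpha=(-2.0, -5.0), beta=1.0):
    """gamma_j(d) = P(Y >= j | Dose = d) under a proportional-odds model
    with hypothetical cutpoints alpha and common slope beta."""
    return [1.0] + [1.0 / (1.0 + math.exp(-(a + beta * d))) for a in alpha]

def acceptable(d, theta1=0.50, theta2=0.33):
    g = gammas(d)
    psi1 = g[1] - g[2]  # P(desired effect and no toxicity)
    psi2 = g[2]         # P(undesired effect and toxicity)
    return psi1 >= theta1 and psi2 <= theta2

# Scan a discrete action space for doses meeting both criteria:
print([d for d in [1, 2, 3, 4, 5] if acceptable(d)])  # [3, 4]
```

Low doses fail the efficacy bound ψ1(d) ≥ θ1*, high doses fail the toxicity bound ψ2(d) ≤ θ2*, and the acceptable window lies in between.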
V. EVALUATION OF PHASE I DATA
The statistical evaluation of the data resulting from a phase I trial has two primary objectives: the estimation of the MTD and the characterization of the DLT. The MTD is obtained through dose–toxicity modeling. Evaluation of the DLT has to account for dose and time under treatment and should use pharmacokinetic modeling. Besides these objectives, phase I data have to be presented in full detail using transparent descriptive methods.

A. Descriptive Statistical Analysis

All results obtained in a phase I trial have to be reported in a descriptive statistical analysis that accounts for the dose levels. This is somewhat cumbersome because each dose level has to be described as a separate stratum. The evaluation of the response can usually be restricted to a case-by-case description of all those patients who exhibit a partial or complete response. Patients with stable disease for a longer period or patients with a minor improvement not sufficient for partial response may also be described and emphasized. A comprehensive and transparent report of all toxicities observed in a phase I trial is an absolute must for both the producer's (drug developer) and the consumer's (patient) risks and benefits. Absolute and relative frequencies (related to the number of patients evaluable for safety) are reported for all toxicities of the CTC list, distinguishing the grading and the assessment of the relation to treatment. Table 3 provides an example. It exhibits the complete toxicity observed, the assessment of the relation to treatment, and two subtables that summarize, for patients exhibiting toxicity, the ADRs and DLTs defined in Section II.C. A description of the individual toxicity burden of each patient can be made separately using individual descriptions, possibly supported by modern graphical methods such as linked scatterplots for multivariate data. Multiplicity of DLTs in some patients can be presented in this way (see, e.g., Benner et al. (59)).
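The frequency reporting can be reproduced mechanically; the short sketch below tabulates hypothetical worst grades for six evaluable patients, matching the counts shown in part A of Table 3:

```python
from collections import Counter

# Hypothetical worst CTC grades of vomiting for 6 evaluable patients:
grades = [0, 0, 1, 2, 3, 3]
n = len(grades)
counts = Counter(grades)

for g in range(5):
    k = counts.get(g, 0)
    print(f"grade {g}: {k} ({100 * k / n:.0f}%)")
k_pos = n - counts.get(0, 0)
print(f"total with grade > 0: {k_pos} ({100 * k_pos / n:.0f}%)")
```

Per-grade counts and percentages relative to the number of patients evaluable for safety are exactly the quantities reported per dose level.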
Table 3 Descriptive Evaluation of Phase I Toxicity Data

A. Absolute and Relative Frequencies of Toxicity

CTC item    n           Grade 0    Grade 1    Grade 2    Grade 3    Grade 4    Total with grade > 0
Vomiting    6 (100%)    2 (33%)    1 (17%)    1 (17%)    2 (33%)    0 (0%)     4 (67%)

B. Assessment of the Relation of the Toxicity to the Treatment by Case Type Listing

Grade 0    Grade 1     Grade 2     Grade 3                 Grade 4
—          Probable    Possible    Unprobable, possible    —

C. Summary Table of ADRs

n    No ADR    ADR
6    1         3

D. Summary Table of DLTs

n    No DLT    DLT
6    3         1

B. MTD Estimation
The estimation of the MTD has been part of most search designs from Section III, and an estimate MTD̂ often resulted directly from the stopping criterion. In TER, MTD̂ is by definition the dose level next lower to the unacceptable dose level at which a predefined proportion of patients (e.g., 33%) experienced DLT. An estimate of a standard error of the toxicity rate at the chosen dose level is impaired by the small number of cases (≤ 6) and also by the design. A general method for analyzing dose–toxicity data is the logistic regression of the Yj on the actually applied doses x(j), j = 1, . . ., n, of all patients treated in the trial. This disregards any dependency of the dose–toxicity data on the design that created the data and may therefore be biased: if the sampling is forced to choose doses below the true MTD, MTD̂ may be biased toward lower values. The logistic regression takes all observed data (yj, x(j)), j = 1, . . ., n, without prejudice, assuming that they are independently sampled from the patient population and that the toxic responses per dose are identically distributed. The logistic model for quantal response is therefore given by

P(Yj = 1 | x(j)) = {1 + exp[−(a0 + a1 x(j))]}⁻¹    (23)
Standard logistic regression provides the maximum likelihood estimate of a = (a0, a1) (22,60,61). MTDθ is estimated as the θ percentile of a tolerance distribution,

MTD̂θ = [logit(θ) − â0]/â1 = [−ln 2 − â0]/â1    (24)

if, e.g., θ = 0.33. The large-sample variance of MTD̂θ is given by

V = [Va0 + 2 MTD̂θ Va0a1 + MTD̂θ² Va1]/â1²    (25)

where (Va0, Va0a1, Va1) denotes the asymptotic variance–covariance matrix of the model parameter vector. Confidence limits can be obtained by the delta method, Fieller's theorem, or the likelihood ratio test (37). For the dose–toxicity model

ψ(x, a) = F((x − a0)/a1)    (26)

the corresponding estimate is

MTD̂θ = â1 F⁻¹(θ) + â0    (27)
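Equations (24) and (25) translate directly into code. The coefficient estimates and covariances below are hypothetical values chosen only to show the mechanics, not results from a real trial:

```python
import math

def mtd_estimate(a0, a1, theta=1 / 3):
    """Eq. (24): MTD_theta = (logit(theta) - a0) / a1; logit(1/3) = -ln 2."""
    return (math.log(theta / (1 - theta)) - a0) / a1

def mtd_variance(a0, a1, va0, va0a1, va1, theta=1 / 3):
    """Eq. (25): large-sample (delta-method) variance of the MTD estimate."""
    m = mtd_estimate(a0, a1, theta)
    return (va0 + 2 * m * va0a1 + m * m * va1) / a1 ** 2

# Hypothetical ML estimates from a fitted logistic dose-toxicity model:
a0_hat, a1_hat = -4.0, 0.02
m = mtd_estimate(a0_hat, a1_hat)  # about 165.3 dose units
se = math.sqrt(mtd_variance(a0_hat, a1_hat, va0=1.2, va0a1=-0.01, va1=1e-4))
```

The large standard error relative to the point estimate is typical of the limited precision of phase I samples noted in Ref. 13.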
C. Pharmacokinetic Phase I Data Analysis

An often neglected but important secondary objective of a phase I trial is the assessment of the distribution and elimination of the drug in the body. Specific parameters that describe the pharmacokinetics of the drug are the absorption and elimination rates, the drug half-life, the peak concentration, and the area under the time–drug concentration curve (AUC). Drug concentration measurements ci(tr) of patient i at time tr are usually obtained from blood samples (additionally also from urine samples) taken regularly during medication and are analyzed using pharmacokinetic models (62). One- and two-compartment models have been used to estimate the pharmacokinetic characteristics, often in a two-step approach: first for the individual kinetics of each patient and then for the patient population using population kinetic models. In practice, statistical methodology for pharmacokinetic data analysis is primarily based on nonlinear curve fitting using least-squares methods or their extensions (63). Subsequent to the criticism that the traditional methods require too large a number of steps before reaching the MTD, the use of pharmacokinetic information to reduce this number of steps was suggested (24,64). A pharmacokinetically guided dose escalation (PGDE) was therefore proposed (24,65), based on the equivalence of drug blood levels in mice and humans and on the pharmacodynamic hypothesis that equal toxicity is caused by equal drug plasma levels. It postulates that the DLT is determined by plasma drug concentrations and that AUC is a measure that holds across species (64). The AUC calculated at the
MTD for humans was found to be fairly equal to the AUC for mice calculated at the LD10 (in mg/m² equivalents, MELD10). Therefore, AUC(LD10, mouse) was considered as a target AUC, and the ratio

F = AUC(LD10, mouse) / AUC(starting dose, human)    (28)
was used to define the range of dose escalation. One tenth of MELD10 is usually taken as the starting dose x1. Two variants have been proposed. In the square root method, the first step from the starting dose x1 to the next dose x2 is equal to the geometric mean of x1 and the target dose x1F, i.e., x2 = √(x1 · x1F) = x1√F. Subsequent dose escalation continues with the MFDE. In the extended factors-of-two method, the first steps are achieved by a doubling dose scheme as long as 40% of F has not been attained; then one continues with the MFDE. Potential problems and pitfalls were discussed (66). Although usage and further evaluation were encouraged and guidelines for its conduct were proposed, the PGDE has not often been used later on in practice. For a more recent discussion and appraisal, see Newell (67).
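A numeric sketch of Eq. (28) and the square root method's first step; all AUC values below are invented for illustration:

```python
import math

def pgde_first_step(auc_mouse_ld10, auc_human_start, x1):
    """Return (F, x2): the escalation range factor of Eq. (28) and the
    first escalated dose of the square root method, i.e., the geometric
    mean of x1 and the target dose x1 * F. Escalation then continues
    with the MFDE."""
    F = auc_mouse_ld10 / auc_human_start
    x2 = math.sqrt(x1 * (x1 * F))  # = x1 * sqrt(F)
    return F, x2

# x1 = one tenth of MELD10, say 10 mg/m^2; hypothetical AUCs give F = 16:
F, x2 = pgde_first_step(auc_mouse_ld10=40.0, auc_human_start=2.5, x1=10.0)
print(F, x2)  # 16.0 40.0
```

With F = 16, a single square-root step already quadruples the dose, which illustrates how PGDE was intended to shorten the escalation phase.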
VI. REGULATIONS BY GOOD CLINICAL PRACTICE AND INTERNATIONAL CONFERENCE ON HARMONIZATION (ICH)

According to the ICH Harmonized Tripartite Guideline, General Considerations for Clinical Trials, recommended for adoption at Step 4 on 17 July 1997 (16), a phase I trial is most typically a study of human pharmacology ‘‘of the initial administration of an investigational new drug into humans.’’ It is considered as having nontherapeutic objectives ‘‘to determine the tolerability of the dose range expected to be needed for later clinical trials and to determine the nature of adverse reactions that can be expected.’’ The definition of pharmacokinetics and pharmacodynamics is seen as a major issue, and the study of activity or potential therapeutic benefit as a preliminary and secondary objective. Rules for the planning and conduct of phase I trials are specifically addressed by the European Agency for the Evaluation of Medicinal Products and its Note for Guidance on the Evaluation of Anticancer Medicinal Products (18 March 1997, CPMP/EWP/205/25). Primary objectives are

1. The determination of the MTD,
2. The characterization of frequent ‘‘side effects’’ of the agent and their dose–response parameters,
3. The determination of relevant main pharmacokinetic parameters.
Single intravenous dosing every 3–4 weeks is recommended if nothing else is suggested. The starting dose is not fixed. Dose escalation is oriented toward the MFDE or the PGDE scheme. A minimum of two cycles at the same dose level is preferred. The number of patients per dose level varies around n = 3, with an increase to six in case of overt toxicity. General guidelines for obtaining and using dose–response information for drug registration are given in the ICH Tripartite Guideline Dose-Response Information to Support Drug Registration (recommended for adoption at Step 4 on 10 March 1994), which covers to some extent studies with patients suffering from life-threatening diseases such as cancer. Based on these regulations and extending earlier suggestions (6,7), a Phase I Trial Protocol may be organized as follows:

Objectives and Preclinical Background
Clinical Background
Eligibility and In-/Exclusion of Patients
Treatment
Dose-Limiting Toxicity and MTD
Dose Levels and Dose Escalation Design
Number of Patients per Dose Level
End Points and Longitudinal Observations (Toxicity, Response)
Termination of the Study
References
Appendices on:
  Case Report Forms
  Important Definitions
  Common Toxicity Criteria
  Informed Consent Form(s)
  Declaration of Helsinki
VII. DISCUSSION AND OPEN PROBLEMS

Phase I trials are by their primary goal dose-finding studies, but they are performed under the twofold paradigm of the existence of monotone dose–toxicity and dose–benefit relationships. The standard paradigm has been as follows: start at a low dose that is likely to be safe and treat small cohorts of patients at progressively higher doses until drug-related toxicity reaches some predefined level of maximum toxicity or until unexpected and unacceptable toxicity occurs. The inherent ethical problem, that a treatment is given for which risks and benefits are unknown and that presumably is applied at a suboptimal or inactive dose, has to be accounted for by a design that treats each patient, when entering the trial, at the maximal dose known to be safe at that time. Phase I trial designs have arisen
rather empirically, without strong statistical foundation and with limited estimation precision (13). The rather opaque appearance of the Fibonacci scheme in phase I research seems symptomatic of this situation. The two main constituents of a phase I trial are the action space of the dose levels, including the choice of the starting dose, and the dose escalation scheme. Different from dose finding in bioassays, phase I trials have smaller sample sizes, take longer to observe the end point, and are strictly constrained by the ethical requirement of choosing doses conservatively. Therefore, standard approaches such as unrestricted UaD and SAM designs are not applicable. Further complexity arises from population heterogeneity, subjectivity in judging toxicity, and censoring because of early drop-out. Phase I trials are not completely restricted to new drugs but may also be conducted for new schedules or for new formulations (e.g., packaged with liposomes). This challenges the use of prior information. For the use of drug combinations, see Ref. 68. Dose escalation was presented in broad detail, ranging from the more pragmatic standard rules (S/TER) to partial and full Bayesian methods. A comprehensive comparison of all the proposed methods is, despite a number of simulation results mentioned in Section III.B.5, not available at present. All attempts to validate a design and to compare it with others have to use simulations, because a trial can be performed only once with one escalation scheme and cannot be ‘‘reperformed.’’ Therefore, all comparisons can only be as valid as the setup of the simulation study is able to cover clinical reality and complexity. Nevertheless, there is evidence from simulations and from clinical experience that the standard designs are too conservative in terms of underdosing and needless prolongation.
At the same time, the pure Bayesian and the CRM rules fall short because their optimality is tied to treating one patient after the other, which may prolong a trial even more, and because they run the risk of treating too many patients at toxic levels (for further criticism, see 44). This dilemma has motivated a large number of modifications of both the S/TER and the CRM (lowering the starting dose and restricting the dose escalation to one step, to find a way between conservatism and anticonservatism). This has restricted the Bayesian dynamic of the CRM considerably, but there seems to be an advantage to using a modified CRM. A driving force in the search for new designs, and one of the most serious objections against the standard design, was the argument that as many patients as possible should be treated at high dose levels, ideally near the true MTD, so that they may have a therapeutic benefit from the experimental drug. This argument is absolutely correct, and given that the new experimental drug is finally proved to be efficient (e.g., the taxanes for breast cancer), it seems empirically obvious, yet it is a post hoc argument. One has to be reminded that efficacy is not the primary aim of a phase I study, which is dose finding. The percentage of patients entered into phase I trials who benefited from that treatment has rarely
been estimated in serious studies. Rough estimates give response rates, mostly partial responses only, in the range of a few percent. Nineteen responses (3.1%) were recorded among 610 patients in the 3-year review (69) of 23 trials of the MD Anderson Cancer Center from 1991 to 1993. Therefore, it would be unrealistic to expect a therapeutic benefit even at higher dose levels for most patients, and the impact of treating at high doses should not be overemphasized. Nevertheless, the ethical concern remains to do the best possible even in a situation with a very low chance of benefit. Phase I trials are weak in terms of generalization to a larger patient population because only a small number of selected patients is treated, under special circumstances, by specialists. Therefore, drug development programs usually implement more than one phase I trial for the same drug. This poses ethical concerns, however, given the premise that the phase I trial is the one in which patients are treated with the drug for the first time. Mostly, repeated phase I trials use variations of the schedule and administration, but as long as there is no direct information exchange among those trials with respect to the occurrence of toxicity, ethical concerns remain. The repetition of phase I trials has not found much consideration in the past. Further concern arises if patients are selected for inclusion into the trial (12), for example, by including less pretreated cases at higher doses because of the fear of serious toxicities. Those concerns are nourished by the seemingly missing possibility of randomization. Given multicentricity and enough patients per center, some restricted type of randomization may be feasible and should be considered, for example, randomly choosing the center for the first patient at the next dose level. Interestingly, the very early phase I trial of DeVita et al. (32) from 1965 used a randomized approach.
Further improvement has been sought by increasing the sample size of phase I trials and so increasing the information (13). The definition of the toxicity criteria and their assessment rules lie mostly beyond the statistical methodology but are perhaps more crucial than any other means during the conduct of the trial. Care has to be taken that no changes in the assessment of toxicity occur progressively during the trial. Statistical estimation of the MTD has been restricted above to basic procedures, leaving aside the question of estimation bias when neglecting the design. Babb et al. (41) noted that they can estimate the MTD using a different prior than that used in the design and refer to the work of others who suggested using a Bayesian scheme for the design and a maximum likelihood estimate of the MTD. The MTD estimation above was restricted to a qualitative and at most categorical outcome measure. Laboratory toxicity is, however, available on a quantitative scale, and this information could be used for the estimation of an MTD (70). Mick and Ratain (71) therefore proposed ln(y) = a0 + a1x + a2 ln(z) as a dose–toxicity model, where y was a quantitative toxicity of myelosuppression (nadir of the white blood cell count) and z was a covariate (pretreatment white blood cell
count). Further research and practical application are needed to corroborate the findings. Extensions of the use of the toxicity grades 1–4 were addressed briefly in Section III.B.4. One may be tempted to define DLT specifically for different toxicity grades, with the aim of a grade-specific MTD, also using a grade-specific acceptable tolerability θ. That would allow, for example, a tolerability θ2 for grade 2 toxicity higher than the tolerability θ3 for grade 3 toxicity. This would demand an isotonic relationship such that any grade 3 toxicity is preceded by a grade 2 toxicity and P(DLT2 | x) > P(DLT3 | x) for all doses x. A multigrade dose escalation scheme was proposed (72) that allows escalation and reduction of dosage using knowledge of the grade of toxicity. Depending on the severity of toxicity, up to six dose escalations and three stages of recruiting one, three, or six patients at each dose level were planned. The authors showed that the multigrade design was superior to the standard design and that it could compete successfully with the two-stage designs, but it was not uniformly better. They also admitted that a multigrade design is harder to comprehend and use, and to my knowledge that design has never been used in practice, although it may have influenced recent research in titration designs. A silent assumption in this work was that the cohort of nk (e.g., three) patients scheduled for one dose level x[k] are treated more or less in parallel and that the toxicity results become available all at once when that kth stage has been finished. This situation may be rather rare in practice, and staggered information by treatment course and patient may be the rule. If so, how can the cumulative increase of information on the dose level x[k] be used to interfere with the dosing of the current patients or to decide on the dosing of the next cohort?
Except for the dose-titration design in Section IV.B, no formal methods have been developed, to my knowledge, that would account for such an overlap of information. The multiplicity of courses, and even more the multiplicity of toxicities assessed with the CTC scheme, needs further research. Bayesian approaches may be most promising for dealing with this difficult problem. To comply with clinical practice, such a design should be flexible enough to allow both staggered entry distributed over 1–2 months and parallel entry within a few weeks.
ACKNOWLEDGMENTS

This work could not have been done without the long-standing, extremely fruitful cooperation with the Phase I/II Study Group of the AIO in the German Cancer Society and the work done there with Wolfgang Queißer, Axel Hanauske, E.-D. Kreuser, and Heiner Fiebig. I also owe thanks for the support and statistical input from the Biometry Departments of Martin Schumacher (Freiburg) and Michael Schemper (Vienna), from Harald Heinzl (Vienna), and from my next-door colleague Annette Kopp-Schneider. For technical assistance I thank Gudrun Friedrich for analyses and
Regina Grunert and Renate Rausch for typing and the bibliography. Finally, I thank John Crowley for all the help and encouragement.
REFERENCES

1. Simon RM. A decade of progress in statistical methodology for clinical trials. Stat Med 1991; 10:1789–1817.
2. World Medical Association. Declaration of Helsinki (http://www.aix-scientifics.com/).
3. Carter SK. Study design principles for the clinical evaluation of new drugs as developed by the chemotherapy programme of the National Cancer Institute. In: Staquet MJ, ed. The Design of Clinical Trials in Cancer Therapy. Brussels: Editions Scient Europ, 1973:242–289.
4. Schwartsmann G, Wanders J, Koier IJ, et al. EORTC New Drug Development Office Coordinating and Monitoring Programme for Phase I and II Trials with new anticancer agents. Eur J Cancer 1991; 27:1162–1168.
5. Spreafico F, Edelstein MB, Lelieveld P. Experimental bases for drug selection. In: Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1984:193–209.
6. Carter SK, Bakowski MT, Hellmann K. Clinical trials in cancer chemotherapy. In: Carter SK, Bakowski MT, Hellmann K, eds. Chemotherapy of Cancer, 3rd ed. New York: John Wiley, 1987:29–31.
7. EORTC New Drug Development Committee. EORTC guidelines for phase I trials with single agents in adults. Eur J Cancer Clin Oncol 1985; 21:1005–1007.
8. Bodey GP, Legha SS. The phase I study: general objectives, methods and evaluation. In: Muggia FM, Rozencweig M, eds. Clinical Evaluation of Antitumor Therapy. Dordrecht, The Netherlands: Nijhoff, 1987:153–174.
9. Leventhal BG, Wittes RE. Phase I trials. In: Leventhal BG, Wittes RE, eds. Research Methods in Clinical Oncology. New York: Raven Press, 1988:41–59.
10. Schneiderman MA. Mouse to man: statistical problems in bringing a drug to clinical trial. In: Proc 5th Berkeley Symp Math Statist Prob, Vol. 4. Berkeley: University of California Press, 1967:855–866.
11. Von Hoff DD, Kuhn J, Clark GM. Design and conduct of phase I trials. In: Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1984:210–220.
12. Ratain MJ, Mick R, Schilsky RL, Siegler M. Statistical and ethical issues in the design and conduct of phase I and II clinical trials of new anticancer agents. J Nat Cancer Inst 1993; 85:1637–1643.
13. Christian MC, Korn EL. The limited precision of phase I trials. J Nat Cancer Inst 1994; 86:1662–1663.
14. Kerr DJ. Phase I clinical trials: adapting methodology to face new challenges. Ann Oncol 1994; 5:S67–S70.
15. Pinkel D. The use of body surface area as a criterion of drug dosage in cancer chemotherapy. Cancer Res 1958; 18:853–856.
16. International Conference on Harmonisation. Guideline for Good Clinical Practice (http://ctep.info.nih.gov/ctc3/ctc.htm).
17. World Health Organization (WHO). WHO Handbook for Reporting Results of Cancer Treatment. WHO Offset Publication No. 48. Geneva: WHO, 1979.
18. Kreuser ED, Fiebig HH, Scheulen ME, et al. Standard operating procedures and organization of German Phase I, II, and III Study Groups, New Drug Development Group (AWO), and Study Group of Pharmacology in Oncology and Hematology (APOH) of the Association for Medical Oncology (AIO) of the German Cancer Society. Onkol 1998; 21(suppl 3):1–22.
19. Mick R, Lane N, Daugherty C, Ratain MJ. Physician-determined patient risk of toxic effects: impact on enrollment and decision making in phase I trials. J Nat Cancer Inst 1994; 86:1685–1693.
20. Franklin HR, Simonetti GPC, Dubbelman AC, et al. Toxicity grading systems. A comparison between the WHO scoring system and the common toxicity criteria when used for nausea and vomiting. Ann Oncol 1994; 5:113–117.
21. Brundage MD, Pater JL, Zee B. Assessing the reliability of two toxicity scales: implications for interpreting toxicity data. J Nat Cancer Inst 1993; 85:1138–1148.
22. Morgan BJT. Analysis of Quantal Response Data. London: Chapman & Hall, 1992.
23. O'Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics 1990; 46:33–48.
24. Collins JM, Zaharko DS, Dedrick RL, Chabner BA. Potential roles for preclinical pharmacology in phase I clinical trials. Cancer Treat Rep 1986; 70:73–80.
25. Goldsmith MA, Slavik M, Carter SK. Quantitative prediction of drug toxicity in humans from toxicology in small and large animals. Cancer Res 1975; 35:1354–1364.
26. Hansen HH. Clinical experience with 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (CCNU, NSC-79037). Proc Am Assoc Cancer Res 1970; 11:87.
27. Muggia FM. Phase I study of 4′-demethyl-epipodophyllotoxin-β-d-thenylidene glycoside (PTG, NSC-122819). Proc Am Assoc Cancer Res 1970; 11:58.
28. Carter SK. Clinical trials in cancer chemotherapy. Cancer 1977; 40:544–557.
29. Hansen HH, Selawry OS, Muggia FM, Walker MD. Clinical studies with 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (NSC-79037). Cancer Res 1971; 31:223–227.
30. Bellman RE. Topics in pharmacokinetics. III. Repeated dosage and impulse control. Math Biosci 1971; 12:1–5.
31. Bellman RE. Dynamic Programming. Princeton: Princeton University Press, 1962.
32. DeVita VT, Carbone PP, Owens AH, Gold GL, Krant MJ, Edmonson J. Clinical trials with 1,3-bis(2-chloroethyl)-1-nitrosourea, NSC-409962. Cancer Treat Rep 1965; 25:1876–1881.
33. Goldin A, Carter S, Homan ER, Schein PS. Quantitative comparison of toxicity in animals and man. In: Staquet MJ, ed. The Design of Clinical Trials in Cancer Therapy. Brussels: Editions Scient Europ, 1973:58–81.
34. Penta JS, Rozencweig M, Guarino AM, Muggia FM. Mouse and large-animal toxicology studies of twelve antitumor agents: relevance to starting dose for phase I clinical trials. Cancer Chemother Pharmacol 1979; 3:97–101.
35. Rozencweig M, Von Hoff DD, Staquet MJ, et al. Animal toxicology for early clinical trials with anticancer agents. Cancer Clin Trials 1981; 4:21–28.
36. Willson JKV, Fisher PH, Tutsch K, Alberti D, Simon K, Hamilton RD, Bruggink J, Koeller JM, Tormey DC, Earhardt RH, Ranhosky A, Trump DL. Phase I clinical trial of a combination of dipyridamole and acivicin based upon inhibition of nucleoside salvage. Cancer Res 1988; 48:5585–5590.
37. Storer BE. Design and analysis of phase I clinical trials. Biometrics 1989; 45:925–937.
38. Durham SD, Flournoy N, Rosenberger WF. A random walk rule for phase I clinical trials. Biometrics 1997; 53:745–760.
39. Dixon WJ, Mood AM. A method for obtaining and analyzing sensitivity data. J Am Statist Assoc 1948; 43:109–126.
40. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951; 22:400–407.
41. Babb J, Rogatko A, Zacks S. Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med 1998; 17:1103–1120.
42. O'Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.
43. Shen LZ, O'Quigley J. Consistency of continual reassessment method under model misspecification. Biometrika 1996; 83:395–405.
44. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon RM. A comparison of two phase I trial designs. Stat Med 1994; 13:1799–1806.
45. Faries D. Practical modifications of the continual reassessment method for phase I clinical trials. J Biopharm Stat 1994; 4:147–164.
46. Moller S. An extension of the continual reassessment methods using a preliminary up-and-down design in a dose finding study in cancer patients, in order to investigate a greater range of doses. Stat Med 1995; 14:911–922.
47. Goodman SN, Zahurak ML, Piantadosi S. Some practical improvements in the continual reassessment method for phase I studies. Stat Med 1995; 14:1149–1161.
48. Ahn C. An evaluation of phase I cancer clinical trial designs. Stat Med 1998; 17:1537–1549.
49. Hanauske AR, Edler L. New clinical trial designs for phase I studies in hematology and oncology: principles and practice of the continual reassessment model. Onkol 1996; 19:404–409.
50. Whitehead J, Brunier H. Bayesian decision procedures for dose determining experiments. Stat Med 1995; 14:885–893.
51. Gatsonis C, Greenhouse JB. Bayesian methods for phase I clinical trials. Stat Med 1992; 11:1377–1389.
52. Rademaker AW. Sample sizes in phase I toxicity studies. ASA Proc Biopharm Sect. Alexandria, VA: American Statistical Association, 1989:137–141.
53. Geller NL. Design of phase I and II clinical trials in cancer: a statistician's view. Cancer Invest 1984; 2:483–491.
54. Edler L. Modeling and computation in pharmaceutical statistics when analysing drug safety. In: Kitsos CP, Edler L, eds. Contributions to Statistics. Heidelberg: Physica, 1997:221–232.
55. Storer BE. Small-sample confidence sets for the MTD in a phase I clinical trial. Biometrics 1993; 49:1117–1125.
56. Simon RM, Freidlin B, Rubinstein LV, Arbuck SG, Collins J, Christian MC. Accelerated titration designs for phase I clinical trials in oncology. J Nat Cancer Inst 1997; 89:1138–1147.
57. Thall PF, Russell KE. A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics 1998; 54:251–264.
58. McCullagh P. Regression models for ordinal data. J R Stat Soc B 1980; 42:109–142.
59. Benner A, Edler L, Hartung G. SPLUS support for analysis and design of phase I trials in clinical oncology. In: Millard S, Krause A, eds. SPLUS in Pharmaceutical Industry. New York: Springer, 2000.
60. Bliss CI. The method of probits. Science 1934; 79:409–410.
61. Finney DJ. Statistical Method in Biological Assay. London: C. Griffin, 1978.
62. Gibaldi M, Perrier D. Pharmacokinetics. New York: Marcel Dekker, 1982.
63. Edler L. Computational statistics for pharmacokinetic data analysis. In: Payne R, Green P, eds. COMPSTAT. Proceedings in Computational Statistics. Heidelberg: Physica, 1998:281–286.
64. Collins JM. Pharmacology and drug development. J Nat Cancer Inst 1988; 80:790–792.
65. Collins JM, Grieshaber CK, Chabner BA. Pharmacologically guided phase I clinical trials based upon preclinical drug development. J Nat Cancer Inst 1990; 82:1321–1326.
66. EORTC Pharmacokinetics and Metabolism Group. Pharmacokinetically guided dose escalation in phase I clinical trials. Commentary and proposed guidelines. Eur J Cancer Clin Oncol 1987; 23:1083–1087.
67. Newell DR. Pharmacologically based phase I trials in cancer chemotherapy. New Drug Ther 1994; 8:257–275.
68. Korn EL, Simon RM. Using the tolerable-dose diagram in the design of phase I combination chemotherapy trials. J Clin Oncol 1993; 11:794–801.
69. Smith TL, Lee JJ, Kantarjian HM, Legha SS, Raber MN. Design and results of phase I cancer clinical trials: three-year experience at M.D. Anderson Cancer Center. J Clin Oncol 1996; 14:287–295.
70. Egorin MJ. Phase I trials: a strategy of ongoing refinement. J Nat Cancer Inst 1990; 82:446–447.
71. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in phase I clinical trials: evidence for increased precision. J Nat Cancer Inst 1993; 85:217–223.
72. Gordon NH, Willson JKV. Using toxicity grades in the design and analysis of cancer phase I clinical trials. Stat Med 1992; 11:2063–2075.
2
Dose-Finding Designs Using Continual Reassessment Method

John O'Quigley
University of California at San Diego, La Jolla, California
I. CONTINUAL REASSESSMENT METHOD
The continual reassessment method (CRM), as a tool for carrying out phase I clinical trials in cancer, has gained greatly in popularity in recent years. Here I describe the basic ideas behind the method, some important technical considerations, the properties of the method, and the possibilities for substantial generalization: the use of graded information on toxicities, the incorporation of a stopping rule leading to further reductions in sample size, the incorporation of information on patient heterogeneity, the incorporation of pharmacokinetics, and the possibility of modeling within-patient dose escalation. At the time of writing, few of these generalizations have been studied in any depth, although it seems clear that the CRM provides a structure around which such further developments can be carried out.
A. Motivation

The precise goals of a phase I dose-finding study in cancer have not always been clearly defined. The absence of such definitions and the lack of clinically motivated exigencies have led to the use of a number of schemes, in particular the up and down scheme, recalled in a broad review by Storer (28), having properties
that can be considered undesirable in certain applications. This consideration underscored the development of a different approach to such studies, the CRM (15), in which the design was constructed to respond to specific requirements of the phase I clinical investigation in cancer. These requirements are the following:

1. We should minimize the number of undertreated patients, i.e., patients treated at unacceptably low dose levels.
2. We should minimize the number of patients treated at unacceptably high dose levels.
3. We should minimize the number of patients needed to complete the study (efficiency).
4. The method should respond quickly to inevitable errors in initial guesses, rapidly escalating in the absence of indication of drug activity (toxicity) and rapidly de-escalating in the presence of unacceptably high levels of observed toxicity.
Before describing just how the CRM meets these requirements, let us first look at the requirements themselves in the context of cancer dose-finding studies. Most phase I cancer clinical trials are carried out on patients for whom all currently available therapies have failed. There will always be hope in the therapeutic potential of the new experimental treatment, but such hope is invariably tempered by the almost inevitable life-threatening toxicity accompanying the treatment. Given that candidates for these trials have no other treatment options, their inclusion appears contingent on maintaining some acceptable degree of control over the toxic side effects and trying to maximize treatment efficacy (which translates as dose). Too high a dose, although offering in general better hope for treatment effect, will be accompanied by too high a probability of encountering unacceptable toxicity. Too low a dose, although avoiding this risk, may offer too little chance of seeing any benefit at all. Given this context, requirements 1 and 2 appear immediate. The third requirement, a concern for all types of clinical studies, becomes of paramount importance here, where very small sample sizes are inevitable. This is because of the understandable desire to proceed quickly with a potentially promising treatment to the phase II stage. At the phase II stage the probability of observing treatment efficacy is almost certainly higher than that for the phase I population of patients. We have to do the very best we can with the relatively few patients available, and the statistician involved in such studies should also provide some idea of the error of our estimates, reflecting the uncertainty of final recommendations based on such small samples. The fourth requirement is not an independent requirement and can be viewed as a partial re-expression of requirements 1 and 2.
Taken together, the requirements point toward a method where we converge quickly to the correct level, the correct level being defined as the one having a probability of toxicity as close as possible to some value θ. The value θ is chosen by the investigator such that he or she considers probabilities of toxicity higher than θ to be unacceptably high, whereas those lower than θ are unacceptably low in that they indicate, indirectly, the likelihood of too weak an antitumor effect. Figure 1 illustrates the comparative behavior of CRM with a fixed-sample up and down design in which level 7 is the correct level.

Figure 1 Typical trial histories. (Top) CRM, (bottom) standard design.

How does CRM work? The essential idea is close to that of stochastic approximation, the main differences being the use of a nonlinear underparameterized model, belonging to a particular class of models, and a small number of discrete dose levels rather than a continuum. Figure 2 provides some insight into how the method behaves after a sufficient number of patients have been included in the study. The true unknown dose–toxicity curve is not well estimated overall by the underparameterized dose–toxicity curve taken from the CRM class, but, at the point of main interest, the targeted level, the two curves nearly coincide. CRM will not do well if the task is to estimate the overall dose–toxicity curve. It generally does very well in addressing the more limited goal: to identify some single chosen percentile of this unknown curve.

Figure 2 Fit of the working model.

Patients enter sequentially. The working dose–toxicity curve, taken from the CRM class (described below), is refitted after each inclusion. The curve is then inverted to identify which of the available levels has an associated estimated probability as close as we can get to the targeted acceptable toxicity level. The next patient is then treated at this level. The cycle is continued until a fixed
number of subjects have been treated or until we apply some stopping rule (see Sect. V). Typical behavior is that shown in Fig. 1.

B. Operating Characteristics

The above paragraphs outline how CRM works. The technical details are provided below. After asking how CRM works, it is natural to ask how CRM behaves. The original article on the method by O'Quigley et al. (15) considered, as an illustration, a particularly simple model and how it worked when used to identify a target dose level having probability of toxicity as near as possible to 0.2. Simulations were encouraging and showed striking improvement over the standard design, both in terms of accuracy of final recommendation and in terms of concentrating as large a percentage as possible of studied patients close to the target level, thereby minimizing the number of overtreated and undertreated patients. A large sample study (25) showed that under some broad conditions, the level to which a CRM design converges will indeed be the closest to the target. As pointed out by Storer (30), large sample properties by themselves are not wholly convincing because, practically, we are inevitably faced with small to moderate sample sizes. Nonetheless, if any scheme fails to meet such basic statistical criteria as large sample convergence, we need to investigate its finite sample properties with great care. The tool to use here is mostly that of simulation, although for the standard up and down schemes, the theory of Markov chains enables us to carry out exact probabilistic calculations (23,29). Whether Bayesian or likelihood based, once the scheme is under way (i.e., the likelihood is nonmonotone), it is readily shown that a nontoxicity always points in the direction of higher levels and a toxicity in the direction of lower levels, the absolute value of the change diminishing with the number of included patients.
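The estimation–allocation cycle just described can be sketched in a few lines. The following is a minimal likelihood-based sketch of my own (not the authors' software): it uses the one-parameter working model ψ(d_i, a) = α_i^a with the six-level skeleton and target θ = 0.2 quoted later in the chapter, refits a by a crude grid search, and inverts the fitted curve to pick the next level. The trial-history format is hypothetical.

```python
import math

# Skeleton alpha_i and target theta taken from O'Quigley et al. (15).
ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]
THETA = 0.2

def psi(level, a):
    """Working model psi(d_i, a) = alpha_i ** a, levels indexed from 1."""
    return ALPHA[level - 1] ** a

def log_likelihood(a, history):
    """history: list of (level, y), y = 1 for toxicity and 0 otherwise."""
    total = 0.0
    for level, y in history:
        p = psi(level, a)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def crm_recommendation(history):
    """Refit a by a crude grid search over (0, 5), then return the level
    whose estimated toxicity probability is closest to the target."""
    grid = [0.01 * k for k in range(1, 500)]
    a_hat = max(grid, key=lambda a: log_likelihood(a, history))
    return min(range(1, len(ALPHA) + 1),
               key=lambda i: abs(psi(i, a_hat) - THETA))
```

For instance, with a hypothetical history of one patient at level 1, two at level 2, and three at level 3, of whom the last had a toxicity, the refit keeps the recommendation at level 3.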
For nonmonotone likelihood it is impossible to be at some level, observe a toxicity, and then have the model recommend a higher level, as claimed by some authors (see Sect. VI.B), unless pushed in such a direction by a strong prior. Furthermore, when targeting lower percentiles such as 0.2, it can be calculated, and follows our intuition, that a toxicity, occurring with a frequency a factor of 4 less than that for the nontoxicities, will have a much greater impact on the likelihood or posterior density. This translates directly into an operating characteristic whereby model-based escalation is relatively cautious and de-escalation more rapid, particularly early on, when little information is available. In the model and examples of O'Quigley et al. (15), dose levels could never be skipped when escalating. However, if the first patient, treated at level 3, suffered a toxic side effect, the method skipped when de-escalating, recommending level 1 for the subsequent two entered patients before, assuming no further toxicities were seen, escalating to level 2. Simulations in O'Quigley et al. (15), O'Quigley and Chevret (16), Goodman et al. (9), Korn et al. (11), and O'Quigley (1999) show the operating characteristics of CRM to be good in terms of accuracy of final recommendation, while simultaneously minimizing the numbers of overtreated and undertreated patients. However, violation of the model requirements and allocation principle of CRM, described in the following section, can have a negative, possibly disastrous, effect on operating characteristics. Chevret (6, situation 6 in Table 2) used a model, failing the conditions outlined in Section II.A, that resulted in never recommending the correct level, a performance worse than we would achieve by random guessing. Goodman et al. (9) and Korn et al. (11) worked with this same model, and their results, already favorable to CRM, would have been yet more favorable had a model not violating the basic requirements been used (19). Both Faries (7) and Moller (13) assigned early patients to levels other than those indicated by the model, leading to large skips in dose allocation, in one case skipping nine levels after the inclusion of a single patient. We return to this in Section VI.
II. TECHNICAL ASPECTS

The aim of CRM is to locate the most appropriate dose, the so-called target dose, the precise definition of which is provided below. This dose is taken from some given range of available doses. The problem of dose spacing for single drug combinations, often addressed via a modified Fibonacci design, is beyond the scope of CRM. The need to add doses may arise in practice when the toxicity frequency is deemed too low at one level but the next highest level is considered too toxic. CRM can help identify this situation, but as far as extrapolation or interpolation of dose is concerned, the relevant insights will come from pharmacokinetics. For our purposes we assume that we have available k fixed doses: d_1, . . . , d_k. These doses are not necessarily ordered in terms of the d_i themselves, in particular since each d_i may be a vector, being a combination of different treatments, but rather in terms of the probability R(d_i) of encountering toxicity at each dose d_i. It is important to be careful at this point since confusion can arise over the notation. The d_i, often multidimensional, describe the actual doses or combinations of doses being used. We assume monotonicity, by which we mean that the dose levels, equally well identified by their integer subscripts i (i = 1, . . . , k), are ordered such that the probability of toxicity at level i is greater than that at level i′ whenever i > i′. The monotonicity requirement, or the assumption that we can so order our available dose levels, is thus important. Currently, all the dose information required to run a CRM trial is contained in the dose levels. Without wishing to preclude the possibility of exploiting information contained in the doses d_i and not in the dose levels i, at present we lose no information when we replace d_i by i.
The actual amount of drug, so many mg/m² say, is therefore typically not used. For a single-agent trial (see 13), it is in principle possible to work with the actual dose. We do not advise this since it removes, without operational advantages, some of our modeling flexibility. For multidrug or treatment combination studies there is no obvious univariate measure. We work instead with some conceptual dose, increasing when one of the constituent ingredients increases and, under our monotonicity assumption, translating itself into an increase in the probability of a toxic reaction. Choosing the dose levels amounts to selecting levels (treatment combinations) such that the lowest level hopefully has an associated toxic probability less than the target and the highest level is possibly close to or higher than the target. The most appropriate dose, the "target" dose, is that dose having an associated probability of toxicity as close as we can get to the target "acceptable" toxicity θ. Values for the target toxicity level, θ, might typically be 0.2, 0.25, 0.3, or 0.35, although there are studies in which this can be as high as 0.4 (13). The value depends on the context and the nature of the toxic side effects. The dose for the jth entered patient, X_j, can be viewed as random, taking values x_j, most often discrete, in which case x_j ∈ {d_1, . . . , d_k}, but possibly continuous, where X_j = x, x ∈ R⁺. In light of the remarks of the previous two paragraphs we can, if desired, entirely suppress the notion of dose and retain only information pertaining to dose level. This is all we need, and we may prefer to write x_j ∈ {1, . . . , k}. Let Y_j be a binary random variable (0, 1), where 1 denotes a severe toxic response for the jth entered patient (j = 1, . . . , n). We model R(x_j), the true probability of toxic response at X_j = x_j, x_j ∈ {d_1, . . . , d_k} or x_j ∈ {1, . . . , k}, via

R(x_j) = Pr(Y_j = 1 | X_j = x_j) = E(Y_j | x_j) = ψ(x_j, a)

for some one-parameter model ψ(x_j, a).
For the most common case of a single homogeneous group of patients, we are obliged to work with an underparametrized model, notably a one-parameter model. Although a two-parameter model may appear more flexible, the sequential nature of CRM together with its aim to put the included patients at a single correct level means that we will not obtain information needed to fit two parameters. We are close to something like nonidentifiability. A likelihood procedure will be unstable and may even break down, whereas a two-parameter fully Bayesian approach (8,15,31) may work initially, although somewhat artificially, but behave erratically as sample size increases (see also Sect. VI). This is true even when starting out at a low or the lowest level, initially working with an up and down design for early escalation, before a CRM model is applied. Indeed, any design that ultimately concentrates all patients from a single group on some given level can fit no more than a single parameter without running into problems of consistency.
A. Model Requirements
The restrictions on ψ(x, a) were described by O'Quigley et al. (15). For given fixed x we require that ψ(x, a) be strictly monotonic in a. For fixed a we require that ψ(x, a) be monotonic increasing in x or, in the usual case of discrete dose levels d_i (i = 1, . . . , k), that ψ(d_i, a) > ψ(d_m, a) whenever i > m. The true probability of toxicity at x (i.e., whatever treatment combination has been coded by x) is given by R(x), and we require that for the specific doses under study (d_1, . . . , d_k) there exist values of a, say a_1, . . . , a_k, such that ψ(d_i, a_i) = R(d_i) (i = 1, . . . , k). In other words, our one-parameter model has to be rich enough to model the true probability of toxicity at any given level. We call this a working model since we do not anticipate a single value of a to work precisely at every level; that is, we do not anticipate a_1 = a_2 = · · · = a_k = a. Many choices are possible. We have obtained excellent results with the simple choice

ψ(d_i, a) = α_i^a  (i = 1, . . . , k)   (1)

where 0 < α_1 < · · · < α_k < 1 and 0 < a < ∞. For the six levels studied in the simulations by O'Quigley et al. (15), the working model had α_1 = 0.05, α_2 = 0.10, α_3 = 0.20, α_4 = 0.30, α_5 = 0.50, and α_6 = 0.70. In that article this was expressed a little differently in terms of conceptual dose d_i, where d_1 = −1.47, d_2 = −1.10, d_3 = −0.69, d_4 = −0.42, d_5 = 0.0, and d_6 = 0.42, obtained from a model in which

α_i = (tanh d_i + 1)/2  (i = 1, . . . , k)   (2)
The above "tanh" model was first introduced in this context by O'Quigley et al. (15), the idea being that (tanh x + 1)/2 increases monotonically from 0 to 1 as x increases from −∞ to ∞. This extra generality is not usually needed since attention is focused on the few fixed d_i. Note that, at least as far as maximum likelihood estimation is concerned (see Sect. III.A), working with model (1) is equivalent to working with a model in which α_i (i = 1, . . . , k) is replaced by α_i* (i = 1, . . . , k), where α_i* = α_i^m for any real m > 0. Thus, we cannot really attach any concrete meaning to the α_i. The spacing between adjacent α_i, however, will have an impact on operating characteristics. Working with real doses corresponds to using some fixed dose spacing, although not necessarily one with nice properties. The spacings chosen here have proved satisfactory in terms of performance across a broad range of situations. An investigation into how to choose the α_i with the specific aim of improving certain aspects of performance has yet to be carried out. Some obvious choices for a model can fail the above conditions, leading to potentially poor operating characteristics. The one-parameter logistic model ψ(x, a) = w/(1 + w), in which b is fixed and where w = exp(b + ax), can be seen to fail the above requirements (25). On the other hand, the less intuitive model obtained by redefining w so that w = exp(a + bx), b ≠ 0, belongs to the CRM class.
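The "rich enough" condition is easy to verify for model (1): for any true toxicity probability 0 < R(d_i) < 1 there is a value a_i = log R(d_i)/log α_i with ψ(d_i, a_i) = R(d_i). A small sketch, using the skeleton from the text; the "true" curve R_TRUE is hypothetical and for illustration only:

```python
import math

ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]   # skeleton from the text

def psi(level, a):
    return ALPHA[level - 1] ** a

# Hypothetical true dose-toxicity curve, for illustration only.
R_TRUE = [0.02, 0.06, 0.12, 0.22, 0.45, 0.65]

# Solve psi(d_i, a_i) = R(d_i) in closed form: a_i = log R(d_i) / log alpha_i.
a_levels = [math.log(r) / math.log(alpha) for r, alpha in zip(R_TRUE, ALPHA)]

for i, a_i in enumerate(a_levels, start=1):
    assert abs(psi(i, a_i) - R_TRUE[i - 1]) < 1e-12

# No single a reproduces the whole curve (the a_i differ), which is why
# (1) is only a "working" model.
assert len(set(a_levels)) > 1
```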
III. IMPLEMENTATION

Once a model has been chosen and we have data in the form of the set Ω_j = {y_1, x_1, . . . , y_j, x_j}, the outcomes of the first j experiments, we obtain estimates R̂(d_i) (i = 1, . . . , k) of the true unknown probabilities R(d_i) (i = 1, . . . , k) at the k dose levels (see below). The target dose level is that level having associated with it a probability of toxicity as close as we can get to θ. The dose or dose level x_j assigned to the jth included patient is such that

|R̂(x_j) − θ| < |R̂(d_i) − θ|  (i = 1, . . . , k; x_j ≠ d_i)

Thus, x_j is the closest level to the target level in the above precise sense. Other choices of closeness could be made, incorporating cost or other considerations. We could also weight the distance, for example, multiplying |R̂(x_j) − θ| by some constant greater than 1 when R̂(x_j) > θ. This would favor conservatism, such a design tending to experiment more often below the target than a design without weights. Similar ideas have been pursued by Babb et al. (5). The estimates R̂(x_j) are obtained from the one-parameter working model. Two questions dealt with in this section arise: how do we estimate R(x_j) on the basis of Ω_{j−1}, and how do we obtain the initial data, in particular since the first entered patient or group of patients must be treated in the absence of any data-based estimates of R(x_1)? Even though our model is underparametrized, leading us into the area of misspecified models, it turns out that standard procedures of estimation work. Some care is needed to show this, and we look at this in Section IV. The procedures themselves are described just below. Obtaining the initial data is partially described in these same sections, as well as being the subject of its own subsection on two-stage designs.
To decide, on the basis of available information and previous observations, the appropriate level at which to treat a patient, we need some estimate of the probability of toxic response at dose level d_i (i = 1, . . . , k). We would currently recommend use of the maximum likelihood estimator (17) described in Section III.A. The Bayesian estimator, developed in the original paper by O'Quigley et al. (15), will perform very similarly unless priors are strong. The use of strong priors in the context of an underparametrized and misspecified model may require deeper study. Bayesian ideas can nonetheless be very useful in addressing more complex questions such as patient heterogeneity and intrapatient escalation. We return to this in Section VII.
A. Maximum Likelihood Implementation

After the inclusion of the first j patients, the log-likelihood can be written as

L_j(a) = Σ_{ℓ=1}^{j} y_ℓ log ψ(x_ℓ, a) + Σ_{ℓ=1}^{j} (1 − y_ℓ) log(1 − ψ(x_ℓ, a))   (3)
and is maximized at a = â_j. Maximization of L_j(a) can easily be achieved with a Newton–Raphson algorithm or by visual inspection using some software package such as Excel. Once we have calculated â_j, we can next obtain an estimate of the probability of toxicity at each dose level d_i via R̂(d_i) = ψ(d_i, â_j) (i = 1, . . . , k). On the basis of this formula, the dose to be given to the (j + 1)th patient, x_{j+1}, is determined. We can also calculate an approximate 100(1 − α)% confidence interval for ψ(x_{j+1}, â_j) as (ψ_j⁻, ψ_j⁺), where

ψ_j⁻ = ψ{x_{j+1}, â_j + z_{1−α/2} v(â_j)^{1/2}},  ψ_j⁺ = ψ{x_{j+1}, â_j − z_{1−α/2} v(â_j)^{1/2}}

where z_α is the αth percentile of a standard normal distribution and v(â_j) is an estimate of the variance of â_j. For the model of Eq. (1), this turns out to be particularly simple, and we can write

v⁻¹(â_j) = Σ_{ℓ ≤ j, y_ℓ = 0} ψ(x_ℓ, â_j)(log α_ℓ)² / (1 − ψ(x_ℓ, â_j))²
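As a concrete illustration of the estimation step and the approximate confidence interval, here is a sketch of my own (the trial data are hypothetical; the log-likelihood is concave in a for this model, so a simple golden-section search suffices):

```python
import math
from statistics import NormalDist

# Skeleton from the text; the trial data used below are hypothetical.
ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]

def psi(level, a):
    return ALPHA[level - 1] ** a

def loglik(a, history):
    """Log-likelihood (3) for data given as (level, y) pairs."""
    return sum(y * math.log(psi(x, a)) + (1 - y) * math.log(1.0 - psi(x, a))
               for x, y in history)

def mle(history, lo=1e-3, hi=20.0, tol=1e-8):
    """Golden-section search for the maximizer of the concave log-likelihood."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    b, c = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if loglik(b, history) > loglik(c, history):
            hi, c = c, b
            b = hi - g * (hi - lo)
        else:
            lo, b = b, c
            c = lo + g * (hi - lo)
    return (lo + hi) / 2.0

def confidence_interval(history, next_level, alpha=0.05):
    """Approximate 100(1 - alpha)% interval for psi(x_{j+1}, a_hat)."""
    a_hat = mle(history)
    # v^{-1}(a_hat): sum over the nontoxic responses, as in the text.
    inv_v = sum(psi(x, a_hat) * math.log(ALPHA[x - 1]) ** 2
                / (1.0 - psi(x, a_hat)) ** 2
                for x, y in history if y == 0)
    se = math.sqrt(1.0 / inv_v)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # psi is decreasing in a, so a_hat + z*se gives the lower limit.
    return psi(next_level, a_hat + z * se), psi(next_level, a_hat - z * se)
```

With roughly 16 hypothetical patients concentrated near one level, the interval already lies well inside (0, 1), in line with the text's remark that these intervals are quite accurate even for sample sizes as small as 16.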
Although based on a misspecified model, these intervals turn out to be quite accurate, even for sample sizes as small as 16, and thus helpful in practice (14). A requirement for being able to maximize the log-likelihood on the interior of the parameter space is that we have heterogeneity among the responses, that is, at least one toxic and one nontoxic response (27). Otherwise, the likelihood is maximized on the boundary of the parameter space and our estimates of R(d_i) (i = 1, . . . , k) are trivially either zero or one or, depending on the model we are working with, may not even be defined. Thus, the experiment is not considered to be fully underway until we have some heterogeneity in the responses. This could arise in a variety of different ways: use of the standard up and down approach, use of an initial Bayesian CRM as outlined below, or use of a design believed to be more appropriate by the investigator. Once we have achieved heterogeneity, the model kicks in and we continue as prescribed above (estimation–allocation). Getting the trial underway, that is, achieving the necessary heterogeneity to carry out the above prescription, is largely arbitrary. This feature is specific to the maximum likelihood implementation, and it may well be treated
separately. Indeed, this is our suggestion, and we describe it more fully below in the subsection on two-stage designs.

B. Bayesian Implementation

Before describing the Bayesian implementation more closely, it is instructive to consider Fig. 3. After seeing the first observed toxicity at level 3, two other patients having been treated at this same level, a further two one level below, and a single patient at the lowest level, the likelihood estimate for a can be seen to be very close to that based on the Bayesian posterior. We expect this for vague priors, but the illustration helps eliminate a potential concern that the maximum likelihood estimate may be too unstable for small samples. This could be the case for two-parameter models (see also Sects. VI.A and VI.C), but such erratic behavior is avoided by the dampening effect of the one-parameter model. Very few further observations are required before the maximum likelihood estimator and the Bayesian estimator become, for practical purposes, indistinguishable.

Figure 3 Likelihood and posterior densities for small samples.

In the original Bayesian setup (15), CRM stipulated that the first entered patient would be treated at the level believed by the experimenter, in light of all currently available knowledge, to be the target level. Such knowledge, possibly together with his or her own subjective conviction, led the experimenter to a "point estimate" of the probability of toxicity at the starting dose equal to the targeted toxic level. This level was chosen to be level 3 in an experiment with six levels, allowing the possibility of both escalation and de-escalation. In O'Quigley et al. (15) the targeted toxicity level was 0.2. In Eq. (2) the "dose" that satisfies this is −0.69, so that we had d_3 = −0.69. In addition, we had d_1 = −1.47, d_2 = −1.10, d_4 = −0.42, d_5 = 0.0, and d_6 = 0.42, so that for a_0 = 1, corresponding to the mean of some prior distribution, the prior point estimates of the toxic probabilities were 0.05, 0.1, 0.2, 0.3, 0.5, and 0.7, respectively. We considered that our point estimate 0.2, corresponding to a = 1, was uncertain and likely to be in error. The notion of uncertainty was expressed via a prior density on a, having support on 𝒜, called g(a).
In the model of Section II.A, 𝒜 is the positive real line, and so we gave consideration to distributions having support on 𝒜, the family of gamma distributions in particular. The simplest member of the gamma family, the standard exponential distribution with g(a) = exp(−a), showed itself to be a prior sufficiently vague for a large number of situations. For this prior, the 95% Bayesian interval for the probability of toxicity at the starting dose lay between 0.003 and 0.96. For the lowest level, the corresponding interval is (10⁻⁵, 0.93), whereas for the highest level, having a prior point estimate of 0.7, the interval becomes (0.26, 0.99). Such a prior is therefore not vague at all levels and suggests that the highest level is likely to be too high. A more vague prior would help acceleration from the starting level to the highest level when we greatly overestimate the new treatment's toxic potential. Even so, it does not take long for the accumulating information to "override" the prior, and this simple exponential formulation appeared to be fairly satisfactory in many cases (15). The starting level d_i is such that we should have

∫_𝒜 ψ(d_i, u) g(u) du = θ
This may be a difficult integral equation to solve; practically, we might take the starting dose to be obtained from ψ(d_i, µ_0) = θ, where

µ_0 = ∫_𝒜 u g(u) du
The other doses could be chosen so that ψ{d_1, µ_0} = α_1, . . . , ψ{d_k, µ_0} = α_k. These initial values for the toxicities may reflect the experimenter's best guesses about the potential for toxicity at the available doses. Note that, in contrast to the maximum likelihood approach, the α_i can be ascribed a more concrete meaning, in terms of probabilities, rather than simply being parameters of a model defined up to an arbitrary positive power transformation. Difficulties can arise if this procedure is not followed. For instance, suppose we decide to start out with a deliberately low dose, although according to the prior a higher dose would have been indicated. This can lead to undesirable behavior of CRM, an example being the potential occurrence of big jumps when escalating (7,13). Restricting escalation increments may appear to alleviate the problem (see MCRM, Sect. VI.B), but we do not recommend this in view of its ad hoc nature and since the problem can be avoided at the setup stage when the guidelines of O'Quigley et al. (15), recalled here, are carefully followed. Such difficulties do not arise with the maximum likelihood approach. Given the set Ω_j, we can calculate the posterior density for a as

f(a | Ω_j) = H_j⁻¹(a) exp{L_j(a)} g(a)   (4)

where H_j(a) is a sequence of normalizing constants, i.e.,

H_j(a) = ∫_0^∞ exp{L_j(u)} g(u) du   (5)
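The posterior quantities in (4) and (5) are one-dimensional integrals and can be approximated by simple quadrature. A sketch of my own, using the standard exponential prior g(a) = exp(−a) discussed above, a trapezoidal rule over a truncated range, and hypothetical data:

```python
import math

# Skeleton and target taken from the text; the data below are hypothetical.
ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]
THETA = 0.2

def psi(level, a):
    """Working model psi(d_i, a) = alpha_i ** a."""
    return ALPHA[level - 1] ** a

def likelihood(a, history):
    """exp{L_j(a)} for binary toxicity data given as (level, y) pairs."""
    p = 1.0
    for x, y in history:
        q = psi(x, a)
        p *= q if y else (1.0 - q)
    return p

def posterior_expectation(f, history, upper=30.0, n=3000):
    """E[f(a) | data] under the prior g(a) = exp(-a), by the trapezoidal rule."""
    h = upper / n
    num = den = 0.0
    for k in range(n + 1):
        a = max(k * h, 1e-9)                   # avoid a = 0 exactly
        w = 0.5 if k in (0, n) else 1.0
        base = likelihood(a, history) * math.exp(-a)
        num += w * f(a) * base
        den += w * base
    return num / den

# Hypothetical first three patients: two nontoxicities, one toxicity at level 3.
history = [(3, 0), (3, 0), (3, 1)]
means = [posterior_expectation(lambda a, i=i: psi(i, a), history)
         for i in range(1, 7)]
# Treat the next patient at the level whose posterior mean toxicity
# probability is closest to the target.
next_level = min(range(1, 7), key=lambda i: abs(means[i - 1] - THETA))
```

Consistent with the rapid de-escalation noted earlier, an early toxicity at level 3 pushes the recommendation to a low level.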
The dose x_{j+1} ∈ {d_1, . . . , d_k} assigned to the (j + 1)th included patient is the dose minimizing

|θ − ∫ ψ{x_{j+1}, u} f(u | Ω_j) du|

If there are many dose levels, it may be more computationally efficient to locate the dose level x_{j+1} ∈ {d_1, . . . , d_k} satisfying

Q_ij(P_ij − 2θ) < 0  (i = 1, . . . , k; x_{j+1} ≠ d_i)

where

Q_ij = ∫_0^∞ {ψ{x_{j+1}, u} − ψ{d_i, u}} f(u | Ω_j) du   (6)
and

P_ij = ∫_0^∞ {ψ{x_{j+1}, u} + ψ{d_i, u}} f(u | Ω_j) du
Often it will make little difference if, rather than working with the expectations of the toxicities, we work with the expectation of a, thereby eliminating the need for k − 1 integral calculations. Thus, we treat the (j + 1)th included patient at the level x_{j+1} ∈ {d_1, . . . , d_k} such that |θ − ψ{x_{j+1}, µ_j}| is minimized, where µ_j = ∫ u f(u | Ω_j) du. As in the likelihood approach, we can calculate an approximate 100(1 − α)% Bayesian interval for ψ(x_{j+1}, µ_j) as (ψ_j⁻, ψ_j⁺), where

ψ_j⁻ = ψ(x_{j+1}, α_j⁻),  ψ_j⁺ = ψ(x_{j+1}, α_j⁺),  ∫_{α_j⁻}^{α_j⁺} f(u | Ω_j) du = 1 − α
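The limits α_j⁻ and α_j⁺ can be found by inverting the posterior cumulative distribution numerically. A sketch of my own (hypothetical data, trapezoidal quadrature on a grid, same exponential prior as above):

```python
import math

ALPHA = [0.05, 0.10, 0.20, 0.30, 0.50, 0.70]   # skeleton from the text

def psi(level, a):
    return ALPHA[level - 1] ** a

def posterior_grid(history, upper=30.0, n=6000):
    """Normalized posterior f(a | data) on a grid, prior g(a) = exp(-a)."""
    h = upper / n
    grid = [max(k * h, 1e-9) for k in range(n + 1)]
    dens = []
    for a in grid:
        p = math.exp(-a)
        for x, y in history:
            q = psi(x, a)
            p *= q if y else (1.0 - q)
        dens.append(p)
    total = sum(dens) * h
    return grid, [d / total for d in dens], h

def bayes_interval(history, next_level, cred=0.95):
    """Equal-tailed credible interval for psi(x_{j+1}, a)."""
    grid, dens, h = posterior_grid(history)
    tail = (1.0 - cred) / 2.0
    cum, a_lo, a_hi = 0.0, grid[0], grid[-1]
    for a, d in zip(grid, dens):
        cum += d * h
        if cum < tail:
            a_lo = a                 # lower quantile of a
        if cum < 1.0 - tail:
            a_hi = a                 # upper quantile of a
    # psi is decreasing in a, so the endpoints swap when mapped through psi.
    return psi(next_level, a_hi), psi(next_level, a_lo)

# Hypothetical data: one patient at level 1, two at level 2, three at
# level 3 with a single toxicity.
history = [(1, 0), (2, 0), (2, 0), (3, 0), (3, 0), (3, 1)]
lo, hi = bayes_interval(history, next_level=3)
```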
The Bayesian approach has the apparent advantage of being immediately operational, in that it is not necessary to wait for patient heterogeneity before being able to assess at which level the successively entered patients should be treated. By modifying our prior and/or model, quantities that are to a large extent arbitrarily defined, we can alter these early operating characteristics to mimic the kind of behavior we would like, for instance, rapid or less rapid escalation. In principle we could fine-tune prior and model parameters to achieve, for example, rapid initial escalation that is gradually dampened in the absence of observed toxicities. Such goals, however, are more readily accomplished via the two-stage designs of the following section.

C. Two-Stage Designs
It may be believed that we know so little before undertaking a given study that it is worthwhile to split the design into two stages: an initial exploratory escalation followed by a more refined homing in on the target. Such an idea was first proposed by Storer (28) in the context of the more classical up and down schemes. His idea was to enable more rapid escalation in the early part of the trial, where we may be quite far from a level at which treatment activity could be anticipated. Moller (13) was the first to use this idea in the context of CRM designs. Her idea was to allow the first stage to be based on some variant of the usual up and down procedures. In the context of sequential likelihood estimation, the necessity of an initial stage was pointed out by O'Quigley and Shen (17), since the likelihood equation fails to have a solution on the interior of the parameter space unless some heterogeneity in the responses has been observed. Their suggestion was to work with any initial scheme, Bayesian CRM or up and down, and for any reasonable scheme the operating characteristics appear relatively insensitive to this choice. However, we believe there is something very natural and desirable in two-stage designs, and currently they could be taken as the designs of choice. Early behavior of the method, in the absence of heterogeneity (i.e., lack of toxic response), appears to be rather arbitrary. A decision to escalate after inclusion of three patients tolerating some level, or after a single patient tolerating a level, or according to some Bayesian prior, however constructed, translates directly (less directly for the Bayesian prescription) the simple desire to try a higher dose because thus far we have encountered no toxicity. The use of a working model at this point, as occurs with Bayesian estimation, may be somewhat artificial, and the rate of escalation can be modified at will, albeit somewhat indirectly, by modifying our model parameters and/or our prior. Rather than lead the clinician into thinking that something subtle and carefully analytic is taking place, our belief is that it is preferable that he or she be involved in the design of the initial phase. Operating characteristics that do not depend on data ought to be driven by clinical rather than statistical concerns. More importantly, the initial phase of the design, in which no toxicity has yet been observed, can be made much more efficient, from both the statistical and ethical angles, by allowing information on toxicity grade to determine the rapidity of escalation. Here we describe an example of a two-stage design that has been used in practice. There were many dose levels, and the first included patient was treated at a low level. As long as we observe very low-grade toxicities, we escalate quickly, including only a single patient at each level. As soon as we encounter more serious toxicities, escalation is slowed down. Ultimately we encounter dose-limiting toxicities, at which time the second stage, based on fitting a CRM model, comes fully into play.
Table 1  Toxicity ''grades'' (severities) for trial

Severity    Degree of toxicity
0           No toxicity
1           Mild toxicity (non-dose-limiting)
2           Nonmild toxicity (non-dose-limiting)
3           Severe toxicity (non-dose-limiting)
4           Dose-limiting toxicity

This is done by integrating this information with that obtained on all the earlier non-dose-limiting toxicities to estimate the most appropriate dose level. It was decided to use information on low-grade toxicities (Table 1) in the first stage of the two-stage design to allow rapid initial escalation, since it is possible that we are far below the target level. Specifically, we define a grade severity variable
S(i) to be the average toxicity severity observed at dose level i, that is, the sum of the severities at that level divided by the number of patients treated at that level. The rule is to escalate provided S(i) is less than 2. Furthermore, once we have included three patients at some level, escalation to higher levels occurs only if each cohort of three patients experiences no dose-limiting toxicity. In practice this scheme means that, as long as we see only toxicities of severities coded 0 or 1, we escalate. The first severity coded 2 necessitates a further inclusion at this same level, and anything other than a 0 severity for this inclusion would require yet a further inclusion, and a non-dose-limiting toxicity, before we are able to escalate. This design also has the advantage that, should we be slowed down by a severe (severity 3) albeit non-dose-limiting toxicity, we retain the capability of picking up speed in escalation should subsequent toxicities be of low degree (0 or 1). This can help avoid being handicapped by an outlier or by an unanticipated and possibly non-drug-related toxicity. Once dose-limiting toxicity is encountered, this phase of the study (the initial escalation scheme) comes to a close and we proceed on the basis of the CRM recommendation. Although the initial phase is closed, the information on both dose-limiting and non-dose-limiting toxicities thereby obtained is used in the second stage.
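The core of the first-stage rule just described, the average-severity criterion, can be sketched as follows. This is a minimal illustration only (it omits the cohort-of-three refinement and assumes at least one patient has been treated at the current level):

```python
def first_stage_escalate(severities, dlt_grade=4):
    """Decide whether the first stage may escalate from the current level.

    severities -- severity scores (0-4, as in Table 1) for the patients
    treated so far at the current dose level (at least one).
    Returns False once a dose-limiting toxicity is seen; the first stage
    then closes and the CRM model takes over.
    """
    if any(s >= dlt_grade for s in severities):
        return False                     # DLT: switch to the CRM stage
    s_bar = sum(severities) / len(severities)
    return s_bar < 2                     # escalate while average severity < 2
```

With severities [0] or [1, 0, 1] the rule escalates, while a single grade-2 toxicity ([2]) holds the level, since the average severity is then not below 2 and a further inclusion at the same level is required.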
D. Grouped Designs
O'Quigley et al. (15) described the situation of delayed response, in which new patients become available for inclusion while the toxicity results are still outstanding on previously entered patients. The suggestion was, in the absence of information on such recently included patients, that the logical course is to treat at the last recommended level, that is, the level indicated by all the currently available information. The likelihood for this situation was written down by O'Quigley et al. (15) and, apart from a constant term not involving the unknown parameter, is just the likelihood we would obtain were the subjects included one by one. Operationally, therefore, no additional work is required to deal with such situations. The question does arise, however, as to the performance of CRM in such cases. Delayed response can lead to grouping, or we can simply decide on grouping by design. Goodman et al. (9) and O'Quigley and Shen (17) studied the effects of grouping. The more thorough study was that of Goodman et al., in which cohorts of one, two, and three were evaluated. Broadly speaking, the cohort size had little impact on operating characteristics and the accuracy of the final recommendation. O'Quigley and Shen (17) indicated that for groups of three and relatively few patients (n = 16), when the correct level was the highest available level and we start out at the lowest or a low level, we might anticipate some marked drop in performance when contrasted with, say, one-by-one inclusion.
Simple intuition would tell us this, and the differences disappeared for samples of size 25. One-by-one inclusion tends to maximize efficiency, but should stability throughout the study be an issue, extra stability can be obtained through grouping. The cost of this extra stability in terms of efficiency loss appears to be generally small. The findings of Goodman et al. (9), O'Quigley and Shen (17), and O'Quigley (19) contradict the conjecture of Korn et al. (11) that any grouping would lead to substantial efficiency losses.

E. Illustration
This brief illustration is recalled from O'Quigley and Shen (17). The study concerned 16 patients, whose toxic responses were simulated from the known dose–toxicity curve. There were six levels in the study, maximum likelihood was used, and the first entered patients were treated at the lowest level. The design was two-stage. The true toxic probabilities were R(d_1) = 0.03, R(d_2) = 0.22, R(d_3) = 0.45, R(d_4) = 0.6, R(d_5) = 0.8, and R(d_6) = 0.95. The working model was that given by Eq. (1), where α_1 = 0.04, α_2 = 0.07, α_3 = 0.20, α_4 = 0.35, α_5 = 0.55, and α_6 = 0.70. The targeted toxicity was θ = 0.2, indicating that the best level for the maximum tolerated dose (MTD) is level 2, where the true probability of toxicity is 0.22. A grouped design was used until heterogeneity in toxic responses was observed, patients being included, as for the classical schemes, in groups of three. The first three patients experienced no toxicity at level 1. Escalation then took place to level 2, and the next three patients treated at this level did not experience any toxicity either. Subsequently, two of the three patients treated at level 3 experienced toxicity. Given this heterogeneity in the responses, the maximum likelihood estimator for a now exists and, after a few iterations, could be seen to equal 0.715. We then have R̂(d_1) = 0.101, R̂(d_2) = 0.149, R̂(d_3) = 0.316, R̂(d_4) = 0.472, R̂(d_5) = 0.652, and R̂(d_6) = 0.775. The 10th entered patient is then treated at level 2, for which R̂(d_2) = 0.149, since, of the available estimates, this is the closest to the target θ = 0.2. The 10th included patient does not suffer toxic effects, and the new maximum likelihood estimator becomes 0.759. Level 2 remains the level with an estimated probability of toxicity closest to the target. This same level is in fact recommended to the remaining patients, so that after 16 inclusions the recommended MTD is level 2. The estimated probability of toxicity at this level is 0.212, and a 90% confidence interval for this probability is estimated as (0.07, 0.39).
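The likelihood step of this illustration is easy to reproduce. Eq. (1) is not shown in this excerpt; the power form ψ(d_i, a) = α_i^a is assumed below because it reproduces the quoted figures (for instance 0.04^0.715 ≈ 0.10). A crude grid search stands in for the iterative maximization:

```python
import math

# Outcomes quoted in the illustration: three non-toxicities at level 1,
# three at level 2, then two toxicities and one non-toxicity at level 3
# (levels indexed from 0 here).
alpha = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]   # working-model skeleton
outcomes = [(0, 0)] * 3 + [(1, 0)] * 3 + [(2, 1), (2, 1), (2, 0)]
theta = 0.2                                    # targeted toxicity

def log_lik(a):
    """Log-likelihood under the assumed power model psi(d_i, a) = alpha_i^a."""
    total = 0.0
    for i, y in outcomes:
        total += a * math.log(alpha[i]) if y else math.log(1.0 - alpha[i] ** a)
    return total

# Grid-search maximum likelihood estimate of a.
a_hat = max((g / 2000.0 for g in range(1, 10000)), key=log_lik)
r_hat = [a_i ** a_hat for a_i in alpha]
recommended = min(range(len(alpha)), key=lambda i: abs(r_hat[i] - theta))
```

Running this gives â ≈ 0.715 and R̂(d_2) ≈ 0.149, with level 2 (index 1) recommended, matching the figures quoted above.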
IV. STATISTICAL PROPERTIES

Recall that CRM is a class of methods rather than a single method, the members of the class depending on arbitrary quantities chosen by the investigator, such as the form of the model, the spacing between the doses, the starting dose, whether single or grouped inclusions are made, the initial dose escalation scheme in two-stage designs, and the prior density chosen for Bayesian formulations. The statistical properties described in this section apply broadly to all members of the class, each member nonetheless maintaining some of its own particularities.

A. Convergence
Convergence arguments are obtained from considerations of the likelihood. The same arguments apply to Bayesian estimation as long as the prior is other than degenerate, that is, as long as all the probability mass is not put on a single point. Usual likelihood arguments break down since our models are misspecified. The maximum likelihood estimate, R̂(d_i) = ψ(d_i, â_j), exists as soon as we have some heterogeneity in the responses (27). We assume the dose–toxicity function ψ(x, a) to satisfy the conditions described in Section II.A, in particular the condition that, for i = 1, . . . , k, there exists a unique a_i such that ψ(d_i, a_i) = R(d_i). Note that the a_i depend on the actual probabilities of toxicity and are therefore unknown. We also require:

1. For each 0 < t < 1 and each x, the function

      s(t, x, a) := t (ψ′/ψ)(x, a) + (1 − t) {−ψ′/(1 − ψ)}(x, a)

   is continuous and strictly monotone in a.

2. The parameter a belongs to a finite interval [A, B].
The first condition is standard for estimating equations to have unique solutions. The second imposes no real practical restriction. We also require the true unknown dose–toxicity function, R(x), to satisfy the following conditions:

1. The probabilities of toxicity at d_1, . . . , d_k satisfy 0 < R(d_1) < ⋅ ⋅ ⋅ < R(d_k) < 1.

2. The target dose level is x_0 ∈ {d_1, . . . , d_k}, where |R(x_0) − θ| < |R(d_i) − θ| (i = 1, . . . , k; x_0 ≠ d_i).

3. Before writing down the third condition, note that since our model is misspecified, it will generally not be true that ψ(d_i, a_0) ≡ R(d_i) for i = 1, . . . , k. We nonetheless require that the working model is not ''too distant'' from the true underlying dose–toxicity curve, and this can be made precise with the help of the set

      S(a_0) = {a : |ψ(x_0, a) − θ| < |ψ(d_i, a) − θ|, for all d_i ≠ x_0}     (7)

   The condition we require is that, for i = 1, . . . , k, a_i ∈ S(a_0).

It can be shown (25), under the assumptions on R(d_i) and ψ(d_i, a), that S(a_0) is an open and convex set. This result is the key to rewriting Ĩ_n(a) below in a convenient way. Letting

   I_n(a) = (1/n) Σ_{j=1}^{n} [ y_j (ψ′/ψ)(x_j, a) + (1 − y_j) {−ψ′/(1 − ψ)}(x_j, a) ]

and

   Ĩ_n(a) = (1/n) Σ_{j=1}^{n} [ R(x_j) (ψ′/ψ)(x_j, a) + {1 − R(x_j)} {−ψ′/(1 − ψ)}(x_j, a) ]

then

   sup_{a ∈ [A, B]} |I_n(a) − Ĩ_n(a)| → 0,  almost surely     (8)
This convergence result follows intuitively and can be demonstrated rigorously in a number of ways. For instance, observe that for each dose level d_i, (ψ′/ψ)(d_i, ⋅) and {ψ′/(1 − ψ)}(d_i, ⋅) are uniformly continuous in a over the finite interval [A, B]. Shen and O'Quigley (25) applied this result to a sufficiently fine partition of the interval [A, B] to bound the above differences by arbitrarily small quantities. The result then follows. The next important step is to consider the finite interval S_1(a_0) = [a_(1), a_(k)], in which a_(1) = min{a_1, . . . , a_k} and a_(k) = max{a_1, . . . , a_k}. The third condition on R(x) and the convexity of the set S(a_0) imply that S_1(a_0) ⊂ S(a_0). Define π_n(d_i) ∈ [0, 1] to be the frequency with which the level d_i has been used in the first n experiments. Then we can rewrite Ĩ_n(a) as

   Ĩ_n(a) = Σ_{i=1}^{k} π_n(d_i) { R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) }     (9)
Now, let ã_n be the solution to the equation Ĩ_n(a) = 0, that is, ã_n solves

   Σ_{i=1}^{k} π_n(d_i) { R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) } = 0     (10)
For each 1 ≤ i ≤ k, the definition of a_i and condition 1 on the dose–toxicity function indicate that a_i is the unique solution to the equation

   R(d_i) (ψ′/ψ)(d_i, a) + (1 − R(d_i)) {−ψ′/(1 − ψ)}(d_i, a) = 0

It follows that ã_n will fall into the interval S_1(a_0). Since â_n solves I_n(a) = 0, Eq. (8) and uniform continuity ensure that, almost surely, â_n ∈ S(a_0) for n sufficiently large. Hence, for large n, â_n satisfies

   |ψ(x_0, â_n) − θ| < |ψ(d_i, â_n) − θ|,  for i = 1, . . . , k, d_i ≠ x_0
Thus, for n large enough, x_{n+1} ≡ x_0, so that at this dose level x_{n+1} satisfies |x_{n+1} − x_0| < |x_n − x_0| if x_n ≠ x_0. In other words, x_n converges to x_0 almost surely. Since there are only a finite number of dose levels, x_n will ultimately stay at x_0. To establish the consistency of â_n, observe that, as n tends to infinity, the π_n(d_i) (i = 1, . . . , k) in Eq. (9) become negligible, except π_n(x_0), which tends to 1. Thus ã_n, being the solution to Eq. (10), will tend to the solution to

   R(x_0) (ψ′/ψ)(x_0, a) + {1 − R(x_0)} {−ψ′/(1 − ψ)}(x_0, a) = 0

The solution to the above equation is a_0. Applying Eq. (8) again, we obtain the consistency of â_n and, further, that the asymptotic distribution of √n(â_n − a_0) is N(0, σ²), with σ² = {ψ′(x_0, a_0)}^{−2} θ_0(1 − θ_0).

B. Efficiency
O'Quigley (14) proposes using θ̂_n = ψ(x_{n+1}, â_n) to estimate the probability of toxicity at the recommended level x_{n+1}, where â_n is the maximum likelihood estimate. An application of the δ method shows that the asymptotic distribution of √n{θ̂_n − R(x_0)} is N{0, θ_0(1 − θ_0)}. The estimate provided by CRM is therefore fully efficient. This is what our intuition would suggest given the convergence properties of CRM. What actually takes place in finite samples needs to be investigated on a case-by-case basis. Nonetheless, in the relatively broad range of cases studied by O'Quigley (14), the mean squared error for the estimated probability of toxicity at the recommended level under CRM corresponds well with the theoretical variance for samples of size n, were all subjects to be experimented on at the correct level. Some cases studied showed evidence of superefficiency, reflecting a nonnegligible bias that happens to be in the right direction, whereas a few others indicated efficiency losses large enough to suggest the potential for improvement.

C. Safety
In any discussion of a phase I design, the word safety will arise, reflecting a central ethical concern. A belief that CRM would tend to treat the early included patients in a study at high dose levels convinced many investigators that, without some modification, CRM was not ''safe.'' Safety is in fact a statistical property of any method. When faced with some potential realities, or classes of realities, we can ask questions such as: what is the probability of toxicity for a randomly chosen patient included in the study, or what is the probability of toxicity for those patients entered into the study at the very beginning?
Once we know the realities or classes of realities we are facing, together with the operating rules of the method, which are obvious and transparent for up and down schemes and less transparent for model-based schemes such as CRM, then in principle we can calculate the probabilities mentioned above. In practice, these calculations are involved, and we may simply prefer to estimate them to any desired degree of accuracy via simulation. Theoretical work and extensive simulations (1,15–17,19,23) indicate CRM to be a safer design than any of the commonly used up and down schemes, in that for targets of less than θ = 0.30, the probability that a randomly chosen patient suffers a toxicity is lower. Furthermore, the probability of being treated at levels higher than the MTD was, in all the studied cases, higher with the standard designs than with CRM. If the definition of safety were widened to include the concept of treating patients at unacceptably low levels, that is, levels at which the probability of toxicity is deemed too close to zero, then CRM does very much better than the standard designs. This finding is logical given that the purpose of CRM is to concentrate as much experimentation as possible around the prespecified target. In addition, it ought to be emphasized that we can adjust the CRM to make it as safe as we require by changing the target level. For instance, if we decreased the target from 0.20 to 0.10, the observed number of toxicities would, on average, be roughly halved. This is an important point, since it highlights the main advantages of the CRM over the standard designs in terms of flexibility and the ability to adapt to potentially different situations. An alternative way to enhance conservatism is, rather than choosing the closest available dose to the target, to take systematically the dose immediately below the target, or to change the distance measure used to select the next level to recommend. Safety ought to be improved, although the impact such an approach might have on the reliability of final estimation remains to be studied. Some work along these lines has been carried out by Babb et al. (5).
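As noted above, these operating characteristics are usually estimated by simulation. The sketch below is purely illustrative: the true dose–toxicity curve and the power-form working model ψ(d_i, a) = α_i^a are assumptions borrowed from the earlier illustration, and the two-stage logic is simplified (a full implementation would wait for heterogeneity in the responses before fitting). It estimates the probability that a randomly chosen patient in the trial suffers a toxicity:

```python
import math
import random

true_tox = [0.03, 0.22, 0.45, 0.60, 0.80, 0.95]   # assumed true curve
alpha = [0.04, 0.07, 0.20, 0.35, 0.55, 0.70]      # working-model skeleton
theta, n_patients, n_trials = 0.20, 16, 100

def mle(history):
    """Crude grid-search MLE of a from (level, outcome) pairs."""
    def log_lik(a):
        return sum(a * math.log(alpha[i]) if y
                   else math.log(1.0 - alpha[i] ** a) for i, y in history)
    return max((g / 100.0 for g in range(1, 300)), key=log_lik)

random.seed(1)
n_tox = 0
for _ in range(n_trials):
    history, level = [], 0
    for _ in range(n_patients):
        y = int(random.random() < true_tox[level])
        history.append((level, y))
        n_tox += y
        if any(t for _, t in history):
            # Second stage: model-based recommendation.
            a_hat = mle(history)
            level = min(range(len(alpha)),
                        key=lambda i: abs(alpha[i] ** a_hat - theta))
        else:
            # First stage: escalate one level at a time while no toxicity.
            level = min(level + 1, len(alpha) - 1)

per_patient_tox = n_tox / (n_trials * n_patients)
```

Under a given scenario, quantities such as `per_patient_tox` (or the frequency of treatment above the MTD) can be compared across designs to any desired accuracy by increasing `n_trials`.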
V. MORE COMPLEX CRM DESIGNS

The different up and down designs amount to a collection of ad hoc rules for making decisions when faced with accumulating observations. The CRM leans on a model that, being underparametrized, does not provide a broad summary of the true underlying probabilistic phenomenon, but does nonetheless provide enough structure to enable better control in an experimental situation. In principle at least, a model enables us to go further and accommodate greater complexity. Care is needed, but with some skill in model construction we may hope to capture some other effects that are necessarily ignored by the rough-and-ready up and down designs. The following sections consider some examples.
A. Inclusion of a Stopping Rule
The usual CRM design requires that a given sample size be determined in advance. However, given the convergence properties of CRM, it may happen in practice that we appear to have settled on a level before having included the full anticipated sample size of n patients. In such a case we may wish to bring the study to an early close, thereby enabling the phase II study to be undertaken more quickly. One possible approach, suggested by O'Quigley et al. (15), would be to use the Bayesian intervals (ψ_j^−, ψ_j^+) for the probability of toxicity, ψ(x_{j+1}, â_j), at the currently recommended level, and to stop the study when this interval falls within some prespecified range. Another approach would be to stop after some fixed number of subjects have been treated at the same level. Such designs were used by Goodman et al. (9) and Korn et al. (11) and have the advantage of great simplicity. The properties of such rules remain to be studied, however, and we do not recommend their use at this point. One stopping rule that has been studied in detail (18) is described here. The idea is based on the convergence of CRM: as we reach a plateau, the accumulating information can enable us to quantify this notion. Specifically, given Ω_j, we would like to say something about the levels at which the remaining patients, j + 1 to n, are likely to be treated. The quantity we are interested in is

   P_{j,n} = Pr{x_{j+1} = x_{j+2} = ⋅ ⋅ ⋅ = x_{n+1} | Ω_j}

In words, P_{j,n} is the probability that x_{j+1} is the dose recommended to all remaining patients in the trial and is the final recommended dose. Thus, to find P_{j,n}, one needs to determine all the possible outcomes of the trial based on the results known for the first j patients. The following algorithm achieves the desired result.

1. Construct a complete binary tree on 2^{n−j+1} − 1 nodes corresponding to all possible future outcomes (y_{j+1}, . . . , y_n). The root is labeled with x_{j+1}.
2. Assuming that y_{j+1} = 1, compute the value of â_{j+1}. Determine the dose level that would be recommended to patient j + 1 in this case. Label the left child of the root with this dose level.
3. Repeat step 2, this time with y_{j+1} = 0. Determine the dose level that would be recommended to patient j + 1. Label the right child of the root with this dose level.
4. Label the remaining nodes of the tree, level by level, as in the preceding steps.
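Walking the binary tree amounts to looping over all 2^{n−j} future response vectors. The sketch below makes that explicit; the `recommend` callback, standing in for a refit of the CRM model to the augmented history, is a hypothetical placeholder:

```python
from itertools import product

def stop_probability(p_hat, n_remaining, recommend, history):
    """Estimate P_{j,n}: the probability that the current recommendation
    never changes over the remaining patients.

    p_hat        -- estimated toxicity probability at the current level
    n_remaining  -- n - j patients still to be treated
    recommend    -- hypothetical callback: recommend(history) -> dose level
                    (refits the working model to the augmented history)
    history      -- (level, y) outcomes for the first j patients
    """
    current = recommend(history)
    total = 0.0
    for future in product([0, 1], repeat=n_remaining):
        # A path counts only if every interim recommendation along it
        # equals the current level.
        h = list(history)
        constant = True
        for y in future:
            h.append((current, y))
            if recommend(h) != current:
                constant = False
                break
        if constant:
            tau = sum(future)   # number of toxicities along the path
            total += p_hat ** tau * (1.0 - p_hat) ** (n_remaining - tau)
    return total
```

Each constant-label path contributes its estimated probability p̂^τ(1 − p̂)^{n−j−τ}, and the paths are mutually exclusive, so the sum estimates P_{j,n} directly.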
We use the notation T_{j,n} to refer to the tree constructed with this algorithm. Each path in T_{j,n} that starts at the root and ends at a leaf whose nodes all have the same label represents a trial in which the recommended dose is unchanged between the (j + 1)st and the (n + 1)st patient. The probability of each such path is given by

   {R(x_{j+1})}^τ {1 − R(x_{j+1})}^{n−j−τ}

where τ is the number of toxicities along the path. Since the paths are mutually exclusive, we can sum their probabilities to obtain P_{j,n}. Using ψ{x_{j+1}, â_j}, the current estimate of the probability of toxicity at x_{j+1}, we may estimate the probability of each path by

   {ψ(x_{j+1}, â_j)}^τ {1 − ψ(x_{j+1}, â_j)}^{n−j−τ}

Adding up these path estimates yields an estimate of P_{j,n}. Details are given in O'Quigley and Reiner (18).

B. Patient Heterogeneity

As in other types of clinical trials, we are essentially looking for an average effect. Patients of course differ in the way they may react to a treatment, and, although hampered by small samples, we may sometimes be in a position to address specifically the issue of patient heterogeneity. One example occurs in patients with acute leukemia, where it has been observed that children will better tolerate more aggressive doses (standardized by their weight) than adults. Likewise, heavily pretreated patients are more likely to suffer from toxic side effects than lightly pretreated patients. In such situations we may wish to carry out separate trials for the different groups to identify the appropriate MTD for each group. Otherwise we run the risk of recommending an ''average'' compromise dose level, too toxic for part of the population and suboptimal for the other. Usually, clinicians carry out two separate trials, or split a trial into two arms after encountering the first dose-limiting toxicities (DLTs), when it is believed that there are two distinct prognostic groups. This has the disadvantage of failing to use information common to both groups. A two-sample CRM has been developed so that only one trial is carried out, based on information from both groups (20). A multisample CRM is a direct generalization, although we must remain realistic in terms of what is achievable in the light of the available sample sizes.
Let I, taking value 1 or 2, be the indicator variable for the two groups. Otherwise, we use the same notation as previously defined. For clarity, we suppose that the targeted probability is the same in both groups and is denoted by θ, although this assumption is not essential to our conclusions. The dose–toxicity model is now the following:

   Pr(Y = 1 | X = x, I = 1) = ψ_1(x, a)
   Pr(Y = 1 | X = x, I = 2) = ψ_2(x, a, b)
Parameter b measures to some extent the difference between the groups. The functions ψ_1 and ψ_2 are selected in such a way that for each θ ∈ (0, 1) and each dose level x there exists (a_0, b_0) satisfying ψ_1(x, a_0) = θ and ψ_2(x, a_0, b_0) = θ. This condition is satisfied by many function pairs. The following model has performed well in simulations:

   ψ_1(x, a) = exp(a + x) / {1 + exp(a + x)},
   ψ_2(x, a, b) = b exp(a + x) / {1 + b exp(a + x)}
There are many other possibilities; an obvious generalization of the model of O'Quigley et al. (15), arising from Eq. (1), has Eq. (2) apply to group 1 and

   α_i = {tanh(d_i + b) + 1}/2   (i = 1, . . . , k)     (11)

to group 2. A non-zero value for b indicates group heterogeneity. Let z_k = (x_k, y_k, I_k), k = 1, . . . , j, be the outcomes of the first j patients, where I_k indicates to which group the kth subject belongs, x_k is the dose level at which the kth subject is tested, and y_k indicates whether or not the kth subject suffered a toxic response. To estimate the two parameters, one can use a Bayesian estimate or a maximum likelihood estimate, as for a traditional CRM design. On the basis of the observations z_k (k = 1, . . . , j) on the first j_1 patients in group 1 and j_2 patients in group 2 (j_1 + j_2 = j), we can write down the likelihood as

   ℒ(a, b) = Π_{i=1}^{j_1} ψ_1(x_i, a)^{y_i} {1 − ψ_1(x_i, a)}^{1−y_i} × Π_{i=j_1+1}^{j} ψ_2(x_i, a, b)^{y_i} {1 − ψ_2(x_i, a, b)}^{1−y_i}
If we denote by (â_j, b̂_j) the values of (a, b) maximizing this likelihood after the inclusion of j patients, then the estimated dose–toxicity relations are ψ_1(x, â_j) and ψ_2(x, â_j, b̂_j), respectively. If the (j + 1)th patient belongs to group 1, he or she will be allocated the dose level that minimizes |ψ_1(x_{j+1}, â_j) − θ|, with x_{j+1} ∈ {d_1, . . . , d_k}. On the other hand, if the (j + 1)th patient belongs to group 2, the recommended dose level minimizes |ψ_2(x_{j+1}, â_j, b̂_j) − θ|. The trial is carried out as usual: after each inclusion, our knowledge of the probabilities of toxicity at each dose level for either group is updated via the parameters. It has been shown that, under some conditions, the recommendations converge to the right dose level for both groups, as do the estimates of the true probabilities of toxicity at these two levels. Note that it is not necessary that the two sample sizes be balanced, nor that entry into the study alternate between groups.
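The two-group likelihood above can be maximized with any general-purpose optimizer. A minimal sketch using the logistic pair quoted earlier and a crude grid search in place of proper maximization (the grid ranges are assumptions for illustration):

```python
import math
from itertools import product

def psi1(x, a):
    """Group 1 dose-toxicity model (logistic form quoted in the text)."""
    return math.exp(a + x) / (1.0 + math.exp(a + x))

def psi2(x, a, b):
    """Group 2 model: the group-1 odds multiplied by b."""
    return b * math.exp(a + x) / (1.0 + b * math.exp(a + x))

def neg_log_lik(a, b, data):
    """data: list of (dose, outcome, group) triples."""
    nll = 0.0
    for x, y, g in data:
        p = psi1(x, a) if g == 1 else psi2(x, a, b)
        nll -= math.log(p) if y else math.log(1.0 - p)
    return nll

def fit(data):
    """Crude grid-search substitute for joint maximum likelihood."""
    grid_a = [i / 20.0 for i in range(-100, 101)]   # a in [-5, 5]
    grid_b = [j / 20.0 for j in range(1, 101)]      # b in (0, 5]
    return min(product(grid_a, grid_b),
               key=lambda ab: neg_log_lik(ab[0], ab[1], data))
```

After fitting, allocation proceeds by minimizing |ψ_1(x, â) − θ| for a group 1 patient and |ψ_2(x, â, b̂) − θ| for a group 2 patient, exactly as described above.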
Figure 4 Results of a simulated trial for two groups.
Figure 4 illustrates a simulated trial carried out with a two-parameter model. Implementation was based on likelihood estimation, necessitating non-toxicities and a toxicity in each group before the model could be fully fitted. Before this, dose-level escalation followed an algorithm incorporating grade information, paralleling that of Section III.C. The design called for shared initial escalation; that is, the groups were combined until evidence of heterogeneity began to manifest itself. The first DLT occurred in group 2, for the fifth included patient. At this point, the trial was split into two arms, group 2 recommendation being based on ℒ(â, 0) and ψ_2(x, â, 0), and group 1 continuing without a model as in Section III.C. Note that there are many possible variants on this design. The first DLT in group 1 was encountered at dose level 6 and led to a lower level being recommended to the next included patient. For the remainder of the study, allocation for both groups leaned on the model together with the minimization algorithms described above.

C. Pharmacokinetic Studies

Statistical modeling of the clinical situation of phase I dose-finding studies, such as takes place with the CRM, is relatively recent. Much more fully studied in the phase I context are pharmacokinetics and pharmacodynamics. Roughly speaking, pharmacokinetics deals with the concentration and elimination characteristics of given compounds in specified organ systems, most often blood plasma, whereas pharmacodynamics focuses on how the compounds affect the body. This vast subject is referred to as PK/PD modeling. Clearly, such information will have a bearing on whether or not a given patient is likely to encounter dose-limiting toxicity or, in retrospect, on why some patients and not others were able to tolerate a given dose. Many parameters are of interest to the pharmacologist, for example, the area under the concentration–time curve, the rate of clearance of the drug, and the peak concentration. For our purposes, a particular practical difficulty arises in the phase I context: any such information only becomes available once the dose has been administered. Most often, then, the information will be of most use in retrospectively explaining the toxicities. However, it is possible to have pharmacodynamic information, and other patient characteristics relating to the patient's ability to metabolize the drugs, available before selecting the level at which the patient should be treated. In principle we can write down any model we care to hypothesize, say one including all the relevant factors believed to influence the probability of encountering toxicity, and then proceed to estimate the parameters. However, as in the case of patient heterogeneity, we must remain realistic in terms of what can be achieved given the maximum obtainable sample size. Some pioneering work has been carried out here by Piantadosi et al. (22), indicating the potential for improved precision through the incorporation of pharmacokinetic information. This is a large field awaiting further exploration. The strength of CRM is to locate the target dose level with relatively few patients.
The remaining patients are then treated at this same level, and a recommendation is made for this level. Further studies, following the phase I clinical study, can then be made, and this is where we see the main advantage of pharmacokinetics. Most patients will have been studied at the recommended level and a smaller number at adjacent levels. At any of these levels, we will have responses and a great deal of pharmacokinetic information. The usual models, in particular the logistic model, can be used to see whether this information helps explain the toxicities. If so, we may be encouraged to carry out further studies at higher or lower levels for certain patient profiles, indicated by the retrospective analysis to have probabilities of toxicity much lower or much higher than suggested by the average estimate. This can be viewed as fine tuning and may itself give rise to new, more highly focused phase I studies. At this point we do not see the utility of a model in which all the different factors are included as regressors. These further analyses are necessarily delicate, requiring great statistical and/or pharmacological skill, and a mechanistic approach based on a catch-all model is probably to be advised against.
D. Graded Toxicities

Although we refer to dose-limiting toxicity as a binary (0,1) variable, most studies record information on the degree of toxicity, from 0, complete absence of side effects, to 4, life-threatening toxicity. The natural reaction for a statistician is to consider that the response variable, toxicity, has been simplified in going from five levels to two, and that it may help to use models accommodating multilevel responses. In fact, this is not the way we believe progress is to be made. The issue is not that of modeling a response (toxicity) at five levels but of controlling for dose-limiting toxicity, mostly grade 4 but possibly also certain kinds of grade 3. Lower grades are helpful in that their occurrence indicates that we are approaching a zone in which the probability of encountering a dose-limiting toxicity is becoming large enough to be of concern. This idea is used implicitly in the two-stage designs described in Section III. If we are to proceed more formally and, hopefully, extract yet more information from the observations, then we need models relating the occurrence of dose-limiting toxicities to the occurrence of lower grade toxicities. In the unrealistic situation in which we can accurately model the ratio of the probabilities of the different types of toxicity, we can make striking gains in efficiency, since the more frequently observed lower grade toxicities carry a great deal of information on the potential occurrence of dose-limiting toxicities. Such a situation would also allow gains in safety since, at least hypothetically, it would be possible to predict the rate of occurrence of dose-limiting toxicities at some level without necessarily having observed very many, the prediction leaning largely on the model.
At the opposite end of the model/hypothesis spectrum, we might decide we know nothing about the relative rates of occurrence of the different toxicity types and simply allow the accumulating observations to provide the necessary estimates. In this case it turns out that we neither lose nor gain efficiency, and the method behaves identically to one in which the only information obtained is whether or not the toxicity is dose limiting. These two situations suggest a middle road, using a Bayesian prescription, in which very careful modeling can lead to efficiency improvements, if only moderate ones, without making strong assumptions. To make this more precise, let us consider the case of three toxicity levels, the highest being dose limiting. Let Y_k denote the toxic response at level k (k = 1, . . . , 3). The goal of the trial is still to identify a dose level whose probability of severe toxicity is closest to a given percentile of the dose–toxicity curve. A working model for the CRM could be

   Pr(Y_k = 3) = ψ_1(x_k, a)
   Pr(Y_k = 2 or Y_k = 3) = ψ_2(x_k, a, b)
   Pr(Y_k = 1) = 1 − ψ_2(x_k, a, b)
The contributions to the likelihood are 1 − ψ_2(x_k, a, b) when Y_k = 1, ψ_1(x_k, a) when Y_k = 3, and ψ_2(x_k, a, b) − ψ_1(x_k, a) when Y_k = 2. With no prior information, maximizing the likelihood gives exactly the same results as the more usual one-parameter CRM, owing to parameter orthogonality. There is therefore no efficiency gain, although there is the advantage of learning about the relationship between the different toxicity types. Let us imagine instead that the parameter b is known precisely. The model need not be correctly specified, although b should maintain an interpretation outside the model, for instance some simple function of the ratio of grade 3 to grade 2 toxicities. Efficiency gains are then quite substantial (21). This is not of direct practical interest, since the assumption of no error in b is completely inflexible; that is, should we be wrong in our assumed value, this induces a noncorrectable bias. To overcome this, we have investigated a Bayesian setup in which prior information provides a ''point estimate'' for b, with the uncertainty associated with it expressed via a prior distribution. Errors can then be overwritten by the data. This work is incomplete at the time of this writing, but early results are encouraging.
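The three likelihood contributions just listed translate directly into code. In this sketch, psi1 and psi2 are supplied by the user (any pair satisfying psi1 ≤ psi2 over the dose range):

```python
def likelihood_contribution(y, x, a, b, psi1, psi2):
    """Per-patient likelihood contribution under the three-level working
    model of the text:
        Pr(Y = 3)          = psi1(x, a)        (dose-limiting)
        Pr(Y = 2 or Y = 3) = psi2(x, a, b)
        Pr(Y = 1)          = 1 - psi2(x, a, b)
    """
    if y == 1:
        return 1.0 - psi2(x, a, b)
    if y == 2:
        return psi2(x, a, b) - psi1(x, a)
    if y == 3:
        return psi1(x, a)
    raise ValueError("y must be 1, 2, or 3")
```

Since the three contributions partition the probability space, they sum to one at any fixed dose and parameter values, which offers a quick sanity check on any candidate (psi1, psi2) pair.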
VI. RELATED APPROACHES

A number of other designs for phase I studies have been suggested. The modified continual reassessment method, as its name suggests, is clearly related to CRM; it is described below. Schemes that predate the CRM, such as those of Anbar (2–4) and Wu (32,33), leaning on stochastic approximation, are in fact quite closely related.

A. Stochastic Approximation
The problem is considered from the angle of estimating the root of a regression function M(x) from observations (x_1, y_1), . . . , (x_n, y_n), where (x_i, y_i) satisfy

y_i = M(x_i) + σε_i

The errors ε_1, . . . , ε_n are i.i.d. random variables with mean zero and unit variance. Let θ_0 be a real value and ξ_0 be the solution of M(x) = θ_0. We are interested in sequential determination of the design values x_1, . . . , x_n, so that ξ_0 can be estimated consistently and efficiently from the corresponding observations y_1, . . . , y_n. This problem has applications in both industrial experiments and medical research. Following the pioneering work of Robbins and Monro (24),
Continual Reassessment Method
63
stochastic approximation has been applied to this problem and has been studied by many authors. More precisely, the Robbins–Monro procedure calculates the design values sequentially according to

x_{n+1} = x_n − (c/n)(y_n − θ_0)   (12)
where c is some constant. Lai and Robbins (34) pointed out the connection between Eq. (12) and the following procedure based on ordinary linear regression applied to (x_1, y_1), . . . , (x_n, y_n):

x_{n+1} = x_n − β̂_n^{−1}(y_n − θ_0)   (13)
where β̂_n is the least-squares estimate of β. Wu (33) gave a heuristic argument that Eq. (12) is equivalent to Eq. (13). Under certain circumstances, stochastic approximation can thus be considered as fitting an ordinary linear model to the existing data, treating the regression line as an approximation of M(⋅), and using it to calculate the next design point. Anbar (2,3) used the Robbins–Monro procedure in the context of phase I dose finding, sequentially estimating the slope. Considerably more observations were required to achieve relative stability than for the CRM (16), although it was clear that such designs were superior to the standard up and down design. As pointed out by Wu (32), the Robbins–Monro procedure is unstable at places where the function M(x) is flat. Wu (32) therefore proposed truncating β̂_n whenever it becomes too large. This stabilizes the estimate of β. We believe, however, that the intrinsic source of instability lies in the use of a model with two parameters (the intercept and the slope) in the estimation of a single root. Imagine that after many experiments the design points have concentrated around the root ξ_0. With relatively few points outside the small region around the root, there are infinitely many pairs that fit the data equally well, since for every given intercept we can find a slope passing through the observations. It is then quite possible for the estimate of β to be unstable if a regression line is fitted to data with both intercept and slope, irrespective of whether M(⋅) is flat or not. It should be possible to estimate ξ_0 by fitting the data with a linear model without an intercept. Intuitively, a one-parameter model would be sufficient for determining ξ_0 if most of the design values are around it. The estimate of the slope then becomes relatively stable. The main concern is the consistency of the estimate, since the model becomes less flexible and thus may find it more difficult to capture the nature of the data.
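The Robbins–Monro recursion of Eq. (12) is easy to simulate. In this sketch the regression function M, the noise level, the starting point, and the constant c are arbitrary illustrative choices, not values from the text.

```python
import random

def robbins_monro(M, theta0, x1=0.0, c=2.0, sigma=0.1, n=2000, seed=1):
    """Sequentially approach the root of M(x) = theta0 via Eq. (12)."""
    rng = random.Random(seed)
    x = x1
    for i in range(1, n + 1):
        y = M(x) + sigma * rng.gauss(0, 1)   # noisy observation of M at the current design point
        x = x - (c / i) * (y - theta0)       # Eq. (12): step size c/n shrinks over time
    return x

M = lambda x: 2.0 + 0.5 * x   # illustrative linear regression function; root of M(x) = 3 is x = 2
print(robbins_monro(M, theta0=3.0))
```

With M linear and c roughly matched to the inverse slope, the iterates settle quickly around the root ξ_0; a flat M(x) would make the correction term nearly useless, which is the instability discussed above.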
However, consider the simple case in which the data are generated according to an ordinary linear regression, M(x) = α + βx:

Y = α + βX + σε,   α > 0, β > 0, X > 0   (14)
Then ξ_0 is the solution of the equation α + βx = θ_0 > 0. Suppose that we have collected data (x_1, y_1), . . . , (x_n, y_n); we can then fit the underparameterized regression line going through the origin. This results in the slope estimate β̂_n = ȳ_n/x̄_n, where x̄_n = Σx_i/n and ȳ_n = Σy_i/n. Solving β̂_n x = θ_0 yields the recommended design value for the next experiment:

x_{n+1} = θ_0/β̂_n = θ_0 x̄_n/ȳ_n   (15)
This process is repeated after observation of y_{n+1}. Note that the model used by the procedure is different from that generating the data. The design point x_n and the average x̄_n can both serve as estimates of ξ_0. These estimates are called consistent if they converge to ξ_0 almost surely as n goes to infinity. The definition of x̄_n implies that its consistency is equivalent to that of x_n. The conditions for consistency have been identified by Shen and O'Quigley (25), and the main arguments are close to those showing the consistency of the CRM.

B. Modified, Extended, and Restricted CRM
The operating characteristics of CRM depend, as has been outlined in Section II, on certain arbitrary specifications. For any situation or class of situations we can obtain the operating characteristics to decide on the most appropriate design within the class of CRM designs. Note that once the working model and other specifications have been fixed, the levels recommended by the method are entirely deterministic. It is enough to specify given paths of visited levels and the associated responses to know exactly which level will be recommended by CRM. If we wish to leave the paths as random, depending on the possible outcomes, then the appropriate tool to use is that of simulation. This would be analogous to calculations of sample size, degree of balance, and stratification in the more common design context of randomized phase III studies. There is enough flexibility in the CRM to obtain any reasonable characteristics we might believe necessary. There is therefore no cause to be concerned about unanticipated, erratic, or aberrant behavior. Before we carry out any given trial, we can put ourselves in a position of knowing just what the behavior will be when faced with some particular circumstance. The modified continual reassessment method (7) was developed to deal with perceived problems in operating characteristics, in particular the possibility of jumping dose levels. However, such problems do not arise when CRM is correctly implemented (Section III.B). Correct implementation is preferable to using schemes with ad hoc design modifications resulting in potentially poor
operating characteristics. Before considering this method, sometimes referred to as MCRM, it is important to underline that the perceived difficulties arise from an implementation of CRM differing from that described in the current work as well as in the original paper by O'Quigley et al. (15). CRM does not work with the actual doses, as does the method of Faries. As described by O'Quigley et al. (15), O'Quigley and Shen (17), and Shen and O'Quigley (25), the ''doses'' are conceptual, ordered by increasing probabilities of toxic reaction (see also Sect. II). In addition, if we are to use a Bayesian implementation of CRM and our prior knowledge is weak, then this must be reflected in the choice of prior. The priors selected by Faries (7), in the context of a one-parameter model where the dose is now the real dose, are informative. It should be no surprise at all that ''CRM,'' in such circumstances, does not behave as we might hope. The consequences of the particular setup for CRM by Faries (7) were twofold: It was possible to observe dose escalation after an observed toxicity, and it was possible to observe large skips in the dose levels after nontoxicities. Faries suggested fixing these awkward operating characteristics by continuing with the same prescription but, rather than using the recommended level directly, introducing two further rules. The first of these is that dose escalation after an observed toxicity is not allowed: The method's prescription is overruled, and we allocate at the same level. The second rule is that escalation should never be more than a single level, so that ''skipping'' is also overruled when indicated by the method. The new modified method is called MCRM. For the standard case of six dose levels studied by O'Quigley et al. (15), MCRM fixes a problem that does not exist under the usual guidelines. The usual CRM will never recommend escalation after an observed toxicity and will never skip doses when escalating.
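This behavior is straightforward to check numerically. The sketch below implements a maximum likelihood version of a one-parameter CRM with the power working model ψ(x, a) = x^a, one standard choice of underparameterized model; the skeleton of working probabilities, the target θ, and the toy data are invented for illustration. In this example, adding an observed DLT to the data does not raise the recommended level.

```python
import numpy as np

# Hypothetical skeleton: conceptual ''doses'' as working toxicity probabilities.
skeleton = np.array([0.05, 0.10, 0.20, 0.30, 0.45])

def recommend(data, theta=0.25):
    """data: list of (level index, 1 if DLT else 0); returns the next recommended level."""
    grid = np.linspace(0.05, 5.0, 2000)      # grid search for the MLE of a
    ll = np.zeros_like(grid)
    for k, y in data:
        p = skeleton[k] ** grid              # psi(x_k, a) = x_k^a over the grid
        ll += np.log(p) if y else np.log(1 - p)
    a_hat = grid[np.argmax(ll)]
    # Recommend the level whose estimated toxicity probability is closest to theta.
    return int(np.argmin(np.abs(skeleton ** a_hat - theta)))

before = recommend([(0, 0), (1, 0), (2, 0), (2, 1)])
after = recommend([(0, 0), (1, 0), (2, 0), (2, 1), (1, 1)])
print(before, after)   # here the extra DLT does not raise the recommendation
```

The maximum likelihood version needs at least one toxic and one nontoxic response for the estimate to exist, which is why the toy data include both; a two-stage design handles the run-in before that heterogeneity is observed.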
We may conclude that there is then little to choose between CRM and MCRM. However, working with the actual doses, as does MCRM, can be problematic. MCRM, as can be seen by studying Fig. 1 of Faries (7), can be unstable. Essentially, MCRM works with a different dose–toxicity function, lacking the flexibility we may need, and although large sample behavior should be similar to CRM, for small finite samples the relatively steep dose–toxicity working models of Section II.A may be required to dampen oscillations and reduce instability. Note that none of the comparative studies in Tables 4, 5, and 6 of Faries compare MCRM with CRM. A comparative study of MCRM and CRM has, to our knowledge, yet to be carried out. Our few limited studies indicate that MCRM performs poorly when compared with CRM. For studies with a large number of dose levels, 12 or more (see 13), the problem of skipping really could arise. However, the apparently alarming example that Moller quotes, in which CRM recommends treating the second entered patient at level 10 following a nontoxicity for the first patient at level 1, would, as with Faries, never arise had CRM been implemented according to O'Quigley
et al. (15). To prevent such an occurrence, Moller also suggests the same rule as Faries. This design is called restricted CRM, but as described above for MCRM, it addresses problems equally well solved within the framework of the standard setup. It is easy to understand what is going on from Fig. 1 of Moller (13). The middle curve is the working model, and had this been chosen according to the prescription of O'Quigley et al. (15), then the horizontal line at the target θ = 0.4 would meet the lowest level rather than level 10 as in the figure. Essentially, this particular Bayesian setup expresses the idea that we believe the correct level to be level 10, although there is some uncertainty in this, quantified by our prior. It is therefore perfectly natural that, integrating some further small piece of information (the first patient treated at the lowest level and tolerating the treatment), we then recommend level 10 to patient 2. The posterior and the prior are almost indistinguishable (see also Sect. III.B and Ahn (1)). The prior uncertainty should be quantified by the function g(a). In Moller's example, the choice of an exponential prior would not seem a good one. The subsequent attempt to make the exponential prior noninformative by modifying the conceptual doses is certainly interesting, albeit not the easiest way to go, and would need further study before it could be recommended. The comparisons of the respective performances of restricted CRM and original CRM are based on this rather unusual setup, certainly not the CRM as described by O'Quigley et al. (15). Furthermore, these comparisons, unlike others in this area, assume the particular model to truly generate the data. This is not realistic. For these reasons, conclusions based on the comparisons should not be given very much weight. Skipping doses is an issue that necessitates some thought. It should be viewed as an operating characteristic of the method.
As mentioned earlier, such operating characteristics depend on the choice of model and on given design features such as the number of dose levels and the value of θ. The model chosen by O'Quigley et al. in the context of six doses is such that it is not possible to skip doses when escalating. For larger numbers of doses or other choices of models, skipping could occur. Certainly we would not skip from level 1 to level 10, as happened in Moller (13), unless we made this a feature of the design. Typically, we would not work with such models. However, we may wish to use a model that could jump a level, for instance with between 12 and 20 levels and relatively few subjects. Indeed, since there is always the conceptual possibility of having intermediary levels, we are of course inevitably ''jumping'' dose levels. This is therefore not a real issue; the relevant issue is to understand at the trial outset just how we proceed through the available levels, given different models. This can be relatively rapid or relatively slow, depending on the initial design and the chosen model. If the experimenter wishes to modify these properties, then this
can be done by changing the model. This is in our view to be preferred over ‘‘modified,’’ ‘‘restricted,’’ or other ‘‘improved’’ CRM designs where rules of thumb, having their inspiration largely from the old, and very poorly behaved, up and down schemes are grafted onto a methodology with a more solid foundation and whose operating characteristics can be anticipated. This is not to say that improvements cannot be made. But they are unlikely to come from ad hoc improvisations that have, unfortunately, dominated the area of phase I trial design for many years. Moller (13) also refers to extended CRM. This is a particular case of the two-stage designs described in Section III.C. As we pointed out there, we believe the two-stage design to be the most flexible and generally applicable design.
VII. BAYESIAN APPROACHES

The CRM is often referred to as the Bayesian alternative to the classic up and down designs used in phase I studies. Section III hopefully makes it clear that there is nothing especially Bayesian about the CRM (see also 30). In O'Quigley et al. (15), for the sake of simplicity, Bayesian estimators and vague priors were proposed. However, there is nothing to prevent us from working with other estimators, in particular the maximum likelihood estimator. More fully Bayesian approaches, and not simply a Bayesian estimator, have been suggested for use in the context of phase I trial designs. By more fully we mean more in the Bayesian spirit of inference, in which we quantify prior information, obtained from outside the trial and solicited from clinicians and/or pharmacologists. Decisions are made more formally using tools from decision theory. Any prior information can subsequently be incorporated via the Bayes formula into a posterior density that also involves the actual current observations. Given the typically small sample sizes often used, a fully Bayesian approach has some appeal in that we would not wish to waste any relevant information at hand. Unlike the setup described by O'Quigley et al. (15), we could also work with informative priors. Gatsonis and Greenhouse (8) consider two-parameter probit and logit models for dose response and study the effect of different prior distributions. Whitehead and Williamson (31) carried out similar studies but with attention focusing on logistic models and beta priors. They worked with some of the more classic notions from optimal design for choosing the dose levels in a bid to establish whether much is lost by using suboptimal designs. O'Quigley et al. (15) ruled out criteria based on optimal design because of the ethical need to attempt to assign the sequentially included
patients at the most appropriate level for the patient. This same point was also emphasized by Whitehead and Williamson (31), who suggest that CRM could be viewed as a special case of their designs, with the second parameter being assigned a degenerate prior and thereby behaving as a constant. Although in some senses this view is technically correct, it can be misleading in that, for the single sample case, two-parameter CRM and one-parameter CRM are fundamentally different. Two-parameter CRM was seen to behave poorly (15) and is generally inconsistent (25). This was true even when the data were truly generated by the same two-parameter model, an unintuitive finding at first glance but one that makes sense in the light of the comments in Section VI.A. We have to view the single parameter as necessary in the homogeneous case and not simply a parametric restriction to facilitate numerical integration (see also Sect. VI.A). We do not believe it will be possible to demonstrate large sample consistency for either the Gatsonis and Greenhouse (8) approach or that of Whitehead and Williamson (31), as was done for CRM by Shen and O'Quigley (25). It may well be argued that large sample consistency is not very relevant in such typically small studies. However, it does provide some theoretical comfort and hints that for finite samples things might work out well too. At the very least, if we fail to achieve large sample consistency, then we might carry out large numbers of finite-sample studies, simulating most often under realities well removed from our working model. This was done by O'Quigley et al. (15) for the usual underparameterized CRM. Such comparisons remain to be done for the Bayesian methods discussed here. Nonetheless, we can conclude that a judicious choice of model and prior, not running into serious conflict with the subsequent observations, may help inference in some special cases.
A quite different Bayesian approach has been proposed by Babb et al. (5). The context is fully Bayesian. Rather than concentrate experimentation at some target level, as does CRM, the aim here is to escalate as fast as possible toward the MTD while sequentially safeguarding against overdosing. This is interesting in that it could be argued that the aim of the approach translates the clinician's objective in some ways more directly than does CRM. Model misspecification was not investigated and would be an interesting area for further research. The approach appears promising, and the methodology may be a useful modification of CRM when the primary concern is avoiding overdosing and we are in a position to have a prior on a two-parameter function. As above, there may be concerns about large sample consistency when working with a design that tends to settle on some level. A promising area for Bayesian formulations is one where we may have little overall knowledge of any dose–toxicity relationship but we may have some, possibly considerable, knowledge of some secondary aspect of the problem. Consider the two-group case. For the actual dose levels we are looking for we may know almost nothing. Uninformative Bayes or maximum likelihood would then seem appropriate. But we may well know that one of the groups, for example, a group weakened by extensive prior therapy, is most likely to have a level strictly less than that for the other group. Careful parametrization would enable this information to be included as a constraint. However, rather than work with a rigid and unmodifiable constraint, a Bayesian approach would allow us to specify the anticipated direction with high probability while enabling the accumulating data to override this assumed direction if the two run into serious conflict. Exactly the same idea could be used in a case where we believe there may be group heterogeneity but that it is very unlikely that the correct MTDs differ by more than a single level. Incorporating such information will improve efficiency. As already mentioned in Section V, under some strong assumptions we can achieve clear efficiency gains when incorporating information on the graded toxicities. Such gains can be wiped out, and even become negative through bias, when the assumptions are violated. Once again, a Bayesian setup opens up the possibility of compromise so that constraints become modifiable in the light of accumulating data.
VIII. RESEARCH DIRECTIONS

There are many outstanding questions, both practical and theoretical, concerning CRM. We do not believe that ad hoc modifications such as MCRM will be fruitful, and we do not therefore see this as a useful research area. Nonetheless, operating characteristics depend to some extent on arbitrary specification parameters such as the chosen model, and it would clearly be helpful to have a better understanding of this. A deep understanding of how certain types of changes to the model correspond to certain types of changes in operating characteristics could ultimately lead to selecting models on the basis of some particularly desired behavior, for instance, relatively rapid escalation to the higher levels, or models that require accumulating a lot of precision on the safety of some particular level before further escalation. Very often we do not know the number of levels that may be used. In such cases the model cannot be determined in advance. In our own applied work, we have improvised when faced with such situations via two-stage designs, with the model being constructed at the beginning of the second stage. A more systematic approach may be useful. Regression models have been suggested to address issues of patient heterogeneity (26). This appears natural, but given the typically small sample sizes and the implementation algorithms via underparameterized models, great care is
needed. Deeper studies are required to show how and in which situations advantage can be drawn from a regression model. PK/PD information, some or all of which may not be available before deciding on the appropriate dose level for the patient, raises particular methodological questions. Within-patient dose escalation is frequently undertaken in practice but not analyzed as such. Indeed, regulatory agencies sometimes disallow the inclusion of information on doses other than that at which the patient was first treated. Nonetheless, the patient, even if off study, will continue to be treated to the clinician's best ability and often at doses higher than that initially given. Modeling would necessarily be difficult in view of the complex potential relationships governing the outcomes at different levels for the same patient, whether toxicities are cumulative or whether cumulative treatment provides some kind of protection. Finally, Bayesian approaches, in conjunction with careful modeling, open up many possibilities. Some of these are mentioned in the preceding section. There are numerous outstanding theoretical issues to be resolved: conditions for convergence, inference under misspecified models, optimality, incorporation of random allocation, and efficiency of different stopping rules are just some of these. The most interesting problems, as always, are those arising from practical applications, and as the method is increasingly used these will require more attention from statisticians.
REFERENCES

1. Ahn C. An evaluation of phase I cancer clinical trial designs. Stat Med 1998; 17:1537–1549.
2. Anbar D. The application of stochastic methods to the bioassay problem. J Stat Planning Inference 1977; 1:191–206.
3. Anbar D. A stochastic Newton-Raphson method. J Stat Planning Inference 1978; 2:153–163.
4. Anbar D. Stochastic approximation methods and their use in bioassay and Phase I clinical trials. Commun Stat 1984; 13:2451–2467.
5. Babb J, Rogatko A, Zacks S. Cancer Phase I clinical trials: efficient dose escalation with overdose control. Stat Med 1998; 17:1103–1120.
6. Chevret S. The continual reassessment method in cancer phase I clinical trials: a simulation study. Stat Med 1993; 12:1093–1108.
7. Faries D. Practical modifications of the continual reassessment method for phase I cancer clinical trials. J Biopharm Stat 1994; 4:147–164.
8. Gatsonis C, Greenhouse JB. Bayesian methods for phase I clinical trials. Stat Med 1992; 11:1377–1389.
9. Goodman S, Zahurak ML, Piantadosi S. Some practical improvements in the continual reassessment method for phase I studies. Stat Med 1995; 14:1149–1161.
10. Huber PJ. The behavior of maximum likelihood estimates under nonstandard conditions. Proc 6th Berkeley Symp 1967; 1:221–233.
11. Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon R. A comparison of two Phase I trial designs. Stat Med 1994; 13:1799–1806.
12. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in Phase I clinical trials: evidence for increased precision. JNCI 1993; 85:217–223.
13. Moller S. An extension of the continual reassessment method using a preliminary up and down design in a dose finding study in cancer patients in order to investigate a greater number of dose levels. Stat Med 1995; 14:911–922.
14. O'Quigley J. Estimating the probability of toxicity at the recommended dose following a Phase I clinical trial in cancer. Biometrics 1992; 48:853–862.
15. O'Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for Phase I clinical trials in cancer. Biometrics 1990; 46:33–48.
16. O'Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.
17. O'Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Biometrics 1996; 52:163–174.
18. O'Quigley J, Reiner E. A stopping rule for the continual reassessment method. Biometrika 1998; 85:741–748.
19. O'Quigley J. Another look at two Phase I clinical trial designs (with commentary). Stat Med 1999; 18:2683–2692.
20. O'Quigley J, Shen L, Gamst A. Two sample continual reassessment method. J Biopharm Stat 1999; 9:17–44.
21. Paoletti X, O'Quigley J. Using graded toxicities in Phase I trial design. Technical Report 4, Biostatistics Group, University of California at San Diego, 1998.
22. Piantadosi S, Liu G. Improved designs for dose escalation studies using pharmacokinetic measurements. Stat Med 1996; 15:1605–1618.
23. Reiner E, Paoletti X, O'Quigley J. Operating characteristics of the standard phase I clinical trial design.
Comp Stat Data Anal 1998; 30:303–315.
24. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951; 22:400–407.
25. Shen LZ, O'Quigley J. Consistency of continual reassessment method in dose finding studies. Biometrika 1996; 83:395–406.
26. Simon R, Freidlin B, Rubinstein L, Arbuck S, Collins J, Christian M. Accelerated titration designs for Phase I clinical trials in oncology. JNCI 1997; 89:1138–1147.
27. Silvapulle MJ. On the existence of maximum likelihood estimators for the binomial response models. J R Stat Soc B 1981; 43:310–313.
28. Storer BE. Design and analysis of Phase I clinical trials. Biometrics 1989; 45:925–937.
29. Storer BE. Small-sample confidence sets for the MTD in a phase I clinical trial. Biometrics 1993; 49:1117–1125.
30. Storer BE. Phase I clinical trials. In: Encyclopedia of Biostatistics. New York: Wiley, 1998.
31. Whitehead J, Williamson D. Bayesian decision procedures based on logistic regression models for dose-finding studies. J Biopharm Stat 1998; 8:445–467.
32. Wu CFJ. Efficient sequential designs with binary data. J Am Stat Assoc 1985; 80:974–984.
33. Wu CFJ. Maximum likelihood recursion and stochastic approximation in sequential designs. In: van Ryzin J, ed. Adaptive Statistical Procedures and Related Topics. IMS Monograph 8. Hayward, CA: Institute of Mathematical Statistics, 1986, pp 298–313.
34. Lai TL, Robbins H. Adaptive design and stochastic approximation. Ann Stat 1979; 7:1196–1221.
3
Choosing a Phase I Design

Barry E. Storer
Fred Hutchinson Cancer Research Center, Seattle, Washington
I. INTRODUCTION AND BACKGROUND
Although the term phase I is sometimes applied generically to almost any ''early'' trial, in cancer drug development it usually refers specifically to a dose-finding trial whose major end point is toxicity. The goal is to find the highest dose of a potential therapeutic agent that has acceptable toxicity; this dose is referred to as the maximum tolerable dose (MTD) and is presumably the dose that will be used in subsequent phase II trials evaluating efficacy. Occasionally, one may encounter trials that are intermediate between phase I and phase II, referred to as phase IB trials. This is a more heterogeneous group but typically includes trials evaluating some measure of biological efficacy over a range of doses that have been found to have acceptable toxicity in a phase I (or phase IA) trial. This chapter focuses exclusively on phase I trials with a toxicity end point. What constitutes acceptable toxicity of course depends on the potential therapeutic benefit of the drug. There is an implicit assumption with most anticancer agents of a positive correlation between toxicity and efficacy, but most drugs that will be evaluated in phase I trials will prove ineffective at any dose. The problem of defining an acceptably toxic dose is complicated by the fact that patient response is heterogeneous: At a given dose, some patients may experience little or no toxicity, whereas others may have severe or even fatal toxicity. Since the response of the patient will be unknown before the drug is given, acceptable toxicity is typically defined with respect to the patient population as a whole. For example, given a
toxicity grading scheme ranging from 0 to 5 (none, mild, moderate, severe, life threatening, fatal), one might seek the dose where, on average, one out of three patients would be expected to experience a grade 3 or worse toxicity. The latter is referred to as ''dose-limiting toxicity'' (DLT) and does not need to correspond to a definition of unacceptable toxicity in an individual patient. When defined in terms of the presence or absence of DLT, the MTD can be defined as some quantile of a dose–response curve. Notationally, if Y is a random variable whose possible values are 1 and 0, respectively, depending on whether a patient does or does not experience DLT, and for dose d we have ψ(d) = Pr(Y = 1 | d), then the MTD is defined by ψ(d_MTD) = θ, where θ is the desired probability of toxicity. There are two significant constraints on the design of a phase I trial. The first is the ethical requirement to approach the MTD from below, so that one must start at a dose level believed almost certainly to be below the MTD and gradually escalate upward. The second is the fact that the number of patients typically available for a phase I trial is relatively small, say 15–30, and is not traditionally driven by rigorous statistical considerations requiring a specified degree of precision in the estimate of the MTD. The pressure to use only small numbers of patients is large: literally dozens of drugs per year may come forward for evaluation, and each combination with other drugs, each schedule, and each route of administration requires a separate trial. Furthermore, the number of patients for whom it is considered ethically justified to participate in a trial with little evidence of efficacy is limited. The latter limitation also has implications for the relevance of the MTD in subsequent phase II trials of efficacy. Since the patient populations are different, it is not clear that the MTD estimated in one population will yield the same result when implemented in another.
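The definition ψ(d_MTD) = θ can be made concrete with a small sketch. Under an assumed logistic dose–toxicity curve, the MTD is obtained by direct inversion; the logistic form and the intercept and slope below are invented purely for illustration.

```python
import math

def psi(d, alpha=-6.0, beta=0.01):
    """Pr(DLT | dose d) under a hypothetical logistic dose-toxicity curve."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * d)))

def mtd(theta, alpha=-6.0, beta=0.01):
    """Invert psi(d) = theta: the MTD is the theta-quantile of the curve."""
    return (math.log(theta / (1.0 - theta)) - alpha) / beta

d_mtd = mtd(0.25)
print(d_mtd, psi(d_mtd))   # psi evaluated at the MTD recovers theta
```

In practice the curve is unknown, which is precisely why the escalation designs of the next section are needed; the inversion above only formalizes what the target of estimation is.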
II. DESIGNS FOR PHASE I TRIALS

As a consequence of the above considerations, the traditional phase I trial design uses a set of fixed dose levels that have been specified in advance, that is, d ∈ {d_1, d_2, . . . , d_K}. The choice of the initial dose level d_1 and the dose spacing are discussed in more detail below. Beginning at the first dose level, small numbers of patients are entered, typically three to six, and the decision to escalate or not depends on a prespecified algorithm related to the occurrence of DLT. When a dose level is reached with unacceptable toxicity, then the trial is stopped.

A. Initial Dose Level and Dose Spacing
The initial dose level is generally derived either from animal experiments, if the agent in question is completely novel, or by conservative consideration of previous human experience, if the agent in question has been used before but with a different schedule, route of administration, or with other concomitant drugs. A common starting point based on the former is from 1/10 to 1/3 of the mouse LD10, the dose that kills 10% of mice, adjusted for the size of the animal on a per kilogram basis or by some other method. Subsequent dose levels are determined by increasing the preceding dose level by decreasing multiples, a typical sequence being {d_1, d_2 = 2d_1, d_3 = 1.67d_2, d_4 = 1.5d_3, d_5 = 1.4d_4, and thereafter d_{k+1} = 1.33d_k, . . .}. Such sequences are often referred to as ''modified Fibonacci.'' Note that after the first few increments, the dose levels are equally spaced on a log scale. With some agents, particularly biological agents, the dose levels may be determined by log spacing, that is, {d_1, d_2 = 10d_1, d_3 = 10d_2, d_4 = 10d_3, . . .}, or approximate half-log spacing, that is, {d_1, d_2 = 3d_1, d_3 = 10d_1, d_4 = 10d_2, d_5 = 10d_3, . . .}.

B. Traditional Escalation Algorithms

A wide variety of dose escalation rules may be used. For purposes of illustration, we describe the following, which is often referred to as the traditional ''3 + 3'' design. Beginning at k = 1:

(A) Evaluate three patients at d_k:
    (A1) If zero of three have DLT, then increase dose to d_{k+1} and go to (A).
    (A2) If one of three have DLT, then go to (B).
    (A3) If at least two of three have DLT, then go to (C).
(B) Evaluate an additional three patients at d_k:
    (B1) If one of six have DLT, then increase dose to d_{k+1} and go to (A).
    (B2) If at least two of six have DLT, then go to (C).
(C) Discontinue dose escalation.

If the trial is stopped, then the dose level below that at which excessive DLT was observed is the MTD.
Some protocols may specify that if only three patients were evaluated at that dose level, then an additional three should be entered, for a total of six, and that the process should proceed downward, if necessary, so that the MTD becomes the highest dose level at which no more than one toxicity is observed in six patients. The actual θ that is desired is generally not defined when such algorithms are used, but implicitly 0.17 ≤ θ ≤ 0.33, so we could take θ ≈ 0.25.

Another example of a dose escalation algorithm, referred to as the "best-of-5" design, is described here. Beginning at k = 1:

(A) Evaluate three patients at dk:
(A1) If zero of three have DLT, then increase the dose to dk+1 and go to (A).
(A2) If one or two of three have DLT, then go to (B).
(A3) If three of three have DLT, then go to (D).
(B) Evaluate one additional patient at dk:
(B1) If one of four has DLT, then increase the dose to dk+1 and go to (A).
(B2) If two of four have DLT, then go to (C).
(B3) If three of four have DLT, then go to (D).
(C) Evaluate one additional patient at dk:
(C1) If two of five have DLT, then increase the dose to dk+1 and go to (A).
(C2) If three of five have DLT, then go to (D).
(D) Discontinue dose escalation.

Again, the value of θ is not explicitly defined, but we could take θ ≈ 0.40.

Although traditional designs reflect an empirical, common-sense approach to the problem of estimating the MTD under the noted constraints, only brief reflection is needed to see that the determination of the MTD has a rather tenuous statistical basis. Consider the outcome of a trial using the 3 + 3 design in which the frequency of DLT for dose levels d1, d2, and d3 is 0 of 3, 1 of 6, and 2 of 6, respectively. Ignoring the sequential nature of the escalation procedure, the pointwise 80% confidence intervals for the rate of DLT at the three dose levels are, respectively, 0–0.54, 0.02–0.51, and 0.09–0.67. Although the middle dose would be taken as the estimated MTD, there is not even reasonably precise evidence that the toxicity rate for any of the three doses is either above or below the implied θ of approximately 0.25.

Crude comparisons among traditional dose escalation algorithms can be made by examining the level-wise operating characteristics of the design, that is, the probability of escalating to the next dose level given an assumption regarding the underlying probability of DLT at the current dose level. Usually this calculation is a function of simple binomial success probabilities. For example, in the 3 + 3 algorithm described above, the probability of escalation is B(0, 3; ψ(d)) + B(1, 3; ψ(d)) B(0, 3; ψ(d)), where B(r, n; ψ(d)) is the probability of r successes (toxicities) out of n trials (patients) with underlying success probability ψ(d) at the current dose level.
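The level-wise escalation probability just given can be computed directly. A minimal sketch follows; the function names are ours, and the best-of-5 expression is our own derivation from the stated rules (escalate on 0/3, on 1/4 after 1/3, or on 2/5 after 2/4).

```python
from math import comb

def binom_pmf(r, n, p):
    """B(r, n; p): probability of exactly r toxicities among n patients."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def p_escalate_3plus3(p):
    """Probability the 3+3 design escalates past a level with true DLT
    probability p: 0/3 DLT, or 1/3 DLT followed by 0/3 in the added cohort."""
    return binom_pmf(0, 3, p) + binom_pmf(1, 3, p) * binom_pmf(0, 3, p)

def p_escalate_bestof5(p):
    """Escalation probability for the best-of-5 design, derived from its
    rules: 0/3; or 1/3 then a non-DLT (1/4) or a DLT then a non-DLT (2/5);
    or 2/3 then two non-DLTs (2/5)."""
    return (binom_pmf(0, 3, p)
            + binom_pmf(1, 3, p) * ((1 - p) + p * (1 - p))
            + binom_pmf(2, 3, p) * (1 - p) ** 2)
```

Plotting these two functions over a grid of p values reproduces the kind of display shown in Fig. 1 and makes the relative aggressiveness of the two rules immediately visible.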
The probability of escalation can then be plotted over a range of ψ(d), as is done in Fig. 1 for the two algorithms described above. Although it is obvious from such a display that one algorithm is considerably more aggressive than the other, the level-wise operating characteristics do not provide much useful insight into whether a particular design will tend to select an MTD that is close to the target. More useful approaches to choosing among traditional designs and the other designs described below are discussed in Section III.

C. A Bayesian Approach: The Continual Reassessment Method
Figure 1 Level-wise operating characteristics of two traditional dose escalation algorithms. The probability of escalating to the next higher dose level is plotted as a function of the true probability of DLT at the current dose.

The small sample size and low information content of the data derived from traditional methods have suggested to some the usefulness of Bayesian methods to estimate the MTD. In principle, this approach allows one to combine any prior information available regarding the value of the MTD with data subsequently collected in the phase I trial to obtain an updated estimate reflecting both. The most fully developed Bayesian approach to phase I design is the continual reassessment method (CRM) proposed by O'Quigley and colleagues (1,2). From among a small set of possible dose levels, say {d1, . . ., d6}, experimentation begins at the dose level that the investigators believe, based on all available information, is the most likely to have an associated probability of DLT equal to the desired θ. It is assumed that there is a simple family of monotone dose–response functions ψ such that for any dose d and probability of toxicity p there exists a unique a where ψ(d, a) = p, and in particular that ψ(d_MTD, a0) = θ. An example of such a function is ψ(d, a) = [(tanh d + 1)/2]^a. Note that ψ is not necessarily assumed to be a dose–response function relating a characteristic of the dose levels to the probability of toxicity. That is, d need not correspond literally to the dose of a drug. In fact, the treatments at each of the dose levels may be completely unrelated, as long as the probability of toxicity increases from each dose level to the next; in this case d could be just the index
of the dose levels. The uniqueness constraint implies in general the use of one-parameter models and explicitly eliminates popular two-parameter dose–response models like the logistic. In practice, the latter have a tendency to become "stuck" and oscillate between dose levels when any data configuration leads to a large estimate of the slope parameter.

A prior distribution g(a) is assumed for the parameter a such that for the initial dose level, for example d3, either ∫₀^∞ ψ(d3, a)g(a) da = θ or, alternatively, ψ(d3, µa) = θ, where µa = ∫₀^∞ a g(a) da. The particular prior used should also reflect the degree of uncertainty regarding the probability of toxicity at the starting dose level; in general, this will be quite vague. After each patient is treated and the presence or absence of toxicity observed, the distribution g(a) is updated, along with the estimated probabilities of toxicity at each dose level, calculated by either method above (1). The next patient is then treated at the dose level minimizing some measure of the distance between the current estimate of the probability of toxicity and θ. After a fixed number n of patients has been entered sequentially in this fashion, the dose level selected as the MTD is the one that would be chosen for a hypothetical (n + 1)st patient.

An advantage of the CRM design is that it makes full use of all the data at hand to choose the next dose level. Even if the dose–response model used in the updating is misspecified, the CRM will tend eventually to select the dose level that has probability of toxicity closest to θ (3), although its practical performance should be evaluated in the small-sample setting typical of phase I trials. A further advantage is that, unlike traditional algorithms, the design is easily adapted to different values of θ.
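A likelihood-based update of the kind just described can be sketched as follows, using the tanh working model above with doses on a standardized scale. The grid search is a crude stand-in for a proper optimizer, its bounds are arbitrary, and the sketch assumes the heterogeneity condition (at least one DLT and one non-DLT observed) already holds.

```python
import math

def psi(d, a):
    """One-parameter working model from the text: psi(d, a) = [(tanh d + 1)/2]**a."""
    return ((math.tanh(d) + 1.0) / 2.0) ** a

def log_lik(a, data):
    """Binomial log-likelihood for observed (dose, dlt_indicator) pairs."""
    ll = 0.0
    for d, y in data:
        p = psi(d, a)
        ll += math.log(p) if y else math.log(1.0 - p)
    return ll

def crm_next_dose(doses, data, theta, grid=None):
    """Likelihood-based CRM step: fit a by grid search, then choose the dose
    whose estimated DLT probability is closest to theta."""
    grid = grid or [0.05 * i for i in range(1, 200)]  # a in (0, 10), arbitrary
    a_hat = max(grid, key=lambda a: log_lik(a, data))
    return min(doses, key=lambda d: abs(psi(d, a_hat) - theta))

# Hypothetical standardized dose labels and outcomes to date
next_dose = crm_next_dose([-1.0, -0.5, 0.0, 0.5, 1.0],
                          [(-1.0, 0), (-0.5, 0), (0.0, 1)], theta=1/3)
```

In a fully Bayesian CRM the maximization over a would be replaced by integration against the posterior for a; the dose-selection step is the same.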
In spite of these advantages, some practitioners object philosophically to the Bayesian approach, and it is clear in the phase I setting that the choice of prior can have a measurable effect on the estimate of the MTD (4). However, the basic framework of the CRM can easily be adapted to a non-Bayesian setting and can conform in practice more closely to traditional methods (5). For example, there is nothing in the approach that prohibits one from starting at the same low initial dose as would be common in traditional trials, or from updating after groups of three patients rather than after single patients. Allowing for some ad hoc deterministic rules to start the trial off, the Bayesian prior can be abandoned entirely and the updating after each patient can be fully likelihood based.3

D. Storer's Two-Stage Design
Storer (6,7) explored a combination of more traditional methods implemented in such a way as to minimize the number of patients treated at low dose levels and to focus sampling around the MTD; these methods also use an explicit dose–response framework to estimate the MTD.
The design has two stages and uses a combination of simple dose-escalation algorithms. The first stage assigns single patients at each dose level and escalates upward until a patient has DLT or downward until a patient does not have DLT. Algorithmically, beginning at k = 1:

(A) Evaluate one patient at dk:
(A1) If no patient has had DLT, then increase the dose to dk+1 and go to (A).
(A2) If all patients have had DLT, then decrease the dose to dk−1 and go to (A).
(A3) If at least one patient has had DLT and at least one has not, then if the current patient has not had DLT, go to (B); otherwise, decrease the dose to dk−1 and go to (B).

Note that the first stage meets the requirement for heterogeneity in response needed to start a likelihood-based CRM design and could be used for that purpose. The second stage incorporates a fixed number of cohorts of patients. If θ = 1/3, then it is natural to use cohorts of size three, as follows:

(B) Evaluate three patients at dk:
(B1) If zero of three have DLT, then increase the dose to dk+1 and go to (B).
(B2) If one of three has DLT, then go to (B).
(B3) If at least two of three have DLT, then decrease the dose to dk−1 and go to (B).

After completion of the second stage, a dose–response model is fit to the data and the MTD estimated by maximum likelihood or another method. For example, one could use a logistic model with logit(ψ(d)) = α + β log(d), whence the estimated MTD is log(d_MTD) = (logit(θ) − α̂)/β̂. A two-parameter model is used here to make the fullest use of the final sample of data; however, as noted above, two-parameter models have undesirable properties for purposes of dose escalation. To obtain a meaningful estimate of the MTD, one must have 0 < β̂ < ∞. If this is not the case, then one needs either to add additional cohorts of patients or to substitute a more empirical estimate, such as the last dose level or the hypothetical next dose level. As noted, the algorithm described above is designed with a target θ = 1/3 in mind.
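The final estimation step can be sketched as follows. A crude grid-search maximum likelihood fit stands in for a proper logistic regression; the grid ranges are arbitrary, and because the grid restricts β̂ to positive values the 0 < β̂ < ∞ condition is built in rather than checked. The example data are hypothetical.

```python
import math

def loglik(alpha, beta, data):
    """Log-likelihood under logit psi(d) = alpha + beta*log(d);
    data is a list of (dose, dlt_indicator) pairs."""
    ll = 0.0
    for d, y in data:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * math.log(d))))
        ll += math.log(p if y else 1.0 - p)
    return ll

def estimate_mtd(data, theta):
    """Grid-search MLE for (alpha, beta), then invert the fitted curve:
    log(d_MTD) = (logit(theta) - alpha_hat) / beta_hat."""
    grid = ((a / 10.0, b / 20.0)
            for a in range(-300, 1)      # alpha in [-30, 0], arbitrary
            for b in range(1, 101))      # beta in (0, 5], arbitrary
    alpha_hat, beta_hat = max(grid, key=lambda ab: loglik(ab[0], ab[1], data))
    return math.exp((math.log(theta / (1 - theta)) - alpha_hat) / beta_hat)

# Hypothetical second-stage data: 0/3 DLT at 100, 1/6 at 200, 2/6 at 333
data = ([(100, 0)] * 3 + [(200, 0)] * 5 + [(200, 1)]
        + [(333, 0)] * 4 + [(333, 1)] * 2)
d_hat = estimate_mtd(data, theta=1/3)
```

Because the observed DLT rate at dose 333 is exactly 1/3, the fitted curve crosses θ in that neighborhood and the estimated MTD lands near that dose.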
Although other quantiles could be estimated from the same estimated dose–response curve, a target θ different from 1/3 would probably lead one to use a modified second-stage algorithm. Extensive simulation experiments comparing this trial design with more traditional designs demonstrated the possibility of reducing the variability of point estimates of the MTD and reducing the proportion of patients treated at very low dose levels, without markedly increasing the proportion of patients treated at dose levels where the probability of DLT is excessive.

Storer (7) also evaluated different methods of providing confidence intervals for the MTD. Standard likelihood-based methods that ignore the sequential sampling scheme (a delta method, a method based on Fieller's theorem, and a likelihood ratio method) are often markedly anticonservative. More accurate confidence sets can be constructed by simulating the distribution of any of those test statistics at trial values of the MTD; however, the resulting confidence intervals are often extremely wide. Furthermore, the methodology is purely frequentist and may be unable to account for minor variations in the implementation of the design when a trial is conducted.

With some practical modifications, the two-stage design described above has been implemented in a real phase I trial (8). The major modifications included a provision to add additional cohorts of three patients, if necessary, until the estimate of β in the fitted logistic model becomes positive and finite; a provision that if the estimated MTD is higher than the highest dose level at which patients have actually been treated, the latter will be used as the MTD; and a provision to add intermediate dose levels if, in the judgment of the protocol chair, the nature or frequency of toxicity at a dose level precludes further patient accrual at that level.

E. Continuous Outcomes
Although not common in practice, it is useful to consider the case where the major outcome defining toxicity is a continuous measurement, for example, the nadir white blood count (WBC). This may or may not involve a fundamentally different definition of the MTD in terms of the occurrence of DLT. For example, suppose that DLT is determined by the outcome Y < c, where c is a constant, and we have Y ∼ N(α + βd, σ²). Then d_MTD = (c − α − Φ⁻¹(θ)σ)/β has the traditional definition that the probability of DLT is θ. The use of such a model in studies with small sample sizes makes some distributional assumption imperative. Some sequential design strategies in this context have been described by Eichhorn et al. (9). Alternatively, the MTD might be defined in terms of the mean response, that is, the dose where E(Y) = c. For the same simple linear model above, we then have d_MTD = (c − α)/β. Fewer distributional assumptions are needed to estimate d_MTD, and stochastic approximation techniques might be applied in the design of trials with such an end point (10). Nevertheless, the use of a mean response to define the MTD is not generalizable across drugs with different or multiple toxicities and consequently has received little attention in practice.

A recent proposal for a design incorporating a continuous outcome is that of Mick and Ratain (11). This is also a two-stage design, which for a hypothetical study of etoposide assumes a simple regression model relating dose to the WBC nadir: log(WBC) = α + β1 log(WBC_pre) + β2 d, where WBC_pre is the pretreatment WBC. The first stage uses cohorts of two patients. Ad hoc rules for dose escalation are determined by the toxicity experience in the current cohort; however, the model is fit each time, and cohorts of two are added until at least eight patients have been treated and β̂2 is significantly different from 0 at the 0.05 level of significance. In the second stage of the study, the dose for the next cohort of two patients is determined by fitting the regression model to the accumulated data and estimating the dose that leads to a mean nadir WBC of 2.5, that is, dk+1 = (log(2.5) − α̂ − β̂1 log(WBC_pre))/β̂2. This continues until at least eight patients have been treated and β̂2 is significantly different from 0 at the 0.001 level of significance. Simulation studies of this design using a pharmacokinetic model and a historical database demonstrated a clear increase in the precision of the MTD estimated from the model-based dose-escalation method, as compared with the MTD estimated from a more traditional design. The average sample size was also measurably smaller. Though such results are promising, the method applies only to situations where the DLT is a single continuous outcome. Furthermore, the simulation studies needed to establish the usefulness of the method in specific situations often require human pharmacokinetic data that might not be available at the time the study is being planned.
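The normal-model definition of the MTD given above can be computed directly. The parameter values below are illustrative only (not taken from the text); β is negative because higher doses depress the WBC nadir.

```python
from statistics import NormalDist

def mtd_continuous(alpha, beta, sigma, c, theta):
    """Dose d at which Pr(Y < c) = theta for Y ~ N(alpha + beta*d, sigma^2),
    i.e. d_MTD = (c - alpha - Phi_inverse(theta)*sigma) / beta."""
    z = NormalDist().inv_cdf(theta)
    return (c - alpha - z * sigma) / beta

# Illustrative values: nadir WBC model with DLT threshold c = 2.5
d_mtd = mtd_continuous(alpha=8.0, beta=-0.01, sigma=1.0, c=2.5, theta=1/3)
```

As a check, plugging the resulting dose back into the normal model reproduces Pr(Y < c) = θ exactly; the mean-response definition drops the Φ⁻¹(θ)σ term, which is why it requires fewer distributional assumptions.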
III. CHOOSING A PHASE I DESIGN

As noted above, only limited information regarding the suitability of a phase I design can be gained from the level-wise operating characteristics shown in Fig. 1. Furthermore, for designs like the CRM, which depend on data from prior dose levels to determine the next dose level, it is not even possible to specify a level-wise operating characteristic. Useful evaluations of phase I designs must involve the entire dose–response curve, which of course is unknown.

Many simple designs for which the level-wise operating characteristics can be specified can be formulated as discrete Markov chains (6). The states in the chain refer to treatment of a patient or group of patients at a dose level, with an absorbing state corresponding to the stopping of the trial. For various assumptions about the true dose–response curve, one can then calculate exactly many quantities of interest, such as the number of patients treated at each dose level, from the appropriate quantities determined from successive powers of the transition probability matrix P. Such calculations are fairly tedious, however, and do not accommodate designs with nonstationary transition probabilities, such as the CRM. Nor do they allow one to evaluate any quantity derived from all of the data, such as the MTD estimated after following Storer's two-stage design. For these reasons, simulation studies are the only practical tool for evaluating phase I designs. As with exact computations, one needs to specify a range of possible dose–response scenarios and then simulate the outcome of a large number of trials under each scenario. Here we give an example of such a study to illustrate the kinds of information that can be used in the evaluation and some of the considerations involved in the design of the study.
A. Specifying the Dose–Response Curve
We follow the modified Fibonacci spacing described in Section II. For example, in arbitrary units, we have d1 = 100.0, d2 = 200.0, d3 = 333.3, d4 = 500.0, d5 = 700.0, d6 = 933.3, d7 = 1244.4, . . . . We also define hypothetical dose levels below d1 that successively halve the dose above, that is, d0 = 50.0, d−1 = 25.0, . . . . The starting dose is always d1, and we assume that the true MTD is four dose levels higher, at d5, with θ = 1/3. To define a range of dose–response scenarios, we vary the probability of toxicity at d1 from 0.01 to 0.20 in steps of 0.01 and graph our results as a function of that probability. The true dose–response curve is determined by assuming that a logistic model holds on the log scale.4 Varying the probability of DLT at d1 while holding the probability at d5 fixed at θ results in a sequence of dose–response curves ranging from relatively steep to relatively flat. An even greater range could be encompassed by also varying the number of dose levels between the starting dose and the true MTD, which of course need not be exactly at one of the predetermined dose levels. The point is to study the sensitivity of the designs to features of the underlying dose–response curve, which obviously is unknown.
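The scenario construction can be sketched as follows: anchor a log-dose logistic curve at (d1, ψ(d1)) and (d5, θ) and solve for the slope and intercept, as spelled out in endnote 4. The function names are ours.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def scenario(d1, p1, d_mtd, theta):
    """Logistic curve on the log-dose scale through (d1, p1) and (d_mtd, theta):
    beta = (logit(theta) - logit(p1)) / (log d_mtd - log d1)."""
    beta = (logit(theta) - logit(p1)) / (math.log(d_mtd) - math.log(d1))
    alpha = logit(p1) - beta * math.log(d1)
    return alpha, beta

def psi(d, alpha, beta):
    """True DLT probability at dose d under the scenario."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * math.log(d))))

# Steepest scenario in the text: psi(d1) = 0.01 with d1 = 100, d5 = 700
alpha, beta = scenario(100.0, 0.01, 700.0, 1/3)
```

For ψ(d1) = 0.01 this gives β ≈ 2.01, and for ψ(d1) = 0.20 a much flatter curve, matching the range of true slopes quoted in the endnotes.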
B. Specifying the Designs
This simulation will evaluate the two traditional designs described above, Storer's two-stage design, and a non-Bayesian CRM design. It is important to make the simulation as realistic as possible in terms of how an actual clinical protocol would be implemented, or at least to recognize what differences might exist. For example, the simulation does not place a practical limit on the highest dose level, although it is rare for any design to escalate beyond d10. An actual protocol might have an upper limit on the number of dose levels, with a provision for how to define the MTD if that limit is reached. Similarly, the simulation always evaluates a full cohort of patients, whereas in practice, where patients are more likely entered sequentially than simultaneously, a 3 + 3 design might, for example, forgo the last patient in a cohort of three if the first two patients had experienced DLT.

1. Traditional 3 + 3 Design

This design is implemented as described in Section II. In the event that excessive toxicity occurs at d1, the MTD is taken to be d0. Although this is an unlikely occurrence in practice, a clinical protocol should specify any provision to decrease the dose if the stopping criteria are met at the first dose level.

2. Traditional Best-of-5 Design

Again implemented as described in Section II, with the same rules applied to stopping at d1.
3. Storer’s Two-Stage Design Implemented as described in Section II, with a second-stage sample size of 24 patients. A standard logistic model is fit to the data. If it is not the case that 0 ⬍ βˆ ⬍ ∞, then the geometric mean of the last dose level used and the dose level that would have been assigned to the next cohort is used as the MTD. In either case, if that dose is higher than the highest dose at which patients have actually been treated, then the latter is taken as the MTD. 4. Non-Bayesian CRM Design We start the design using the first stage of the two-stage design as described above. Once heterogeneity has been achieved, 24 patients are entered in cohorts of three. The first cohort is entered at the same dose level as for the second stage of the two-stage design; after that successive cohorts are entered using likelihood based updating of the dose–response curve. For this purpose we use a single parameter logistic model—a two-parameter model with β fixed at 0.75.5 After each updating, the next cohort is treated at the dose level with estimated probability of DLT closest in absolute value to θ; however, the next level cannot be more than one dose level higher than the current highest level at which any patients have been treated. The level that would be chosen for a hypothetical additional cohort is the MTD; however, if this dose is above the highest dose at which patients have been treated, the latter is taken as the MTD.
C. Simulation and Results

The simulation is performed by generating 5000 sequences of patients and applying each of the designs to each sequence for each dose–response curve being evaluated. A sequence of patients is really a sequence of pseudo-random numbers generated to be Uniform(0, 1). Each patient's number is compared with the hypothetical true probability of DLT at the dose level at which the patient is entered for the dose–response curve being evaluated. If the number is less than that probability, then the patient is taken to have experienced DLT.

Figure 2 displays results of the simulation study that relate to the estimate d̂_MTD. Since the dose scale is arbitrary, the results are presented in terms of ψ(d̂_MTD). Figure 2(a) displays the mean probability of DLT at the estimated MTD. The horizontal line at 1/3 is a point of reference for the target θ. Although none of the designs is unbiased, all except the conservative 3 + 3 design perform fairly well across the range of dose–response curves. The precision of the estimates, taken as the root mean squared error of the probabilities ψ(d̂_MTD), is shown in Fig. 2(b). In this regard the CRM and two-stage designs perform better than the best-of-5 design over most settings of the dose–response curve. One should also note that, in absolute terms, the precision of the estimates is not high even for the best designs.

Figure 2 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The true MTD is fixed at four dose levels above the starting dose, with θ = 1/3. Results are expressed in terms of p(MTD) = ψ(d̂_MTD).

In addition to the average properties of the estimates, it is also relevant to look at the extremes. Figure 2(c) and (d) present the fraction of trials where ψ(d̂_MTD) < 0.20 or ψ(d̂_MTD) > 0.50, respectively.6 The cutoff of 0.20 is the level at which the odds of DLT are half those at θ. Although this may not be an important consideration, to the extent that the target θ defines a dose with some efficacy in addition to toxicity, the fraction of trials below this arbitrary limit may represent cases in which the dose selected for subsequent evaluation in efficacy trials is "too low." Because of their common first stage, which uses single patients at the initial dose levels, the two-stage and CRM designs do best in this regard. Conversely, the cutoff used in Fig. 2(d) is the level at which the odds of toxicity are twice those at θ. Although the occurrence of DLT in and of itself is not necessarily undesirable, as the probability of DLT increases there is likely a corresponding increase in the probability of very severe or even fatal toxicity. Hence, the trials where the probability of DLT is above this arbitrary level may represent cases in which the dose selected as the MTD is "too high." In this case there are not large differences among the designs, and in particular the two designs that perform best in Fig. 2(c) do not carry an unduly large penalty. One could, of course, easily evaluate other limits if desired.

Some results related to the outcome of the trials themselves are presented in Fig. 3. Panels (a) and (b) present the overall fraction of patients treated below and above, respectively, the same limits as for the estimates in Fig. 2.
The two-stage and CRM designs perform best at avoiding treating patients at the lower dose levels; the two-stage design is somewhat better than the CRM design at avoiding treating patients at higher dose levels, although of course it does not do as well as the very conservative 3 + 3 design.

Sample size considerations are evaluated in Fig. 3(c) and (d). Panel (c) shows the mean number of patients treated. Because they share a common first stage and use the same fixed number of patients in the second stage, the two-stage and CRM designs yield identical results. The 3 + 3 design uses the smallest number of patients, but this is because it tends to stop well below the target. On average, the best-of-5 design uses six to eight fewer patients than the two-stage or CRM design. Figure 3(d) displays the mean number of "cycles" of treatment needed to complete the trial, where a cycle is the period of time over which a patient or group of patients must be treated and evaluated before a decision can be made about the dose level for the next patient or group. For example, the second stage in the two-stage and CRM designs above always uses eight cycles; each dose level in the 3 + 3 design uses either one or two cycles, and so on. This is a consideration only for situations where the time needed to complete a phase I trial is limited not by the rate of patient accrual but by the time needed to treat and evaluate each group of patients. In this case the results are qualitatively similar to those of Fig. 3(c).

Figure 3 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The true MTD is fixed at four dose levels above the starting dose, with θ = 1/3.

D. Summary and Conclusion

Based only on the results above, one would likely eliminate the 3 + 3 design from consideration. The best-of-5 design would probably be eliminated as well, owing to its lower precision and the greater likelihood that the MTD will be well below the target. On the other hand, the best-of-5 design uses fewer patients. If small patient numbers are a priority, it would be reasonable to consider an additional simulation in which the second-stage sample size for the two-stage and CRM designs is reduced to, say, 18 patients. This would put the average sample size for those designs closer to that of the best-of-5, and one could see whether they continued to maintain an advantage in the other respects. Between the two-stage and CRM designs, there is perhaps a slight advantage to the former in terms of greater precision and a smaller chance that the estimate will be too far above the target; however, the difference is likely not important in practical terms and might vary under other dose–response conditions.7

A desirable feature of the results shown is that both the relative and absolute properties of the designs do not differ much over the range of dose–response curves. Additional simulations could be carried out that vary the distance between the starting dose and the true MTD or place the true MTD between dose levels instead of exactly at a dose level.

To illustrate further some of the features of phase I designs and the necessity of studying each situation on a case-by-case basis, we repeated the simulation study above using a target θ = 0.20. Exactly the same dose–response settings are used, so that the results for the two traditional designs are identical to those shown previously.
The two-stage design is modified to use five cohorts of five patients but follows essentially the same rule for selecting the next level as described above, with "three" replaced by "five." Additionally, the final fitted model estimates the MTD associated with the new target, and of course the CRM design selects the next dose level based on the new target. The results for this simulation are presented in Figs. 4 and 5. In this case the best-of-5 design is clearly eliminated as too aggressive. However, and perhaps surprisingly, the 3 + 3 design performs nearly as well as, or better than, the supposedly more sophisticated two-stage and CRM designs. There is a slight disadvantage in terms of precision, but given that the mean sample size with the 3 + 3 design is nearly half that of the other two, this may be a reasonable trade-off. Of course, it could also be the case in this setting that using a smaller second-stage sample size would not adversely affect the two-stage and CRM designs.

Figure 4 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The dose–response curves are identical to those used for Fig. 2 but with θ = 0.20. Results are expressed in terms of p(MTD) = ψ(d̂_MTD).

Finally, we reiterate the point that the purpose of this simulation was to demonstrate some of the properties of phase I designs and of the process of simulation itself, not to advocate any particular design. Depending on the particulars of the trial at hand, any one of the four designs might be a reasonable choice. An important point to bear in mind is that traditional designs must be matched to the desired target quantile and will perform poorly for other quantiles. CRM designs are particularly flexible in this regard; the two-stage design can be modified only to a lesser extent.
Figure 5 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The dose–response curves are identical to those used for Fig. 3 but with θ = 0.20.
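The common-random-sequence scheme of Section III.C, in which the same Uniform(0, 1) sequences are applied to every design under comparison, can be sketched as follows. The escalate-while-no-DLT rule at the end is a toy stand-in for the four designs studied, and all names here are ours.

```python
import random

def simulate_trials(design, p_tox, n_trials=5000, n_max=30, seed=2001):
    """Apply design(us, p_tox) to n_trials pre-generated sequences of
    Uniform(0,1) draws; using the same draws for every design makes
    between-design comparisons less noisy.  Returns the mean true DLT
    probability at the selected MTD, an estimate of E[psi(d_MTD_hat)]."""
    rng = random.Random(seed)
    seqs = [[rng.random() for _ in range(n_max)] for _ in range(n_trials)]
    picks = [design(us, p_tox) for us in seqs]
    valid = [p_tox[k] for k in picks if 0 <= k < len(p_tox)]
    return sum(valid) / len(valid) if valid else float("nan")

def toy_design(us, p_tox):
    """Escalate in cohorts of three while no DLT is seen (illustration only)."""
    k = 0
    for i in range(0, len(us) - 2, 3):
        if any(u < p_tox[k] for u in us[i:i + 3]):
            return k - 1                 # stop: MTD is the level below
        if k == len(p_tox) - 1:
            return k
        k += 1
    return k

mean_p = simulate_trials(toy_design, p_tox=[0.05, 0.10, 0.20, 1/3])
```

Substituting the actual 3+3, best-of-5, two-stage, and CRM rules for `toy_design`, and sweeping `p_tox` over the scenarios of Section III.A, reproduces the structure of the study summarized in Figs. 2–5.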
ENDNOTES

1. Alternately, one could define Y to be the random variable representing the threshold dose at which a patient would experience DLT. The distribution of Y is referred to as a tolerance distribution, and the dose–response curve is the cumulative distribution function for Y, so that the MTD would be defined by Pr(Y ≤ dMTD) = θ. For a given sample size, the most effective way of estimating this quantile would be from a sample of threshold doses. Such data are nearly impossible to gather, however, as it is impractical to give each patient more than a small number of discrete doses. Further, the data obtained from sequential administration of different doses to the same patient would almost surely be biased, as one could never distinguish the cumulative effects of the different doses from the acute effects of the current dose level. Extended washout periods between doses are not a solution, since the condition of the patient, and hence the response to the drug, is likely to change rapidly for the typical patient in a phase I trial. For this reason, almost all phase I trials involve the administration of only a single dose level to each patient and the observation of the frequency of DLT in all patients treated at the same dose level.
2. In a true Fibonacci sequence, the increments would be approximately 2, 1.5, 1.67, 1.60, 1.63, and then 1.62 thereafter, converging on the golden ratio.
3. Without a prior, the dose–response model cannot be fit to the data until there is some heterogeneity in outcome, i.e., at least one patient with DLT and one patient without DLT. Thus, some simple rules are needed to guide the dose escalation until heterogeneity is achieved. Also, one may want to impose rules that restrict one from skipping dose levels during escalation, even if the fitted dose–response model would lead one to select a higher dose.
4. The usual formulation of the logistic dose–response curve would be that logit ψ(d) = α + β log d. In the above setup, we specify d1, ψ(d1), and that ψ(d5) = 1/3, whence β = {logit(1/3) − logit(ψ(d1))}/∆, where ∆ = log d5 − log d1, and α = logit(ψ(d1)) − β log d1.
5. This value does have to be tuned to the actual dose scale but is not particularly sensitive to the precise value. That is, similar results would be obtained with β in the range 0.5–1.
6. For reference, on the natural log scale the distance log(dMTD) − log(d1) ≈ 2, and the true value of β in the simulation ranges from 2.01 to 0.37 as ψ(d1) ranges from 0.01 to 0.20.
7. The see-saw pattern observed for all but the two-stage design is caused by changes in the underlying dose–response curve, as the probability of DLT at particular dose levels moves over or under the limit under consideration. Since the three designs select discrete dose levels as d̂MTD, this will result in a corresponding decrease in the fraction of estimates beyond the limit. The advantage of the two-stage design may seem surprising, given that the next dose level is selected only on the basis of the outcome at the current dose level, ignoring the information that CRM uses from all prior patients. However, the two-stage design also incorporates a final estimation procedure for the MTD that uses all the data and uses a richer family of dose–response models. This issue is examined in Storer (12).
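Endnote 2 is easy to verify numerically: the increments of a "true" Fibonacci escalation scheme are the ratios of consecutive Fibonacci numbers, which converge to the golden ratio. A small sketch:

```python
def fib_ratios(n):
    """Ratios of consecutive Fibonacci numbers: the escalation
    increments a 'true' Fibonacci dosing scheme would use."""
    a, b, out = 1, 1, []
    for _ in range(n):
        a, b = b, a + b
        out.append(b / a)
    return out
```

The first few ratios are 2, 1.5, 5/3 ≈ 1.67, 1.6, 13/8 = 1.625, ..., matching the sequence quoted in the endnote.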
REFERENCES

1. O'Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for Phase I clinical studies in cancer. Biometrics 1990; 46:33–48.
2. O'Quigley J, Chevret S. Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.
3. Shen LZ, O'Quigley J. Consistency of continual reassessment method under model misspecification. Biometrika 1996; 83:395–405.
4. Gatsonis C, Greenhouse JB. Bayesian methods for Phase I clinical trials. Stat Med 1992; 11:1377–1389.
5. O'Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Biometrics 1996; 52:673–684.
6. Storer B. Design and analysis of Phase I clinical trials. Biometrics 1989; 45:925–937.
7. Storer B. Small-sample confidence sets for the MTD in a Phase I clinical trial. Biometrics 1993; 49:1117–1125.
8. Berlin J, Stewart JA, Storer B, Tutsch KD, Arzoomanian RZ, Alberti D, Feierabend C, Simon K, Wilding G. Phase I clinical and pharmacokinetic trial of penclomedine utilizing a novel, two-stage trial design. Clin Oncol 1998; 16:1142–1149.
9. Eichhorn BH, Zacks S. Sequential search of an optimal dosage. J Am Stat Assoc 1973; 68:594–598.
10. Anbar D. Stochastic approximation methods and their use in bioassay and Phase I clinical trials. Commun Stat Theory Methods 1984; 13:2451–2467.
11. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in Phase I clinical trials: evidence for increased precision. J Nat Cancer Inst 1993; 85:217–223.
12. Storer B. An evaluation of Phase I designs for continuous dose response. Stat Med (in press).
4
Overview of Phase II Clinical Trials

Stephanie Green
Fred Hutchinson Cancer Research Center, Seattle, Washington
I. DESIGN
Standard phase II studies are used to screen new regimens for activity and to decide which ones should be tested further. To screen regimens efficiently, the decisions generally are based on single-arm studies using short-term end points (usually tumor response in cancer studies) in limited numbers of patients. The problem is formulated as a test of the null hypothesis H0: p = p0 versus the alternative hypothesis HA: p = pA, where p is the probability of response, p0 is the probability which, if true, would mean that the regimen was not worth studying further, and pA is the probability which, if true, would mean it would be important to identify the regimen as active and to continue studying it. Typically, p0 is a value at or somewhat below the historical probability of response to standard treatment for the same stage of disease, and pA is typically somewhat above it. For ethical reasons, studies of new regimens usually are designed with two or more stages of accrual, allowing early stopping due to inactivity of the regimen. A variety of approaches to early stopping have been proposed. Although several of these include options for more than two stages, only the two-stage versions are discussed in this chapter. (In typical clinical settings it is difficult to manage more than two stages.) An early approach, due to Gehan (1), suggested stopping if 0/N responses were observed, where the probability of 0/N was less than 0.05 under a specific alternative. Otherwise accrual was to be continued until the sample size was large enough for estimation at a specified level of precision.
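Gehan's first stage has a closed form: the smallest N for which 0/N responses has probability below 0.05 under the alternative, i.e., the smallest N with (1 − pA)^N < 0.05. A one-line sketch (the function name is mine):

```python
import math

def gehan_first_stage(p_alt, alpha=0.05):
    """Smallest N such that observing 0 responses in N patients has
    probability below alpha when the true response rate is p_alt.
    (A boundary case where the ratio is an exact integer would need
    a strict-inequality check.)"""
    return math.ceil(math.log(alpha) / math.log(1.0 - p_alt))
```

For example, `gehan_first_stage(0.2)` gives the familiar 14-patient first stage for a 20% target response rate.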
In 1982, Fleming (2) proposed stopping when results are inconsistent either with H0 or with HA: p = p′, where H0 is tested at level α and p′ is the alternative for which the procedure has power 1 − α. The bounds for stopping after the first stage of a two-stage design are the nearest integer to N1 p′ − Z1−α{Np′(1 − p′)}^1/2 (for concluding early that the regimen should not be tested further) and the nearest integer to N1 p0 + Z1−α{Np0(1 − p0)}^1/2 + 1 (for concluding early that the regimen is promising), where N1 is the first-stage sample size and N is the total after the second stage. At the second stage, H0 is accepted or rejected according to the normal approximation for a single-stage design. Since then other authors, rather than proposing tests, have proposed choosing stopping boundaries to minimize the expected number of patients required, subject to level and power specifications. Chang et al. (3) proposed minimizing the average expected sample size under the null and alternative hypotheses. Simon (4), recognizing the ethical imperative of stopping when the agent is inactive, recommended stopping early only for unpromising results and minimizing either the expected sample size under the null or, alternatively, the maximum sample size. A problem with these designs is that the sample size has to be attained exactly for the optimality properties to hold, so in practice they cannot be carried out faithfully in many settings. Particularly in multi-institution settings, studies cannot be closed after exactly the specified number of patients has been accrued. It takes time to get a closure notice out, and during this time more patients will have been approached to enter the trial. Patients who have been asked and have agreed to participate in a trial should be allowed to do so, and this means there is a period of time during which institutions can continue registering patients even though the study is closing.
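The first-stage bounds just quoted are easy to compute directly. The sketch below codes the two formulas as given in the text (with p′ playing the role of the design alternative; the exact ≤/≥ stopping conventions are an assumption here):

```python
from statistics import NormalDist

def fleming_stage1_bounds(n1, n, p0, p_alt, alpha=0.05):
    """First-stage stopping bounds for a Fleming two-stage design:
    stop for futility if responses <= a1, stop early for activity
    if responses >= b1.  n1 = first-stage size, n = total size."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a1 = round(n1 * p_alt - z * (n * p_alt * (1.0 - p_alt)) ** 0.5)
    b1 = round(n1 * p0 + z * (n * p0 * (1.0 - p0)) ** 0.5 + 1.0)
    return a1, b1
```

With p′ set to pA and α = 0.05, this reproduces the Fleming first-stage bounds listed in Table 1, e.g., (a1, b1) = (2, 6) for the 0.1 versus 0.3 design with N1 = 20 and N = 35.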
Furthermore, some patients may be found to be ineligible after the study is closed. It is rare to end up with precisely the number of patients planned, making application of fixed designs problematic. To address this problem, Green and Dahlberg (5) proposed designs allowing for variable attained sample sizes. The approach is to accrue patients in two stages with about the same number of patients per stage, to have level approximately 0.05 and power approximately 0.9, and to stop early if the agent appears unpromising. Specifically, the regimen is concluded unpromising and the trial is stopped early if the alternative is rejected at the 0.02 level after the first stage of accrual, and the agent is concluded promising if H0 is rejected at the 0.055 level after the second stage of accrual. The level 0.02 was chosen to balance the concern of treating the fewest possible patients with an inactive agent against the concern of rejecting an active agent after treating a chance series of poor-risk patients. Level 0.05 and power 0.9 are reasonable for solid tumors given the modest percentage of agents found to be active in this setting (6); less conservative values might be appropriate in more responsive diseases.
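The Green and Dahlberg rule can be applied to whatever sample sizes were actually attained, using exact binomial tail probabilities. A sketch of the published rule (the function name and return labels are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def green_dahlberg(x1, n1, x_total, n_total, p0, p_a):
    """Decision for a Green-Dahlberg two-stage phase II design with
    attained sample sizes n1 and n_total: stop ('not promising') if
    the alternative p_a is rejected at level 0.02 after stage 1;
    conclude 'promising' if p0 is rejected at level 0.055 after
    stage 2."""
    if binom_cdf(x1, n1, p_a) < 0.02:            # reject H_A: too few responses
        return "stop: not promising"
    if 1 - binom_cdf(x_total - 1, n_total, p0) < 0.055:  # reject H_0
        return "promising"
    return "not promising"
```

For the 0.1 versus 0.3 design with an attained first stage of 20 patients, one response stops the trial while two responses continue it, consistent with the first-stage bound a1 = 1 in Table 1.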
The design has the property that stopping at the first stage occurs when the estimate of the response probability is less than approximately p0, the true value that would mean the agent was not of interest. At the second stage the agent is concluded to warrant further study if the estimate of the response probability is greater than approximately (pA + p0)/2, which typically would be equal to or somewhat above the historical probability expected from other agents and a value at which one might be expected to be indifferent to the outcome of the trial. However, there are no optimality properties. Chen and Ng (7) proposed a different approach to flexible design by optimizing (with respect to expected sample size under p0) across possible attained sample sizes. They assumed a uniform distribution over sets of eight consecutive values of N1 and eight consecutive values of N; presumably, if information is available on the actual distribution in a particular setting, the approach could be used for a better optimization. Herndon (8) described another variation on the Green and Dahlberg designs. To address the problem of temporary closure of studies, an alternative approach is proposed that allows patient accrual to continue while results of the first stage are reviewed. Temporary closures are disruptive, so this approach might be reasonable when accrual is relatively slow with respect to submission of information (if accrual is too rapid, the ethical aim of stopping early due to inactivity is lost). Table 1 illustrates several of the design approaches mentioned above for level 0.05 and power 0.9 tests, including Fleming designs, Simon minimax designs, Green and Dahlberg designs, and Chen and Ng optimal design sets. Powers and levels are reasonable for all approaches. (Chen and Ng designs have correct level on average, although individual realizations have levels up to 0.075 among the tabled designs.)
Of the four approaches, Green and Dahlberg designs are the most conservative with respect to early stopping for level 0.05 and power 0.9, whereas Chen and Ng designs are the least. In another approach to phase II design, Storer (9) suggested a procedure similar to two-sided testing instead of the standard one-sided test. In this approach, the phase II trial is considered negative (HA: p ≥ pA is rejected) if the number of responses is sufficiently low, positive (H0: p ≤ p0 is rejected) if it is sufficiently high, and equivocal if it is intermediate (neither hypothesis rejected). For a value pm between p0 and pA, upper and lower rejection bounds rU and rL are chosen such that P(x ≥ rU | pm) < γ and P(x ≤ rL | pm) < γ, with pm and the sample size chosen to have adequate power to reject HA under p0 and H0 under pA. When p0 = 0.1 and pA = 0.3, an example of a Storer design is to test pm = 0.193 with γ = 0.33 and power 0.8 under p0 and pA. For a two-stage design, N1, N, rL1, rU1, rL2, and rU2 are 18, 29, 1, 6, 4, and 7, respectively. If the final result is equivocal (5 or 6 responses in 29 for this example), the conclusion is that other information is necessary to make a decision.
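The three-outcome logic of the quoted example can be expressed as a small classifier. The ≤/≥ boundary conventions are an assumption here; the default bounds are the ones given in the text for p0 = 0.1, pA = 0.3:

```python
def storer_outcome(x1, x_total=None,
                   r_l1=1, r_u1=6, r_l2=4, r_u2=7):
    """Classify the outcome of the two-stage Storer design quoted in
    the text (N1 = 18, N = 29).  x1 is the stage-1 response count;
    x_total is the final count if the second stage was run."""
    if x1 <= r_l1:
        return "negative"      # H_A rejected at the first stage
    if x1 >= r_u1:
        return "positive"      # H_0 rejected at the first stage
    if r_l2 < x_total < r_u2:
        return "equivocal"     # neither hypothesis rejected
    return "negative" if x_total <= r_l2 else "positive"
```

With these bounds, final totals of 5 or 6 responses in 29 are classified as equivocal, matching the example.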
Table 1 Examples of Designs

Design    H0 vs. HA      N1      a1   b1   N       a2   b2   Level                  Power
Fleming   0.05 vs. 0.2   20       0    4   40       4    5   0.052                  0.92
          0.1 vs. 0.3    20       2    6   35       6    7   0.053                  0.92
          0.2 vs. 0.4    25       5   10   45      13   14   0.055                  0.91
          0.3 vs. 0.5    30       9   16   55      22   23   0.042                  0.91
Simon     0.05 vs. 0.2   29       1    —   38       4    5   0.039                  0.90
          0.1 vs. 0.3    22       2    —   33       6    7   0.041                  0.90
          0.2 vs. 0.4    24       5    —   45      13   14   0.048                  0.90
          0.3 vs. 0.5    24       7    —   53      21   22   0.047                  0.90
Green     0.05 vs. 0.2   20       0    —   40       4    5   0.047                  0.92
          0.1 vs. 0.3    20       1    —   35       7    8   0.020                  0.87
          0.2 vs. 0.4    25       4    —   45      13   14   0.052                  0.91
          0.3 vs. 0.5    30       8    —   55      22   23   0.041                  0.91
Chen      0.05 vs. 0.2   17–24    1    —   41–46    4    5   0.046 (0.022–0.069)    0.90 (0.845–0.946)
                                           47–48    5    6
          0.1 vs. 0.3    12–14    1    —   36–39    6    7   0.050 (0.029–0.075)    0.90 (0.848–0.938)
                         15–19    2        40–43    7    8
          0.2 vs. 0.4    18–20    4    —   48      13   14   0.050 (0.034–0.073)    0.90 (0.868–0.937)
                         21–24    5        49–51   14   15
                         25       6        52–55   15   16
          0.3 vs. 0.5    19–20    6    —   55      21   22   0.050 (0.035–0.064)    0.90 (0.872–0.929)
                         21–23    7        56–58   22   23
                         24–26    8        59–60   23   24
                                           61–62   24   25

N1 is the sample size for the first stage of accrual, N is the total sample size after the second stage of accrual, ai is the bound for accepting H0 at stage i, and bi is the bound for rejecting H0 at stage i (i = 1, 2). Designs are listed for Fleming (2), Simon (4), and Green and Dahlberg (5); the optimal design set is listed for Chen and Ng (7). For the Chen and Ng sets, the attained N1 and N each range over eight consecutive values; the stage bounds are listed beside the sample-size ranges to which they apply, and level and power are given as the average with the range across the design set in parentheses.
II. ANALYSIS OF STANDARD PHASE II DESIGNS

As noted in Storer, the hypothesis testing framework used in phase II studies is useful for developing designs and determining sample sizes. The resulting decision rules are not always meaningful, however, except as tied to hypothetical follow-up trials that in practice may or may not be done. Thus, it is important to present confidence intervals for phase II results, which can be interpreted appropriately regardless of the nominal "decision" made at the end of the trial as to whether further study of the regimen is warranted. The main analysis issue after a multistage trial is how to generate a confidence interval, since the usual procedures assuming a single-stage design are biased. Various approaches to generating intervals have been proposed. These involve ordering the outcome space and inverting tail probabilities or test acceptance regions, as in estimation following single-stage designs; however, with multistage designs, the outcome space does not lend itself to any simple ordering. Jennison and Turnbull (10) order the outcome space by which boundary is reached, by the stage stopped at, and by the number of successes (stopping at stage i is considered more extreme than stopping at stage i + 1 regardless of the number of successes). A value p is not in the 1 − 2α confidence interval if the probability under p of the observed result, or one more extreme according to this ordering, is less than α (in either direction). Chang and O'Brien (11) instead order the sample space based on the likelihood principle. For each p, the sample space for a two-stage design is ordered according to L(x, N*) = (x/N*)^x {(N* − x)/N*}^(N*−x) / [p^x (1 − p)^(N*−x)], where N* is N1 if x can only be observed at the first stage and N if at the second (x is the number of responses). A value p is in the confidence "interval" if one half of the probability of the observed outcome, plus the probability of a more extreme outcome according to this ordering, is α or less.
The confidence set is not always strictly an interval, but the authors state that the effect of discontinuous points is negligible. Chang and O'Brien intervals were shorter than those of Jennison and Turnbull, although this in part would be due to the fact that Jennison and Turnbull did not adjust for discreteness by assigning only 1/2 of the probability of the observed value to the tail as Chang and O'Brien did. Duffy and Santner (12) recommend ordering the sample space by success percent and also develop intervals of shorter length than Jennison and Turnbull intervals. Although they produce shorter intervals, these last two approaches have the major disadvantage of requiring knowledge of the final sample size to calculate an interval for a study stopped at the first stage; as noted above, this typically will be random. The Jennison and Turnbull approach can be used since it only requires knowledge up to the stopping time. However, it is not entirely clear how important it is to adjust confidence intervals for the multistage nature of the design. From the point of view of appropriately reflecting the activity of the regimen tested, the usual interval assuming a single-stage design may be sufficient. In this setting the length of the confidence interval is not of primary importance (sample sizes are small and all intervals are long). The primary concern is that the interval appropriately reflect the activity of the regimen. Similar to Storer's idea, it is assumed that if the confidence interval excludes p0, the regimen is considered active, and if it excludes pA, the regimen is considered insufficiently active. If it excludes neither, results are equivocal; this seems reasonable whether or not continued testing is recommended for the better equivocal results. For Green and Dahlberg designs, the differences between Jennison and Turnbull and unadjusted tail probabilities are 0 if the trial stops at the first stage; for a trial stopped at the second stage with x total responses, the difference is −∑_{i=0}^{a1} bin(i, N1, p) ∑_{j=x−i}^{N2} bin(j, N − N1, p) for the upper tail and +∑_{i=0}^{a1} bin(i, N1, p) ∑_{j=x−i+1}^{N2} bin(j, N − N1, p) for the lower tail, where N2 = N − N1 and a1 is the stopping bound for accepting H0 at the first stage. Both the upper and lower confidence bounds are shifted to the right for Jennison and Turnbull intervals. These therefore will more often appropriately exclude p0 when pA is true and inappropriately include pA when p0 is true, compared with the unadjusted interval. However, the tail differences are generally small, resulting in small differences in the intervals. The absolute value of the upper tail difference is less than approximately 0.003 when the lower bound of the unadjusted interval is p0 (normal approximation), whereas the lower tail difference is constrained to be <0.02 for p > pA due to the early stopping rule. Generally, the shift in a Jennison and Turnbull interval is noticeable only for small x at the second stage.
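The upper-tail difference term just described is a double binomial sum and can be evaluated directly. A sketch computing its magnitude for given design constants (the function name is mine):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def jt_upper_tail_correction(x, n1, n2, a1, p):
    """Magnitude of the difference between the Jennison-Turnbull and
    unadjusted upper-tail probabilities at the second stage: the mass
    from first-stage paths stopped at or below a1 that would have
    reached x or more total responses had the trial continued."""
    return sum(binom_pmf(i, n1, p) *
               sum(binom_pmf(j, n2, p) for j in range(max(0, x - i), n2 + 1))
               for i in range(a1 + 1))
```

For the Green and Dahlberg 0.1 versus 0.3 design (N1 = 20, N2 = 15, a1 = 1), the correction is on the order of 10^-3 or less, consistent with the text's observation that the tail differences are generally small.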
As Rosner and Tsiatis (13) note, such results (activity in the first stage, no activity in the second) are unlikely, possibly suggesting that the independent identically distributed assumption was incorrect. For example, consider a common design for testing H0: p = 0.1 versus HA: p = 0.3: stop in favor of H0 at the first stage if 0 or 1 responses are observed in 20 patients; otherwise continue to a total of 35. Of the 36 possible trial outcomes (if planned sample sizes are achieved), the largest discrepancy in the 95% confidence intervals occurs if two responses are observed in the first stage and none in the second. For this outcome, the Jennison and Turnbull 95% confidence interval is from 0.013 to 0.25, whereas the unadjusted interval is from 0.007 to 0.19. Although not identical, both intervals lead to the same conclusion: the alternative is ruled out. For the Fleming and Green and Dahlberg designs listed in Table 1, Table 2 lists the probabilities that the 95% confidence intervals lie above p0 (evidence the regimen is active), below pA (evidence the regimen has insufficient activity to pursue), or cover both p0 and pA (inconclusive). (In no case are p0 and pA both excluded.) Probabilities are calculated for p = pA and p = p0 and for adjusted (by the Jennison and Turnbull method) and unadjusted intervals. For the Green and Dahlberg designs considered, probabilities for the Jennison and Turnbull and unadjusted intervals are the same in most cases. The only
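The "unadjusted" intervals referred to here are exact single-stage (Clopper–Pearson) intervals, which can be computed by bisection on the binomial tails. A sketch; for the 2-of-35 outcome in the example it gives approximately the quoted (0.007, 0.19):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.95, tol=1e-8):
    """Exact (unadjusted) binomial confidence interval, found by
    bisection on the binomial tail probabilities."""
    alpha = 1 - conf

    def solve(f):
        # find the boundary where the condition f flips from True to False
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return lo

    lower = 0.0 if x == 0 else solve(
        lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper
```

An interval computed this way can then be checked against p0 and pA exactly as described in the text.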
Table 2 Probabilities Under p0 and pA for Unadjusted and Jennison–Turnbull (J-T) Adjusted 95% Confidence Intervals

                                  CI above p0       CI below pA       CI includes p0 and pA
Design        Interval           p = p0  p = pA    p = p0  p = pA    p = p0  p = pA
0.05 vs. 0.2  Green J-T          0.014   0.836     0.704   0.017     0.282   0.147
              Green unadjusted   0.014   0.836     0.704   0.017     0.282   0.147
              Fleming J-T        0.024   0.854     0.704   0.017     0.272   0.129
              Fleming unadjusted 0.024   0.854     0.704   0.017     0.272   0.129
0.1 vs. 0.3   Green J-T          0.020   0.866     0.747   0.014     0.233   0.120
              Green unadjusted   0.020   0.866     0.747   0.014     0.233   0.120
              Fleming J-T        0.025   0.866     0.392   0.008     0.583   0.126
              Fleming unadjusted 0.025   0.866     0.515   0.011     0.460   0.123
0.2 vs. 0.4   Green J-T          0.025   0.856     0.742   0.016     0.233   0.128
              Green unadjusted   0.025   0.856     0.833   0.027     0.142   0.117
              Fleming J-T        0.023   0.802     0.421   0.009     0.556   0.189
              Fleming unadjusted 0.034   0.862     0.654   0.022     0.312   0.116
0.3 vs. 0.5   Green J-T          0.022   0.859     0.822   0.020     0.156   0.121
              Green unadjusted   0.022   0.859     0.822   0.020     0.156   0.121
              Fleming J-T        0.025   0.860     0.778   0.025     0.197   0.115
              Fleming unadjusted 0.025   0.860     0.837   0.030     0.138   0.110
discrepancy occurs for the 0.2 versus 0.4 design, when the final outcome is 11 of 45 responses. In this case the unadjusted interval is from 0.129 to 0.395, whereas the Jennison and Turnbull interval is from 0.131 to 0.402. There are more differences between adjusted and unadjusted probabilities for Fleming designs, the largest being for ruling out pA in the 0.2 versus 0.4 and 0.1 versus 0.3 designs. In these designs, no second-stage Jennison and Turnbull interval excludes the alternative, making this probability unacceptably low under p0. The examples presented suggest that adjusted confidence intervals do not necessarily result in more sensible intervals in phase II designs and in some cases are worse than not adjusting.
III. OTHER PHASE II DESIGNS

A. Multiarm Phase II Designs

Occasionally, the aim of a phase II study is not to decide whether a particular regimen should be studied further but to decide which of several new regimens
should be taken to the next phase of testing (assuming they cannot all be). In these cases selection designs are used, often formulated as follows: take on to further testing the treatment arm observed to be best by any amount, where the number of patients per arm is chosen to be large enough that if one treatment is superior by ∆ and the rest are equivalent, the probability of choosing the superior treatment is p. Simon et al. (14) published sample sizes for selection designs with response end points, whereas Liu et al. (15) provide sample sizes for survival end points. For survival, the approach is to choose the arm with the smallest estimated β in a Cox model. Sample size is chosen so that if one treatment is superior with β = −ln(1 + ∆) and the others have the same survival, then the superior treatment will be chosen with probability p. Theoretically this is all fine, but in reality the designs are not strictly followed. If response is poor in all arms, the conclusion is to pursue none of the regimens (not an option allowed in these designs). If a "striking" difference is observed, then the temptation is to bypass the confirmatory phase III trial. In a follow-up to the survival selection paper, Liu et al. (16) noted that the probability of an observed β < −ln(1.7), which cancer investigators consider striking, is not trivial: with two to four arms the probabilities are 0.07–0.08 when in fact there are no differences among the treatment arms.
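The operating characteristic of a pick-the-winner selection design is easy to check by simulation. The sketch below estimates, for a response end point, the probability that the truly superior arm has the strictly highest observed response count (ties counted as failures here, a conservative convention; published tables typically break ties at random):

```python
import random

def prob_correct_selection(p_best, p_others, n_per_arm, n_arms,
                           n_sim=20000, seed=7):
    """Monte Carlo estimate of the probability that the arm with true
    response rate p_best has the strictly highest observed response
    count among n_arms arms of n_per_arm patients each."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sim):
        best = sum(rng.random() < p_best for _ in range(n_per_arm))
        others = [sum(rng.random() < p_others for _ in range(n_per_arm))
                  for _ in range(n_arms - 1)]
        wins += best > max(others)
    return wins / n_sim
```

Varying `n_per_arm` until the estimate reaches the desired selection probability p reproduces the kind of sample-size calculation tabulated by Simon et al. for response end points.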
B. Phase II Designs with Multiple End Points
The selected primary end point of a phase II trial is just one consideration in the decision to pursue a new agent. Other end points (such as survival and toxicity if response is primary) must also be considered. For instance, a trial with a sufficient number of responses to be considered active may still not be of interest if too many patients experience life-threatening toxicity or if they all die quickly. On the other hand, a trial with an insufficient number of responses but a good toxicity profile and promising survival might still be considered for future trials. Designs have been proposed to incorporate multiple end points explicitly into phase II studies. Bryant and Day (17) proposed an extension of Simon's approach, identifying designs that minimize the expected accrual when the regimen is unacceptable either with respect to response or with respect to toxicity. Their designs are terminated at the first stage if either the number of responses is CR1 or less or the number of patients without toxicity is CT1 or less (or both). The regimen is concluded useful if the number of patients with responses and the number without toxicity are greater than CR2 and CT2, respectively, at the second stage. N1, N, CR1, CT1, CR2, and CT2 are chosen such that the probability of recommending the regimen when the probability of no toxicity is acceptable (pT = pT1) but response is unacceptable (pR = pR0) is less than or equal to αR, the probability of recommending the regimen when response is acceptable (pR = pR1) but toxicity is unacceptable (pT = pT0) is less than or equal to αT, and the probability of recommending the regimen when both are acceptable is 1 − β or better. The constraints are applied either uniformly over all possible correlations between toxicity and response or assuming independence of toxicity and response. Minimization is done subject to the constraints. For many practical situations, minimization assuming independence produces designs that perform reasonably well when the assumption is incorrect. Conaway and Petroni (18) proposed similar designs assuming that a particular relationship between toxicity and response, an optimality criterion, and a fixed total sample size are all specified. Proposed design constraints include limiting the probability of recommending the regimen to α or less when both response and toxicity are unacceptable (pR = pR0 and pT = pT0) and to γ or less anywhere else in the null region (pR ≤ pR0 or pT ≤ pT0). The following year, Conaway and Petroni (19) proposed boundaries allowing for trade-offs between toxicity and response. Instead of dividing the parameter space as in Fig. 1a, it is divided according to investigator specifications, such as in Fig. 1b, allowing for fewer patients with no toxicity when the response probability is higher (and the reverse). The proposed test is to accept H0 when T(x) < c1 at the first stage or < c2 at the second, subject to a maximum level α over the null region and power at least 1 − β when pR = pR1 and pT = pT1, for an assumed value of the association between response and toxicity. Here, T(x) = ∑ p*ij ln(p*ij/p̂ij), where ij indexes the cells of the 2 × 2 response–toxicity table, the p̂ij are the usual probability estimates, and the p*ij are the values achieving inf_{H0} ∑ pij ln(pij/p̂ij). (T(x) can be interpreted in some sense as a distance from p̂ to H0.) Interim stopping bounds are chosen to satisfy optimality criteria (the authors' preference is minimization of expected sample size under the null).
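Under the independence assumption, the probability that a Bryant–Day-type rule recommends the regimen factors into separate response and no-toxicity terms, each an exact double binomial sum. A sketch (the bounds passed in are hypothetical, not taken from the papers):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_recommend(n1, n, c_r1, c_t1, c_r2, c_t2, p_resp, p_notox):
    """P(regimen recommended) for a Bryant-Day-type bivariate rule,
    assuming response and absence of toxicity are independent:
    continue past stage 1 only if responses > c_r1 AND non-toxicities
    > c_t1; recommend if the final totals exceed c_r2 and c_t2."""
    n2 = n - n1

    def p_exceed(c1, c2, p):
        # P(stage-1 count > c1 and total count > c2) for one end point
        total = 0.0
        for i in range(c1 + 1, n1 + 1):
            tail2 = sum(binom_pmf(j, n2, p)
                        for j in range(max(0, c2 + 1 - i), n2 + 1))
            total += binom_pmf(i, n1, p) * tail2
        return total

    # independence: the joint recommendation probability factors
    return p_exceed(c_r1, c_r2, p_resp) * p_exceed(c_t1, c_t2, p_notox)
```

Evaluating this function over the null and alternative regions of Fig. 1a is one way to check the level and power constraints described above.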
Figure 1 Division of parameter space for two approaches to bivariate phase II design. (a) An acceptable probability of response and an acceptable probability of no toxicity are each specified. (b) Acceptable probabilities are not fixed at one value for each but instead allow for a trade-off between toxicity and response.
There are a number of practical problems with these designs. As with other designs relying on optimality criteria, they generally cannot be carried out faithfully in realistic settings. Even when they can be carried out, defining toxicity as a single yes–no variable is problematic, since typically several toxicities of various grades are of interest. Perhaps the most important issue is that of the response–toxicity trade-off. Any function specified is subjective and cannot be assumed to reflect the preferences of either investigators or patients in general.
IV. DISCUSSION

Despite the precise formulation of hypotheses and decision rules, phase II trials are not as objective as we would like. The small sample sizes used cannot support decision making based on all aspects of interest in a trial. Trials combining more than one aspect (such as toxicity and response) are fairly arbitrary with respect to the relative importance placed on each end point (including the zero weight placed on the end points not included), so they are subject to about as much imprecision in interpretation as results of single-end-point trials. Furthermore, a phase II trial would rarely be considered on its own. By the time a regimen is taken to phase III testing, multiple phase II trials have been done and the outcomes of the various trials weighed and discussed. Perhaps statistical considerations in a phase II design are most useful in keeping investigators realistic about how limited such studies are. For similar reasons, optimality considerations, both with respect to design and to confidence intervals, are not particularly compelling in phase II trials. Sample sizes in the typical clinical setting are small and variable, making it more important to use procedures that work reasonably well across a variety of circumstances rather than optimally in one. Also, there are various characteristics it would be useful to optimize; compromise is often in order. A final practical note—choices of null and alternative hypotheses in phase II trials are often made routinely, with little thought, but phase II experience should be reviewed occasionally. As definitions and treatments change, old historical probabilities do not remain applicable.
REFERENCES

1. Gehan EA. The determination of number of patients in a follow up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13:346–353.
2. Fleming TR. One sample multiple testing procedures for Phase II clinical trials. Biometrics 1982; 38:143–151.
3. Chang MN, Therneau TM, Wieand HS, Cha SS. Designs for group sequential Phase II clinical trials. Biometrics 1987; 43:865–874.
4. Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clin Trials 1989; 10:1–10.
5. Green SJ, Dahlberg S. Planned vs attained design in Phase II clinical trials. Stat Med 1992; 11:853–862.
6. Simon R. How large should a Phase II trial of a new drug be? Cancer Treatment Rep 1987; 71:1079–1085.
7. Chen T, Ng T-H. Optimal flexible designs in Phase II clinical trials. Stat Med 1998; 17:2301–2312.
8. Herndon J. A design alternative for two-stage, Phase II, multicenter cancer clinical trials. Controlled Clin Trials 1998; 19:440–450.
9. Storer B. A class of Phase II designs with three possible outcomes. Biometrics 1992; 48:55–60.
10. Jennison C, Turnbull BW. Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 1983; 25:49–58.
11. Chang MN, O'Brien PC. Confidence intervals following group sequential tests. Controlled Clin Trials 1986; 7:18–26.
12. Duffy DE, Santner TJ. Confidence intervals for a binomial parameter based on multistage tests. Biometrics 1987; 43:81–94.
13. Rosner G, Tsiatis AA. Exact confidence intervals following a group sequential trial: a comparison of methods. Biometrika 1988; 75:723–729.
14. Simon R, Wittes R, Ellenberg S. Randomized Phase II clinical trials. Cancer Treatment Rep 1985; 69:1375–1381.
15. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival endpoints. Biometrics 1993; 49:391–398.
16. Liu PY, LeBlanc M, Desai M. False positive rates of randomized Phase II designs. Controlled Clin Trials 1999; 20:343–352.
17. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage Phase II clinical trials. Biometrics 1995; 51:1372–1383.
18. Conaway M, Petroni G. Bivariate sequential designs for Phase II trials. Biometrics 1995; 51:656–664.
19. Conaway M, Petroni G. Designs for Phase II trials allowing for a trade-off between response and toxicity. Biometrics 1996; 52:1375–1386.
5
Designs Based on Toxicity and Response

Gina R. Petroni and Mark R. Conaway
University of Virginia, Charlottesville, Virginia
I. INTRODUCTION
In principle, phase II trials evaluate whether a new agent is sufficiently promising to warrant a comparison with the current standard of treatment. An agent is considered sufficiently promising based on the proportion of patients who "respond," that is, who experience some objective measure of disease improvement. The toxicity of the new agent, usually defined in terms of the proportion of patients experiencing severe side effects, has been established in a previous phase I trial. In practice, the separation between establishing the toxicity of a new agent in a phase I trial and establishing the response rate in a phase II trial is artificial. Most phase II trials are conducted not only to establish the response rate but also to gather additional information about the toxicity associated with the new agent. Conaway and Petroni (1) and Bryant and Day (2) cite several reasons why toxicity considerations are important in phase II trials:

1. Sample sizes in phase I trials. The number of patients in a phase I trial is small, and the toxicity profile of the new agent is estimated with little precision. As a result, there is a need to gather more information about toxicity rates before proceeding to a large comparative trial.

2. Ethical considerations. Most phase II trials are designed to terminate the study early if it does not appear that the new agent is sufficiently promising to warrant a comparative trial. These designs are meant to protect patients from receiving substandard therapy. Patients should also be protected from receiving agents with excessive rates of toxicity; consequently, phase II trials should be designed to allow early termination of the study if an excessive number of toxicities is observed. This consideration is particularly important in studies of intensive chemotherapy regimens, where it is hypothesized that a more intensive therapy induces a greater chance of a response but also a greater chance of toxicity.

3. The characteristics of the patients enrolled in the previous phase I trials may be different from those of the patients to be enrolled in the phase II trial. For example, phase I trials often enroll patients for whom all standard therapies have failed. These patients are likely to have a greater extent of disease than patients who will be accrued to the phase II trial.
With these considerations, several proposals have been made for designing phase II trials that formally incorporate both response and toxicity end points. Conaway and Petroni (1) and Bryant and Day (2) propose methods that extend the two-stage designs of Simon (3). In each of these methods, a new agent is considered sufficiently promising if it exhibits both a response rate that is greater than that of the standard therapy and a toxicity rate that does not exceed that of the standard therapy. Conaway and Petroni (4) consider a different criterion, based on a trade-off between response and toxicity rates. In these designs, a new agent with a greater toxicity rate might be considered sufficiently promising if it also has a much greater response rate than the standard therapy. Thall et al. (5,6) propose a Bayesian method for monitoring response and toxicity that can also incorporate a trade-off between response and toxicity rates.
II. DESIGNS FOR RESPONSE AND TOXICITY

Conaway and Petroni (1) and Bryant and Day (2) present multistage designs that formally monitor response and toxicity. As a motivation for the multistage designs, we first describe the methods for a fixed sample design, using the notation in Conaway and Petroni (1). In this setting, binary variables representing response and toxicity are observed in each of N patients. The data are summarized in a 2 × 2 table, where X_ij is the number of patients with response classification i and toxicity classification j (Table 1). The observed number of responses is X_R = X_11 + X_12 and the observed number of patients experiencing a severe toxicity is X_T = X_11 + X_21. It is assumed that the cell counts in this table, (X_11, X_12, X_21, X_22), have a multinomial distribution with underlying probabilities (p_11, p_12, p_21, p_22).
Table 1 Classification of Patients by Response and Toxicity

                      Toxicity
Response      Yes        No           Total
Yes           X_11       X_12         X_R
No            X_21       X_22         N − X_R
Total         X_T        N − X_T      N
That is, in the population of patients to be treated with this new agent, a proportion p_ij would have response classification i and toxicity classification j (Table 2).

Table 2 Population Proportions for Response and Toxicity Classifications

                      Toxicity
Response      Yes        No           Total
Yes           p_11       p_12         p_R
No            p_21       p_22         1 − p_R
Total         p_T        1 − p_T      1

With this notation, the probability of a response is p_R = p_11 + p_12 and the probability of a toxicity is p_T = p_11 + p_21. The design is based on having sufficient power to test the null hypothesis that the new treatment is "not sufficiently promising" to warrant further study against the alternative hypothesis that the new agent is sufficiently promising to warrant a comparative trial. Conaway and Petroni (1) and Bryant and Day (2) interpret the term "sufficiently promising" to mean that the new treatment has a greater response rate than the standard and that the toxicity rate with the new treatment is no greater than that of the standard treatment. Defining p_Ro as the response rate of the standard treatment and p_To as the toxicity rate of the standard treatment, the hypotheses can be written as

H_o: p_R ≤ p_Ro or p_T ≥ p_To
H_a: p_R > p_Ro and p_T < p_To

The null and alternative regions are displayed in Fig. 1.

Figure 1 Null and alternative regions.

A statistic for testing H_o versus H_a is (X_R, X_T), with a critical region of the form C = {(X_R, X_T): X_R ≥ c_R and X_T ≤ c_T}. We reject the null hypothesis and declare the treatment sufficiently promising if we observe many responses and few toxicities; we do not reject the null hypothesis if we observe too few responses or too many toxicities. Conaway and Petroni (1) choose the sample size, N, and critical values (c_R, c_T) to constrain three error probabilities to be no greater than prespecified levels α, γ, and β, respectively. The three error probabilities are:

1. The probability of incorrectly declaring the treatment promising when the response and toxicity rates for the new therapy are the same as those of the standard therapy.

2. The probability of incorrectly declaring the treatment promising when the response rate for the new therapy is no greater than that of the standard or the toxicity rate for the new therapy is greater than that of the standard therapy.

3. The probability of declaring the treatment not promising at a particular point in the alternative region, where the response rate is greater than that of the standard therapy and the toxicity rate is less. The design should yield sufficient power to reject the null hypothesis at this specific response and toxicity rate.

Mathematically, these error constraints are:

P(X_R ≥ c_R, X_T ≤ c_T | p_R = p_Ro, p_T = p_To, θ) ≤ α,   (1)

sup_{p_R ≤ p_Ro or p_T ≥ p_To} P(X_R ≥ c_R, X_T ≤ c_T | p_R, p_T, θ) ≤ γ,   (2)

P(X_R ≥ c_R, X_T ≤ c_T | p_R = p_Ra, p_T = p_Ta, θ) ≥ 1 − β,   (3)

where these probabilities are computed for a prespecified value of the odds ratio, θ = (p_11 p_22)/(p_12 p_21), in Table 2. The point (p_Ra, p_Ta) is a prespecified point in the alternative region, with p_Ra > p_Ro and p_Ta < p_To. Conaway and Petroni (1) compute the sample size and critical values by enumerating the distribution of (X_R, X_T) under particular values for (p_R, p_T, θ).

As an example, Conaway and Petroni (1) present a proposed phase II trial of high-dose chemotherapy for patients with non-Hodgkin's lymphoma. Results from earlier studies in this patient population indicated that standard therapy yields an estimated response rate of 50%, with approximately 30% of patients experiencing life-threatening toxicities. In addition, previous results indicated that approximately 35–40% of the patients who experienced a complete response also experienced life-threatening toxicities. The odds ratio, θ, is determined by the assumed response rate, toxicity rate, and the conditional probability of experiencing a life-threatening toxicity given that the patient had a complete response. Therefore, (p_Ro, p_To) is assumed to be (0.50, 0.30) and the odds ratio is assumed to be 2.0. Conaway and Petroni (1) chose the values α = 0.05, γ = 0.30, and β = 0.10. The trial is designed to have approximately 90% power at the alternative determined by (p_Ra, p_Ta) = (0.75, 0.15).

The extension to multistage designs is straightforward. The multistage designs allow for the early termination of a study if early results indicate that the treatment is not sufficiently effective or is too toxic. Although most phase II trials are carried out in at most two stages, for the general discussion Conaway and Petroni (1) assume that the study is to be carried out in K stages. At the end of the kth stage, a decision is made whether to enroll patients for the next stage or to stop the trial.
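The fixed-sample error probabilities (1)–(3) can be evaluated by direct enumeration of the multinomial distribution of the cell counts. The sketch below is illustrative rather than the authors' code: the function names are mine, and the cell probabilities are recovered from the margins and odds ratio via the usual Plackett-type quadratic root.

```python
import math

def cell_probs(p_r, p_t, theta):
    """Cell probabilities (p11, p12, p21, p22) of the 2 x 2 response-by-toxicity
    table with margins p_r, p_t and odds ratio theta = p11*p22 / (p12*p21)."""
    if abs(theta - 1.0) < 1e-12:
        p11 = p_r * p_t  # independence
    else:
        # valid root of (theta - 1)*p11^2 - b*p11 + theta*p_r*p_t = 0
        b = 1.0 + (theta - 1.0) * (p_r + p_t)
        disc = b * b - 4.0 * theta * (theta - 1.0) * p_r * p_t
        p11 = (b - math.sqrt(disc)) / (2.0 * (theta - 1.0))
    return p11, p_r - p11, p_t - p11, 1.0 - p_r - p_t + p11

def promising_prob(n, c_r, c_t, p_r, p_t, theta):
    """P(X_R >= c_r and X_T <= c_t) for N = n patients, by enumerating the
    multinomial distribution of the four cell counts (X11, X12, X21, X22)."""
    p11, p12, p21, p22 = cell_probs(p_r, p_t, theta)
    total = 0.0
    for x11 in range(n + 1):
        for x12 in range(n - x11 + 1):
            if x11 + x12 < c_r:          # too few responses: not in critical region
                continue
            for x21 in range(n - x11 - x12 + 1):
                if x11 + x21 > c_t:      # too many toxicities: not in critical region
                    continue
                x22 = n - x11 - x12 - x21
                coef = (math.factorial(n) //
                        (math.factorial(x11) * math.factorial(x12) *
                         math.factorial(x21) * math.factorial(x22)))
                total += coef * p11**x11 * p12**x12 * p21**x21 * p22**x22
    return total
```

Evaluating promising_prob at (p_Ro, p_To) = (0.50, 0.30) with θ = 2.0 gives the left side of constraint (1) for a candidate (N, c_R, c_T); searching over these values reproduces the kind of design calculation described above.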
If the trial is stopped early, the treatment is declared not sufficiently promising to warrant further study. At the end of the kth stage, the decision to continue or terminate the study is governed by the boundaries (c_Rk, c_Tk), k = 1, . . . , K. The study continues to the next stage if the total number of responses observed up to and including the kth stage is at least as great as c_Rk and the total number of toxicities up to and including the kth stage is no greater than c_Tk. At the final stage, the null hypothesis that the treatment is not sufficiently promising to warrant further study is rejected if there are a sufficient number of observed
responses (at least c_RK) and sufficiently few observed toxicities (no more than c_TK). In designing the study, the goal is to choose sample sizes for the stages, m_1, m_2, . . . , m_K, and boundaries (c_R1, c_T1), (c_R2, c_T2), . . . , (c_RK, c_TK), satisfying the error constraints listed above. For a fixed total sample size, N = Σ_k m_k, there may be many designs that satisfy the error requirements. An additional criterion, such as one of those proposed by Simon (3) in the context of two-stage trials with a single binary end point, can be used to select a design. The stage sample sizes and boundaries can be chosen to give the minimum expected sample size at the response and toxicity rates of the standard therapy (p_Ro, p_To) among all designs that satisfy the error requirements. Alternatively, one could choose the design that minimizes the maximum expected sample size over the entire null hypothesis region. Conaway and Petroni (1) compute the "optimal" designs under these criteria for two-stage and three-stage designs using a fixed prespecified value for the odds ratio, θ. Through simulations, they evaluate the sensitivity of the designs to a misspecification of the value of the odds ratio.

Bryant and Day (2) also consider the problem of monitoring binary end points representing response and toxicity. They present optimal designs for two-stage trials that extend the designs of Simon (3). In the first stage, N_1 patients are accrued and classified by response and toxicity; Y_R1 patients respond and Y_T1 patients do not experience toxicity. At the end of the first stage, a decision to continue to the next stage or to terminate the study is made according to the following rules, where N_1, C_R1, and C_T1 are parameters to be chosen as part of the design specification:

1. If Y_R1 ≤ C_R1 and Y_T1 > C_T1, terminate due to inadequate response.
2. If Y_R1 > C_R1 and Y_T1 ≤ C_T1, terminate due to excessive toxicity.
3. If Y_R1 ≤ C_R1 and Y_T1 ≤ C_T1, terminate due to both factors.
4. If Y_R1 > C_R1 and Y_T1 > C_T1, continue to the second stage.

In the second stage, N_2 − N_1 patients are accrued. At the end of this stage, the following rules, applied to the cumulative counts Y_R2 and Y_T2, govern the decision whether or not the new agent is sufficiently promising, where N_2, C_R2, and C_T2 are parameters to be determined by the design:

1. If Y_R2 ≤ C_R2 and Y_T2 > C_T2, "not promising" due to inadequate response.
2. If Y_R2 > C_R2 and Y_T2 ≤ C_T2, "not promising" due to excessive toxicity.
3. If Y_R2 ≤ C_R2 and Y_T2 ≤ C_T2, "not promising" due to both factors.
4. If Y_R2 > C_R2 and Y_T2 > C_T2, "sufficiently promising."
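These rules translate directly into code. In the sketch below the container and function names are mine, not the authors'; note that y_t counts patients without toxicity, so values above the boundary are favorable.

```python
from typing import NamedTuple

class BryantDayDesign(NamedTuple):
    """Illustrative container for the design parameters Q = (N1, N2, CR1, CR2, CT1, CT2)."""
    n1: int
    n2: int
    c_r1: int
    c_r2: int
    c_t1: int
    c_t2: int

def stage1_decision(d: BryantDayDesign, y_r1: int, y_t1: int) -> str:
    """First-stage rule: y_r1 responses and y_t1 non-toxicities among d.n1 patients."""
    if y_r1 > d.c_r1 and y_t1 > d.c_t1:
        return "continue"
    reasons = []
    if y_r1 <= d.c_r1:
        reasons.append("inadequate response")
    if y_t1 <= d.c_t1:
        reasons.append("excessive toxicity")
    return "terminate: " + " and ".join(reasons)

def stage2_decision(d: BryantDayDesign, y_r2: int, y_t2: int) -> str:
    """Final rule, applied to the cumulative counts among all d.n2 patients."""
    if y_r2 > d.c_r2 and y_t2 > d.c_t2:
        return "sufficiently promising"
    reasons = []
    if y_r2 <= d.c_r2:
        reasons.append("inadequate response")
    if y_t2 <= d.c_t2:
        reasons.append("excessive toxicity")
    return "not promising: " + " and ".join(reasons)
```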
The principle for choosing the stage sample sizes and stage boundaries is the same as in Conaway and Petroni (1). The design parameters are determined from prespecified error constraints. Although the methods differ in the particular
constraints considered, the motivation for these error constraints is the same. One would like to limit the probability of recommending a treatment that has an insufficient response rate or an excessive toxicity rate. Similarly, one would like to constrain the probability of failing to recommend a treatment that is superior to the standard treatment in terms of both response and toxicity rates. Finally, among all designs meeting the error criteria, the optimal design is the one that minimizes the average number of patients treated with an ineffective therapy. In choosing the design parameters, Q = (N_1, N_2, C_R1, C_R2, C_T1, C_T2), Bryant and Day (2) specify an acceptable (P_R1) and an unacceptable (P_R0) response rate, along with an acceptable (P_T1) and an unacceptable (P_T0) rate of nontoxicity. Under any of the four combinations of acceptable or unacceptable rates of response and nontoxicity, Bryant and Day (2) assume that the association between response and toxicity is constant. The association between response and toxicity is determined by the odds ratio, ϕ, in the 2 × 2 table cross-classifying response and toxicity:

ϕ = [P(No Response, Toxicity) × P(Response, No Toxicity)] / [P(No Response, No Toxicity) × P(Response, Toxicity)]

Bryant and Day (2) parameterize the odds ratio in terms of response and no toxicity, so ϕ corresponds to 1/θ in the notation of Conaway and Petroni (1). For a design, Q, and an odds ratio, ϕ, let α_ij(Q, ϕ) be the probability of recommending the treatment, given that the true response rate equals P_Ri and the true nontoxicity rate equals P_Tj, i = 0, 1; j = 0, 1. Constraining the probability of recommending a treatment with an insufficient response rate leads to α_01(Q, ϕ) ≤ α_R, where α_R is a prespecified constant. Constraining the probability of recommending a treatment with excessive toxicity leads to α_10(Q, ϕ) ≤ α_T, and ensuring a sufficiently high probability of recommending a truly superior treatment requires α_11(Q, ϕ) ≥ 1 − β, where α_T and β are prespecified constants. Bryant and Day (2) note that α_00(Q, ϕ) is less than either α_01(Q, ϕ) or α_10(Q, ϕ), so that an upper bound on α_00(Q, ϕ) is implicit in these constraints. There can be many designs that meet these specifications. Among these designs, Bryant and Day (2) define the optimal design to be the one that minimizes the expected number of patients in a study of a treatment with an unacceptable response or toxicity rate. Specifically, Bryant and Day (2) choose the design, Q, that minimizes the maximum of E_01(Q, ϕ) and E_10(Q, ϕ), where E_ij is the expected number of patients accrued when the true response rate equals P_Ri and the true nontoxicity rate equals P_Tj, i = 0, 1; j = 0, 1. The expected value E_00(Q, ϕ) does not play a role in the calculation of the optimal design because it is less than both E_01(Q, ϕ) and E_10(Q, ϕ). The stage sample sizes and boundaries of the optimal design depend on the value of the nuisance parameter, ϕ. For an unspecified odds ratio, among all
designs that meet the error constraints, the optimal design minimizes the maximum expected patient accrual under a treatment with an unacceptable response or toxicity rate, max_ϕ {max(E_01(Q, ϕ), E_10(Q, ϕ))}. Assumptions about a fixed value of the odds ratio lead to a simpler computational problem; this is particularly true if response and toxicity are assumed to be independent (ϕ = 1). Bryant and Day (2) provide bounds indicating that the characteristics of the optimal design for an unspecified odds ratio do not differ greatly from those of the optimal design found by assuming that response and toxicity are independent. By considering a number of examples, Conaway and Petroni (1) came to a similar conclusion: their designs are computed under a fixed value of the odds ratio, but different values of the assumed odds ratio led to similar designs.
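As an illustration of these expected-accrual calculations: under the independence assumption (ϕ = 1) the probability of continuing past the first stage factors into a product of binomial tails, giving a closed form for the expected number of patients. The function names below are mine, and this is a sketch under that simplifying assumption, not the authors' exact E_ij computation.

```python
import math

def binom_gt(n, k, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1, n + 1))

def expected_accrual(n1, n2, c_r1, c_t1, p_resp, p_nontox):
    """Expected accrual for a two-stage design of the Bryant-Day form when
    response and nontoxicity are independent: the second stage (n2 - n1 more
    patients) is reached only if Y_R1 > c_r1 and Y_T1 > c_t1."""
    p_continue = binom_gt(n1, c_r1, p_resp) * binom_gt(n1, c_t1, p_nontox)
    return n1 + (n2 - n1) * p_continue
```

Minimizing the larger of the two expected accruals over candidate parameter values that satisfy the error constraints is then a direct search.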
III. DESIGNS THAT ALLOW A TRADE-OFF BETWEEN RESPONSE AND TOXICITY

The designs for response and toxicity proposed by Conaway and Petroni (1) and Bryant and Day (2) share a number of common features, including the form of the alternative region. In these designs, a new treatment must show evidence of both a greater response rate and a lesser toxicity rate than the standard treatment. In practice, a trade-off could be considered in the design, since one may be willing to allow greater toxicity in exchange for a greater response rate, or to accept a slightly lower response rate if lower toxicity can be obtained. Conaway and Petroni (4) propose two-stage designs for phase II trials that allow early termination of the study if the new therapy is not sufficiently promising and that allow a trade-off between response and toxicity. The hypotheses are the same as those considered for the bivariate designs of the previous section. The null hypothesis is that the new treatment is not sufficiently promising to warrant further study, due either to an insufficient response rate or to excessive toxicity. The alternative hypothesis is that the new treatment is sufficiently effective and safe to warrant further study. The terms "sufficiently safe" and "sufficiently effective" are relative to the response rate, p_Ro, and the toxicity rate, p_To, of the standard treatment. One of the primary issues in the design is how to elicit the trade-off specification. Ideally, the trade-off between safety and efficacy would be summarized by a function of the toxicity and response rates that defines a treatment as "worthy of further study." In practice this can be difficult to elicit. A simpler method for obtaining the trade-off information is for the investigator to specify the maximum toxicity rate, p_T,max, that would be acceptable if the new treatment were to produce responses in all patients.
Similarly, the investigator would be asked to specify the minimum response rate, p_R,min, that would be acceptable if the treatment produced no toxicities. Figure 2 illustrates the sets of values for the true response rate (p_R) and true toxicity rate (p_T) that satisfy the null and alternative hypotheses. The values chosen for Fig. 2 are p_Ro = 0.5, p_To = 0.2, p_R,min = 0.4, and p_T,max = 0.7.

Figure 2 Null and alternative regions for trade-offs.

The line connecting the points (p_Ro, p_To) and (1, p_T,max) is given by the equation p_T = p_To + tan(ψ_T)(p_R − p_Ro), where tan(ψ_T) = (p_T,max − p_To)/(1 − p_Ro). Similarly, the line connecting (p_Ro, p_To) and (p_R,min, 0) is given by the equation p_T = p_To + tan(ψ_R)(p_R − p_Ro), where tan(ψ_R) = p_To/(p_Ro − p_R,min). With ψ_T ≤ ψ_R, the null hypothesis is

H_o: p_T ≥ p_To + tan(ψ_T)(p_R − p_Ro) or p_T ≥ p_To + tan(ψ_R)(p_R − p_Ro)

and the alternative hypothesis is

H_a: p_T < p_To + tan(ψ_T)(p_R − p_Ro) and p_T < p_To + tan(ψ_R)(p_R − p_Ro)

The forms of the null and alternative hypotheses are different for the case ψ_T ≥ ψ_R, although the basic principles in constructing the design and specifying the trade-off information remain the same (cf. Conaway and Petroni [4]). Special cases of these hypotheses have been used previously: ψ_T = 0 and ψ_R = π/2 yield the critical regions of Conaway and Petroni (1) and Bryant and Day (2); ψ_R = ψ_T = 0 yields hypotheses in terms of toxicity alone; and ψ_R = ψ_T = π/2 yields hypotheses in terms of response alone.

To describe the trade-off designs for a fixed sample size, we use the notation and assumptions of the fixed sample size design described in Section II. As in their earlier work, Conaway and Petroni (4) determine the sample size and critical values under an assumed value of the odds ratio between response and toxicity. The sample size calculations require specification of a type I error level, α, and power, 1 − β, at a particular point p_R = p_Ra and p_T = p_Ta. The point (p_Ra, p_Ta) satisfies the constraints defining the alternative hypothesis and represents the response and toxicity rates of a treatment considered superior to the standard treatment. The test statistic, denoted T(p), where p = (1/N)(X_11, X_12, X_21, X_22) is the vector of sample proportions in the four cells of Table 1, is based on computing an "I-divergence measure" (cf. Robertson et al. [7]). The test statistic has the intuitively appealing property of being roughly analogous to a "distance" from p to the region H_o. Rejection of the null hypothesis results when the observed value of T(p) is "too far" from the null hypothesis region: a vector of observed proportions p leads to rejection of the null hypothesis if T(p) ≥ c. For an appropriate choice of sample size (N), significance level (α), and power (1 − β), the value c can be chosen to constrain the probability of recommending a treatment that has an insufficient response rate relative to its toxicity rate and to ensure a high probability of recommending a treatment with response rate p_Ra and toxicity rate p_Ta.
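As a quick check of this geometry, null-region membership can be coded directly. The sketch below (function names are mine) covers the case ψ_T ≤ ψ_R and uses the Fig. 2 values in its example.

```python
def tan_psi_t(p_ro, p_to, p_t_max):
    """Slope of the boundary line through (p_Ro, p_To) and (1, p_T,max)."""
    return (p_t_max - p_to) / (1.0 - p_ro)

def tan_psi_r(p_ro, p_to, p_r_min):
    """Slope of the boundary line through (p_Ro, p_To) and (p_R,min, 0)."""
    return p_to / (p_ro - p_r_min)

def in_null_region(p_r, p_t, p_ro, p_to, p_r_min, p_t_max):
    """H_o of the trade-off design (case psi_T <= psi_R): the true rates
    (p_r, p_t) lie on or above either boundary line."""
    slope_t = tan_psi_t(p_ro, p_to, p_t_max)
    slope_r = tan_psi_r(p_ro, p_to, p_r_min)
    return (p_t >= p_to + slope_t * (p_r - p_ro)
            or p_t >= p_to + slope_r * (p_r - p_ro))
```

For example, with the Fig. 2 values, the standard rates (0.5, 0.2) lie on the boundary (null), while the design alternative (0.75, 0.15) falls in the alternative region.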
The critical value c is chosen to meet the error criteria

sup_{(p_R, p_T) ∈ H_o} P(T(p) ≥ c | p_R, p_T, θ) ≤ α

and

P(T(p) ≥ c | p_Ra, p_Ta, θ) ≥ 1 − β.

These probabilities are computed for a fixed value of the odds ratio, θ, by enumerating the value of T(p) over all possible realizations of the multinomial vector (X_11, X_12, X_21, X_22). The trade-off designs can be extended to two-stage designs that allow early termination of the study if the new treatment does not appear to be sufficiently promising. In designing the study, the goal is to choose the stage sample sizes (m_1, m_2) and decision boundaries (c_1, c_2) to satisfy error probability constraints similar to those of the fixed sample size trade-off design.
sup_{(p_R, p_T) ∈ H_o} P(T_1(p_1) ≥ c_1, T_2(p_1, p_2) ≥ c_2 | p_R, p_T, θ) ≤ α

and

P(T_1(p_1) ≥ c_1, T_2(p_1, p_2) ≥ c_2 | p_Ra, p_Ta, θ) ≥ 1 − β,

where T_1 is the test statistic computed on the stage 1 observations and T_2 is the test statistic computed on the accumulated data in stages 1 and 2. As in the fixed sample size design, these probabilities are computed for a fixed value of the odds ratio and are found by enumerating all possible outcomes of the trial. In cases where many designs meet the error requirements, an optimal design is found according to the criteria in Bryant and Day (2) and Simon (3): among all designs that meet the error constraints, the chosen design minimizes the maximum expected sample size under the null hypothesis. Through simulations, Conaway and Petroni (4) investigate the effect of fixing the odds ratio on the choice of the optimal design. They conclude that unless the odds ratio is badly misspecified, the choice of the odds ratio has little effect on the properties of the optimal design.

The critical values for the test statistic are much harder to interpret than the critical values in Conaway and Petroni (1) or Bryant and Day (2), which are counts of the number of observed responses and toxicities. We recommend two plots, similar to Figs. 2 and 3 in Conaway and Petroni (4), to illustrate the characteristics of the trade-off designs. The first is a display of the power of the test, so that the investigators can see the probability of recommending a treatment with true response rate p_R and true toxicity rate p_T. The second plot displays the rejection region, so that the investigators can see the decision about the treatment that will be made for specific numbers of observed responses and toxicities. With these plots, the investigators can better understand the implications of the trade-off being proposed.
The trade-off designs of Conaway and Petroni (4) were motivated by the idea that a new treatment could be considered acceptable even if its toxicity rate is greater than that of the standard treatment, provided the improvement in response rate is sufficiently large. This idea also motivated the Bayesian monitoring method of Thall et al. (5,6). They note, for example, that a treatment that improves the response rate by 15 percentage points might be considered promising even if its toxicity rate is 5 percentage points greater than that of the standard therapy; if, however, the new therapy increases the toxicity rate by 10 percentage points, it might not be considered an acceptable therapy. Thall et al. (5,6) outline a strategy for monitoring each end point in the trial. They define, for each end point in the trial, a monitoring boundary based on prespecified targets for an improvement in efficacy and an unacceptable increase in the rate of adverse events. In the example given above for a trial
with a single response end point and a single toxicity end point, the targeted improvement in response rate is 15% and the allowance for increased toxicity is 5%. Thall et al. (5,6) take a Bayesian approach that allows for monitoring each end point on a patient-by-patient basis. Although their methods allow for a number of efficacy and adverse-event end points, we simplify the discussion by considering only a single efficacy end point (response) and a single adverse-event end point (toxicity). Before the trial begins, they elicit a prior distribution on the cell probabilities in Table 2. Under the standard therapy, the cell probabilities are denoted P_S = (p_S11, p_S12, p_S21, p_S22); under the new experimental therapy, the cell probabilities are denoted P_E = (p_E11, p_E12, p_E21, p_E22). Putting a prior distribution on the cell probabilities (p_G11, p_G12, p_G21, p_G22) induces a prior distribution on p_GR = p_G11 + p_G12 and on p_GT = p_G11 + p_G21, where G stands for either S or E. A Dirichlet prior for the cell probabilities is particularly convenient in this setting, since it induces a beta prior on p_GR and p_GT, for G = S or E. In addition to the prior distribution, Thall et al. (5,6) specify a target improvement, δ(R), for response, and a maximum allowable difference, δ(T), for toxicity. The monitoring of the end points begins after a minimum number of patients, m, have been observed. It continues until either a maximum number of patients, M, have been accrued or a monitoring boundary has been crossed. In a typical phase II trial, in which only the new therapy is used, the distribution on the probabilities under the standard therapy remains constant throughout the trial, whereas the distribution on the probabilities under the new therapy is updated each time a patient's outcomes are observed. After the response and toxicity classifications of j patients, X_j, have been observed, there are several possible decisions one could make.
If there is strong evidence that the new therapy does not meet the targeted improvement in response rate, the trial should be stopped and the new treatment declared "not sufficiently promising." Alternatively, if there is strong evidence that the new treatment is superior to the standard treatment in terms of the targeted improvement for response, the trial should be stopped and the treatment declared "sufficiently promising." In terms of toxicity, the trial should be stopped if there is strong evidence of an excessive toxicity rate with the new treatment. Thall et al. (5,6) translate these rules into statements about the updated (posterior) distribution [p_E | X_j] and the prior distribution of p_S, using prespecified cutoffs for what constitutes "strong evidence." For m ≤ j ≤ M, the monitoring boundaries are

P[p_ER − p_SR > δ(R) | X_j] ≤ p_L(R)
P[p_ER > p_SR | X_j] ≥ p_U(R)
P[p_ET − p_ST > δ(T) | X_j] ≥ p_U(T)
where p_L(R), p_U(R), and p_U(T) are prespecified probability levels. Numerical integration is required to compute these probabilities, but the choice of the Dirichlet prior makes the computations relatively easy.
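A minimal Monte Carlo sketch of one of these boundary probabilities follows; the function name, prior parameters, and the use of simulation in place of the numerical integration mentioned above are my choices. Because a Dirichlet prior on the cells induces beta margins, the response margin can be handled with beta draws.

```python
import random

def prob_exceeds_by_delta(a_e, b_e, responses, n, a_s, b_s, delta,
                          draws=20000, seed=1):
    """Monte Carlo estimate of P[p_ER - p_SR > delta | X_j] under independent
    beta distributions: a fixed Beta(a_s, b_s) for the standard arm and the
    posterior Beta(a_e + responses, b_e + n - responses) for the new arm."""
    rng = random.Random(seed)
    a_post = a_e + responses
    b_post = b_e + n - responses
    hits = 0
    for _ in range(draws):
        p_e = rng.betavariate(a_post, b_post)
        p_s = rng.betavariate(a_s, b_s)
        if p_e - p_s > delta:
            hits += 1
    return hits / draws
```

Comparing this posterior probability with the cutoffs p_L(R) or p_U(R) after each patient gives the monitoring rule; the toxicity boundary is handled the same way with δ(T) and p_U(T).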
IV. SUMMARY

All of the methods discussed in this chapter have advantages for monitoring toxicity in phase II trials. None of the methods relies on asymptotic approximations for distributions, so all are well suited to the small sample sizes typically encountered in phase II trials. The bivariate designs of Conaway and Petroni (1) and Bryant and Day (2) have critical values that are based on the observed number of responses and the observed number of toxicities; these statistics are easily calculated and interpreted by the investigators. Although there is no formal trade-off discussion in these articles, the general methods can be adapted to the kind of trade-off discussed in Thall et al. (5,6). To do this, one needs to modify the hypotheses to be tested. For example, the null and alternative hypotheses could be changed to

H_o: p_R ≤ p_Ro + δ_R or p_T ≥ p_To + δ_T
H_a: p_R > p_Ro + δ_R and p_T < p_To + δ_T

for some prespecified δ_R and δ_T.

The trade-off designs of Conaway and Petroni (4) use a trade-off strategy that permits the allowable level of toxicity to increase with the response rate. In contrast, in the trade-off example of Thall et al. (5,6), a 5% increase in toxicity would be considered acceptable for a treatment with a 15% increase in response. Because the allowance for toxicity is prespecified, only a 5% increase in toxicity is allowable even if the improvement in response rate with the new treatment is as much as 30%. With the trade-off of Conaway and Petroni (4), the standard for "allowable toxicity" is greater for a treatment with a 30% improvement than for one with a 15% improvement. The methods of Thall et al. (5,6) have the advantage of being able to monitor outcomes on a patient-by-patient basis. At each monitoring point, the method can provide graphical representations of the probability associated with each of the decision rules.
REFERENCES

1. Conaway MR, Petroni GR. Bivariate sequential designs for phase II trials. Biometrics 1995; 51:656–664.
2. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics 1995; 51:1372–1383.
3. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clin Trials 1989; 10:1–10.
4. Conaway MR, Petroni GR. Designs for phase II trials allowing for a trade-off between response and toxicity. Biometrics 1996; 52:1375–1386.
5. Thall PF, Simon RM, Estey EH. Bayesian sequential monitoring designs for single-arm clinical trials with multiple outcomes. Stat Med 1995; 14:357–379.
6. Thall PF, Simon RM, Estey EH. New statistical strategy for monitoring safety and efficacy in single-arm clinical trials. J Clin Oncol 1996; 14:296–303.
7. Robertson T, Dykstra RL, Wright FT. Order Restricted Statistical Inference. Chichester: John Wiley and Sons, Ltd., 1988.
6
Phase II Selection Designs

P. Y. Liu
Fred Hutchinson Cancer Research Center, Seattle, Washington
I. BASIC CONCEPT
When there are multiple promising new therapies in a disease setting, it may not be possible to test all of them against the standard treatment in a definitive phase III trial. The sample sizes required for a phase III study with more than three arms could be prohibitive (1). In addition, the analysis can be highly complex and prone to errors due to the large number of possible comparisons in a multiarm study (see Chap. 4). An alternative strategy is to screen the new therapies first and choose one to test against a standard treatment in a simple two-arm phase III trial. Selection designs can be used in such circumstances. Simon et al. (2) first introduced statistical methods for ranking and selection to the oncology literature. In a selection design, patients are randomized to treatments involving new combinations or schedules of known active agents, or to new agents for which activity against the disease in question has already been demonstrated in some limited setting. In other words, the regimens under testing have already shown promise; the aim now is to narrow down the choice for formal comparison with the standard therapy. In this approach, one always selects the observed best treatment for further study, however small its advantage over the others may appear to be. Hypothesis tests are not performed. Sample sizes are established so that if a treatment exists whose underlying efficacy is superior to the others by a specified amount, it will be selected with high probability. The required sample sizes are usually similar to those associated with pilot phase II trials.
Although the statistical principles of the selection design are simple, its application can be rather slippery. The major abuse of the design, falsely justified by the randomized treatment assignment, is to treat the observed ranking as conclusive and forego the subsequent phase III testing. This practice is especially dangerous when a standard arm is included as the basis for comparison, or when all treatment arms are experimental and a standard treatment does not exist for the particular disease. A "treatment of choice" is often erroneously concluded in such situations. It is of vital importance to emphasize up front that a selection design serves merely as a precursor to the requisite definitive phase III comparison. Because of the design's moderate sample sizes and lack of control for false-positive and false-negative findings, the observed best treatment could be truly superior, or it could appear so simply by chance with no real advantage over the other treatments. The approach presumes the subsequent conduct of definitive phase III trials and makes no attempt, and therefore has no power, to distinguish the former from the latter at the selection step. Results from selection trials are error prone when treated as ends in themselves. Yet one treatment can look substantially better than the others, and there is often great temptation to treat the unproved results as final (3). Therefore, unless the follow-on phase III study is ensured by some external mechanism such as government regulations for new drug approval, selection designs can do more harm than good through their propensity for being misused. The false-positive rate of this misapplication is discussed in more detail in Section IV.
II. SAMPLE SIZE REQUIREMENTS

A. Binary Outcomes
Table 1 shows sample size requirements for binary outcomes with K = 2, 3, and 4 groups from Simon et al. (2). With the listed N per group and true response rates, the correct selection probability should be approximately 0.90. The sample sizes were presumably derived by normal approximations to binomial distributions. A check by exact probabilities indicates that the actual correct selection probability ranges from 0.89 down to 0.86 when N is small. Increasing the sample size per group by five raises the correct selection probability to 0.90 in all cases and may be worth considering when N is less than 30. Except in extreme cases, Table 1 indicates that the sample size is relatively insensitive to the baseline response rates (i.e., the response rates of groups 1 through K − 1). Since precise knowledge of the baseline rates is often not available, a common conservative approach is to always use the largest sample size for each K, that is, 37, 55, and 67 patients per group for K = 2, 3, and 4, respectively. Although a total N of 74 for two groups is in line with large phase II studies, the total number of patients required for four groups, that is, close to 270, could render the design impractical for many applications.
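The exact-probability check described above can be approximated with a short simulation. The sketch below (Python with NumPy) estimates the correct selection probability for one Table 1 configuration; the random tie-breaking rule is an assumption, since ties could be resolved in other ways.

```python
import numpy as np

def selection_probability(p, n, reps=200_000, seed=1):
    """Estimate P(the truly best arm has the highest observed response
    count), with ties broken uniformly at random (an assumed rule)."""
    rng = np.random.default_rng(seed)
    K = len(p)
    counts = rng.binomial(n, p, size=(reps, K))  # responses per arm
    best = counts.max(axis=1, keepdims=True)
    tied = counts == best                        # arms sharing the top count
    # chance that the truly best arm (listed last) wins the tie-break
    wins = tied[:, -1] / tied.sum(axis=1)
    return wins.mean()

# Table 1 row: response rates 40% vs. 55%, K = 2, N = 37 per group
print(round(selection_probability([0.40, 0.55], 37), 2))  # close to 0.90
```

Changing the rates and N reproduces the other rows of Table 1 to simulation accuracy.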
Table 1 Sample Size per Treatment for Binary Outcomes and 0.90 Correct Selection Probability

Response rates                    N per group
P1, . . . , P(K−1)     PK       K = 2   K = 3   K = 4
10%                    25%        21      31      37
20%                    35%        29      44      52
30%                    45%        35      52      62
40%                    55%        37      55      67
50%                    65%        36      54      65
60%                    75%        32      49      59
70%                    85%        26      39      47
80%                    95%        16      24      29

From Ref. 2.
B. Survival Outcomes

For censored survival data, Liu et al. (4) suggested fitting the Cox proportional hazards model, h(t, z) = h0(t)exp(β′z), to the data, where z is the (K − 1)-dimensional vector of treatment group indicators and β = (β1, . . . , βK−1) is the vector of log hazard ratios. We proposed selecting the treatment with the smallest β̂i (where β̂K ≡ 0) for further testing. Sample sizes for 0.90 correct selection probability were calculated based on the asymptotic normality of the β̂. The requirements for exponential survival and uniform censoring are reproduced in Table 2. Simulation studies of robustness to the proportional hazards assumption found the correct selection probabilities to be above 0.80 for moderate departures from the assumption. As with binary outcomes, the design becomes less practical when the hazard ratio between the worst and the best groups is smaller than 1.5 or when there are more than three groups. Table 2 covers scenarios where the patient enrollment period is similar to the median survival of the worst groups; it does not encompass situations where the two are different. Since the effective sample size for exponential survival distributions is the number of uncensored observations, the expected number of events is the same across the different rows of Table 2. For 0.90 correct selection probability, Table 3 gives the approximate number of events needed per group for the worst groups. With ∫I dF as the proportion of censored observations, where I and F are the respective cumulative distribution functions of the censoring and survival times, some readers may find the expected event count more flexible for planning purposes.
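The event-count planning numbers in Table 3 can be checked by simulation. The sketch below (Python/NumPy) uses a deliberately simplified setting that I am assuming for illustration: fully uncensored exponential data, so each group's event count equals its size and the hazard estimate is events divided by total time on test; the group with the smallest estimated hazard is selected.

```python
import numpy as np

def correct_selection_prob(hazards, events, reps=100_000, seed=2):
    """Simulate uncensored exponential survival: each group contributes
    `events` failures, hazards are estimated as events / total time on
    test, and the smallest estimate is selected."""
    rng = np.random.default_rng(seed)
    K = len(hazards)
    # total time on test per group ~ Gamma(events, scale=1/hazard)
    T = rng.gamma(shape=events, scale=1.0 / np.array(hazards), size=(reps, K))
    est = events / T                      # estimated hazard per group
    return (est.argmin(axis=1) == int(np.argmin(hazards))).mean()

# Table 3 scenario: K = 2, hazard ratio 1.5, 24 events per group
print(round(correct_selection_prob([1.5, 1.0], 24), 2))  # roughly 0.9
```

The result lands near the 0.90 target, consistent with the approximate event counts in Table 3.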
Table 2 Sample Size per Treatment for Exponential Survival Outcomes with 1-Year Accrual and 0.90 Correct Selection Probability

                          K = 2               K = 3               K = 4
Median†  Follow‡    1.3*  1.4*  1.5*    1.3*  1.4*  1.5*    1.3*  1.4*  1.5*
0.5      0           115    72    51     171   107    76     206   128    91
         0.5          71    44    31     106    66    46     127    79    56
         1            59    36    26      88    54    38     106    65    46
0.75     0           153    96    69     229   143   102     275   172   122
         0.5          89    56    40     133    83    59     160   100    70
         1            70    44    31     104    65    46     125    78    55
1        0           192   121    87     287   180   128     345   216   153
         0.5         108    68    48     161   101    72     194   121    86
         1            82    51    36     122    76    54     147    92    65

* Hazard ratio of groups 1 through K − 1 vs. group K.
† Median survival in years for groups 1 through K − 1.
‡ Additional follow-up in years after accrual completion.
From Ref. 4.
Table 3 Event Count per Group for the Worst Groups for Exponential Survival and 0.90 Correct Selection Probability

          HR*
K       1.3    1.4    1.5
2        54     34     24
3        80     50     36
4        96     60     43

* Hazard ratio of groups 1 through K − 1 vs. group K.
III. VARIATIONS OF THE DESIGN

A. Designs with Toxicity Acceptance Criteria
Toxicity and side effects are often major concerns for cancer treatments. If the toxicity profiles of the treatments under evaluation are not well known, formal acceptance criteria should be established for them. Selection then takes place only among the treatments with acceptable toxicity. Treatments could even be stopped early to guard against excessive toxicity. As an example, the
Southwest Oncology Group study S8835 investigated mitoxantrone or floxuridine administered intraperitoneally in patients with minimal residual ovarian cancer after second-look laparotomy (5). The study was designed with 37 patients per arm; the treatment with the higher percentage of patients free of disease progression or relapse at 1 year would be selected for further evaluation. However, patient accrual to either treatment could be stopped early if unacceptable toxicity was observed. Forty percent or more of the patients not tolerating the initial dose for at least two courses of treatment was considered unacceptable. A treatment would be dropped if 13 or more of the first 20 patients on that arm could not tolerate at least two courses of treatment at the starting dose, since the probability of observing 13 or more such patients out of 20 is only 0.02 if the true proportion is 40%.

B. Designs with Minimum Activity Requirements

Though the selection design is most appropriate when acceptable levels of treatment efficacy are no longer in question, the idea of selection is sometimes applied to randomized phase II trials when anticancer activity has not been previously established for the treatments involved. Alternatively, the side effects of the treatments could be sufficiently severe that a certain activity level must be met to justify the therapy. In such cases, each treatment arm is designed as a stand-alone phase II trial with the same acceptance criteria for all arms. When more than one treatment arm is accepted, the observed best arm is selected for further study. Statistical properties of this approach have not been formally quantified. However, designing the study with the larger of the sample size required for a standard phase II trial and that for a selection phase II trial would generally give a reasonable result.
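Both the toxicity stopping rule above and the single-stage activity criterion discussed next rest on exact binomial tail probabilities, which a minimal Python sketch can reproduce:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Early stopping: chance that >= 13 of the first 20 patients are intolerant
# when the true intolerance rate is 40%
print(round(binom_tail(13, 20, 0.40), 3))   # 0.021

# Single-stage activity rule (>= 14 responses of 44): type I error at a 20%
# response rate and power at a 40% response rate
print(round(binom_tail(14, 44, 0.20), 2))   # about 0.04
print(round(binom_tail(14, 44, 0.40), 2))   # about 0.90
```

The three printed values match the 0.02, 0.04, and 0.90 figures quoted in this section.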
For example, with binary data, if a 20% response rate would not justify further investigation of a treatment regimen whereas a 40% response rate definitely would, a standard phase II design based on the binomial distribution requires 44 patients in a single-stage study. Fourteen or more responses out of 44 would represent sufficient activity for continued pursuit (6). This design has a type I error of 0.04 (the chance of observing ≥14 responses out of 44 when the true response rate is 20%) and a power of 0.90 (the chance of observing ≥14 responses out of 44 when the true response rate is 40%). On the other hand, for a selection design to have 0.90 correct selection probability when the response rate of one treatment is higher than the others by an absolute 15%, a sample size of 37 or 55 per group is required for K = 2 or 3, respectively, per Table 1. Therefore, one would design the study with 44 patients per arm when K = 2 and 55 patients per arm when K = 3. The acceptance criterion for N = 55 is adjusted to 17 or more responses. When activity levels are low for all treatments, the minimum response requirement would dominate and reject all treatments. When more than one treatment has an acceptable response level and one treatment is particularly superior, the combined minimum
activity/selection criteria would also result in the correct selection with high probability. Limited simulations were conducted for K = 2, N = 44 and K = 3, N = 55, with the minimum acceptance criteria stated above and one superior arm whose response rate is higher than the rest by 15%. The results indicate that the probability of the best arm meeting the minimum requirement and being selected is close to 0.20 when the true best response rate (pK) is 25%. The correct selection probability is in the low to mid 0.70 range when pK = 35% and approximately 0.90 when pK is 45% or higher. Alternatively, for outcomes available in a short time, the trial could proceed in stages. The larger of the two-stage design sample size at full accrual and the selection design sample size should still be used. The sample sizes at the different stages and the acceptance criteria could be adapted if needed (6). If more than one treatment is accepted at the completion of the second stage, the observed best treatment would be selected for further study. Some authors suggest designing the activity screening with standard two-stage design sample sizes, with more patients enrolled for selection purposes only when more than one treatment is accepted. This approach is not recommended because the decision to enroll more patients (or not) could be influenced by observed outcome data.
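The limited simulations reported above can be reproduced in outline. The sketch below (Python/NumPy) takes K = 2, N = 44, and the acceptance cutoff of 14 responses from the example; counting ties as incorrect selections is an assumption, since the chapter does not state a tie-breaking rule.

```python
import numpy as np

def combined_design_prob(p_other, p_best, n, cutoff, reps=100_000, seed=3):
    """P(best arm clears the activity cutoff AND beats every accepted
    competitor). Ties are conservatively counted as failures (assumption)."""
    rng = np.random.default_rng(seed)
    x_other = rng.binomial(n, p_other, size=reps)
    x_best = rng.binomial(n, p_best, size=reps)
    accepted_best = x_best >= cutoff
    # correct selection: best arm accepted, and either strictly ahead of
    # the other arm or the other arm failed the minimum activity cutoff
    correct = accepted_best & ((x_best > x_other) | (x_other < cutoff))
    return correct.mean()

# p1 = 30%, p2 = 45%, N = 44 per arm, accept with >= 14 responses
print(round(combined_design_prob(0.30, 0.45, 44, 14), 2))  # about 0.9
```

The estimate is close to the "approximately 0.90 when pK is 45% or higher" figure quoted above.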
C. Designs for Ordered Treatments
When the K (≥3) treatments under consideration consist of increasing dose schedules of the same agents, the design can take advantage of this inherent order. A simple method is to fit regression models to the outcomes with treatment groups coded in an ordered manner according to the increasing dose levels. Logistic regression for binary outcomes and the Cox model for survival are obvious choices. A single independent variable with equally spaced scores for the treatments would be included in the regression (e.g., 1, 2, and 3 for three groups). If the sign of the observed slope is in the expected direction, the highest dose with acceptable toxicity is selected for further study. Otherwise, the lowest dose schedule would be selected. Compared with the nonordered design, this approach should require smaller sample sizes for the same correct selection probability. Limited simulations were conducted, with the following results. For binary data with K = 3, p1 = 40%, p3 = 55%, and p1 ≤ p2 ≤ p3, approximately N = 35 per arm is needed for a 0.90 chance that the slope from the logistic regression is positive. Compared with the N = 55 given in Table 1, this is a substantial reduction in sample size. Similarly, for K = 4, p1 = 40%, p4 = 55%, and p1 ≤ p2 ≤ p3 ≤ p4, 40 patients per arm are needed instead of 67. For exponential survival data with a 1.5 hazard ratio between the worst groups and the best group, approximately 28 and 32 events per group are needed for the worst groups for K = 3 and 4, respectively, as compared with 34 and 41 given in Table 3.
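The slope-sign simulation can be sketched without iterative model fitting: for a single-covariate logistic regression with an intercept, the fitted slope is positive exactly when the score Σ(xi − x̄)(yi − ȳ) is positive, because the log-likelihood is concave in the slope (assuming the MLE exists). The middle response rate of 47.5% below is an assumed value, since the chapter specifies only p1 ≤ p2 ≤ p3.

```python
import numpy as np

def prob_positive_slope(p, n, reps=100_000, seed=4):
    """P(logistic-regression slope > 0) when groups are scored 1..K.
    The slope's sign equals the sign of the score sum (x - xbar) * y."""
    rng = np.random.default_rng(seed)
    p = np.array(p)
    K = len(p)
    counts = rng.binomial(n, p, size=(reps, K))     # responses per dose level
    scores = np.arange(1, K + 1) - (K + 1) / 2      # centered dose scores
    score_stat = counts @ scores                    # proportional to cov(dose, response)
    return (score_stat > 0).mean()

# K = 3 ordered doses, p = (0.40, 0.475, 0.55), N = 35 per arm
print(round(prob_positive_slope([0.40, 0.475, 0.55], 35), 2))
```

With equally spaced scores and equal group sizes, the score statistic reduces to the response count in the top dose minus that in the bottom dose, which is why the middle group's rate barely affects the result.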
IV. MISAPPLICATIONS AND RESULTING FALSE-POSITIVE RATES

As mentioned at the beginning, the principal misuse of the selection design is to treat the results as ends in themselves without the required phase III investigations. Liu et al. (3) previously published the false-positive rates of this misapplication. Briefly, for binary outcomes, when the true response rates are the same in all treatments and in the 10–20% range, the chance of observing an absolute 10% or higher difference between two treatments is roughly 0.20 to 0.35 when K = 2 and 0.30 to 0.45 when K = 3. When the shared response rate is in the 30–60% range, the chance of observing a 15% or greater difference is close to 0.20 for K = 2 and 0.20 to 0.30 for K = 3. There is a 0.10 chance of a greater than 20% difference if the true rate is 50% or 60% for either K. These are impressive-looking differences arising with high frequency purely by chance. Similarly, with Table 2 sample sizes and the same exponential survival distribution for all groups, the chances of observing a hazard ratio greater than 1.3 and 1.5 are 0.37 and 0.16, respectively, for K = 2, and 0.52 and 0.21, respectively, for K = 3. Again, observed hazard ratios of 1.3 to 1.5 often represent true treatment advances in large definitive phase III studies, but with selection sample sizes they appear with alarmingly high probability when there are no survival differences between treatments at all. Some have proposed changing the selection criterion so that the observed best treatment will be further studied only when its difference over the worst group is greater than some positive ∆; otherwise, none of the treatments will be pursued. Although this approach may seem appealing at face value, the sample size requirement for the same correct selection probability increases quickly as ∆ increases.
To illustrate, for binary data with K = 2, p1 = 50%, and p2 = 65%, 36 patients per treatment are required for ∆ = 0 and a 0.90 correct selection probability per Table 1. With the same configuration and 0.90 correct selection probability, the required number of patients per group is 40, 55, 79, and 123 when ∆ = 1%, 3%, 5%, and 7%, respectively. Clearly, when ∆ > 5% the required sample size is impractical. Even with ∆ = 5% and 79 patients per group, by design the correct selection probability remains 0.90 when the true response rates are 50% and 65%, yet the results are by no means definitive when a greater than 5% absolute difference is observed. When p1 = p2 = 50% and n1 = n2 = 79, the chances of observing |p̂1 − p̂2| > 5% and > 10% are approximately 0.53 and 0.21. Therefore, this approach is not recommended because it may instill a false sense of confidence that a truly superior treatment has been found when the observed results are positive. We (3) also pointed out that performing hypothesis tests post hoc changes the purpose of the design. If the goal is to reach definitive answers, then a phase III comparison should have been designed with appropriate analyses and error
rates. Testing hypotheses with selection sample sizes could be likened to conducting the initial interim analysis of a phase III trial; it is well known that extremely stringent p values are required to "stop the trial" at this early stage. Finally, the inclusion of a standard or control treatment in a selection design is especially dangerous. Without a control arm, any comparison between the standard and the observed best treatment from a selection trial is recognized as informal, because the limitations of historical comparisons are widely accepted. When a control arm is included in the randomization, the legitimacy of the comparison is established and there is great temptation to interpret the results literally and "move on." If there are no efficacy differences between treatments, the chance of observing an experimental treatment better than the control is (K − 1)/K, that is, 1/2 for K = 2, 2/3 for K = 3, and so on. The chance of an impressive difference between an experimental treatment and the control is as discussed above for K = 2, and higher than 2/3 of those figures for K = 3. In this case, false-negative conclusions are as damaging as false-positive ones, because a new treatment with efficacy similar to the control but less severe side effects could be dismissed as ineffective.
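Several of the probabilities quoted in this section are easy to reproduce by simulation. The sketch below (Python/NumPy) checks the ∆ = 5% illustration with 79 patients per group and true rates of 50% versus 65%, and then the false-positive behavior when both arms share a 50% rate.

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n, delta = 200_000, 79, 0.05

# Correct selection under p1 = 50%, p2 = 65% with Delta = 5% and 79 per arm
p1_hat = rng.binomial(n, 0.50, size=reps) / n
p2_hat = rng.binomial(n, 0.65, size=reps) / n
print(round(((p2_hat - p1_hat) > delta).mean(), 2))  # about 0.9

# False-positive rate when both arms share a 50% response rate
q1 = rng.binomial(n, 0.50, size=reps) / n
q2 = rng.binomial(n, 0.50, size=reps) / n
print(round((np.abs(q1 - q2) > delta).mean(), 2))    # still substantial
```

The first estimate confirms the designed 0.90 correct selection probability; the second shows that an observed difference exceeding ∆ = 5% remains a common chance event under identical treatments.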
V. CONCLUDING REMARKS
The statistical principles of the selection design are simple and adaptable to various situations in cancer clinical research. Applied correctly, the design can serve a useful function in the long and arduous process of new treatment discovery. However, perhaps because of the time and resources involved, there can be a tremendous tendency to stop short of phase III testing and declare winners at less than a quarter of the distance of this marathon. The resulting errors can send research into lengthy detours and do great disservice to cancer patients. Unless follow-on definitive evaluations are ensured by external means, the absence of a control treatment may be the only, albeit imperfect, safeguard for the design.
REFERENCES

1. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Controlled Clin Trials 1995; 16:119–130.
2. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treatment Rep 1985; 69:1375–1381.
3. Liu PY, LeBlanc M, Desai M. False positive rates of randomized phase II designs. Controlled Clin Trials 1999; 20:343–352.
4. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival. Biometrics 1993; 49:391–398.
5. Muggia FM, Liu PY, Alberts DS, et al. Intraperitoneal mitoxantrone or floxuridine: effects on time-to-failure and survival in patients with minimal residual ovarian cancer after second-look laparotomy—a randomized phase II study by the Southwest Oncology Group. Gynecol Oncol 1996; 61:395–402.
6. Green S, Dahlberg S. Planned versus attained design in phase II clinical trials. Stat Med 1992; 11:853–862.
7. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statistical Methodology. New York: Wiley, 1977.
7 Power and Sample Size for Phase III Clinical Trials of Survival

Jonathan J. Shuster
University of Florida, Gainesville, Florida
I. INTRODUCTION
This chapter is devoted to two-treatment, 1:1 randomized trials in which the outcome measure is survival or, more generally, time until an adverse event. In practical terms, planning requires far more in the way of assumptions than trials whose results accrue quickly. This chapter is concerned with the most common set of assumptions, proportional hazards, which implies that the ratio of the instantaneous probabilities of death at a given instant (treatment A : treatment B), for patients alive at time T, is the same for all values of T. Sample size guidelines are presented first for situations in which there is no sequential monitoring. These are then modified by an inflation factor to allow for O'Brien–Fleming-type sequential monitoring (1). The methodology is built initially on a two-piece exponential model, but this is later relaxed to cover proportional hazards in general. One of the most important aspects of this chapter centers on the nonrobustness of the power and type I error properties when proportional hazards are violated. This is especially problematic when sequential methods are used. Statisticians are faced with the need for such methods on the one hand and with the reality that they are based on questionable forecasts on the other. As a partial solution to this problem, a "leap-frog" approach is proposed, in which a trial accrues a block of patients and is then put on hold.
In Section II, a general sample size formulation for tests that are inversions of confidence intervals is presented. In Section III, results for exponential survival and estimation of the difference between hazard rates are given. In Section IV, the connection between the exponential results and both the logrank test, as described in Peto and Peto (2), and Cox regression (3) is given. Section V is devoted to a generalization of the results for the exponential distribution to the two-piece exponential distribution with proportional hazards, and to a numerical example. In Section VI, the necessary sample size is developed for the O'Brien–Fleming method of pure sequential monitoring; it is shown there that the maximum sample size inflation over trials with no monitoring is typically of the order of 7% or less. For group sequential plans, the inflation factor is even smaller. Section VII is devoted to ways of obtaining the "information fraction time," an essential ingredient for sequential monitoring. The fraction of failures observed (interim analysis to final planned analysis) is the recommended measure. If that measure is indeed used and the trial is run to a fixed number of failures, then the piecewise exponential assumption can be further relaxed to proportional hazards. Section VIII deals with application of the methods to more complex designs, including multiple treatments and 2 × 2 factorial designs. Section IX is concerned with competing losses. Finally, Section X gives the practical conclusions, including major cautions about the use of sequential monitoring.
II. A BASIC SAMPLE SIZE FORMULATION

The following formulation is peculiar looking but very useful (see Shuster (4) for further details). Suppose the following Eqs. (a) and (b) hold under the null (H0) and alternative (H1) hypotheses, respectively:

H0: ∆ = ∆0 vs. H1: ∆ = ∆1 > ∆0

P[(∆̂ − ∆0)/SE > Zα] ≈ α   (a)

P[W + (∆̂ − ∆1)/SE > −Zβ] ≈ 1 − β   (b)

where W = (Zα + Zβ)(S − SE)/SE. Here SE is a standard error estimate, calculated from the data only and valid under both the null and alternative hypotheses, and S is a function of the parameters (including sample size) under the alternative hypothesis that satisfies the implicit sample size equation

S = (∆1 − ∆0)/(Zα + Zβ)   (1)

Then it follows that under H1,

P[(∆̂ − ∆0)/SE > Zα] ≈ 1 − β   (2)

That is, the approximate α-level test has approximate power 1 − β.
Note that Zα and Zβ are usually (but need not always be) the upper 100α and 100β percentiles of the standard normal cumulative distribution function (CDF). Suppose you have a "confidence interval inversion" test, with normal distributions after standardizing by the standard error statistic. To validate the implicit sample size formula, you need only show that under H1, S/SE converges to one in probability. For two-sided tests, set up so that ∆1 > ∆0, α is replaced by α/2.

Binomial Example (one-sided test). For binomial trials with success rates P1 and P2 and equal sample size N per group,

S = sqrt[{P1(1 − P1) + P2(1 − P2)}/N]
∆1 = P1 − P2
∆0 = 0

and hence, from Eq. (1), each treatment will have sample size

N = [P1(1 − P1) + P2(1 − P2)]{(Zα + Zβ)/(P1 − P2)}²

This formula is useful for determining the sample size based on the Kaplan–Meier statistic (5). For a two-sided test, replace α by α/2 in the above expression.
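The binomial example can be coded directly. In the sketch below (Python), the z-values for one-sided α = 0.05 and power 0.80 are the standard constants 1.6449 and 0.8416, and the illustrative rates of 65% versus 50% are my own choice rather than the chapter's.

```python
import math

def binomial_n_per_group(p1, p2, z_alpha, z_beta):
    """Per-group N from the confidence-interval-inversion formula:
    N = [p1(1-p1) + p2(1-p2)] * ((z_alpha + z_beta)/(p1 - p2))**2."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(variance_sum * ((z_alpha + z_beta) / (p1 - p2)) ** 2)

# One-sided alpha = 0.05 (z = 1.6449), power 0.80 (z = 0.8416)
print(binomial_n_per_group(0.65, 0.50, 1.6449, 0.8416))  # 132 per group
```

As the formula shows, halving the detectable difference roughly quadruples the per-group sample size.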
III. EXPONENTIAL SURVIVAL

If the underlying distribution is exponential, it can be shown that the stochastic process that plots the number of deaths observed (Y-axis) against total accumulated time on test (X-axis) is a homogeneous Poisson process with rate equal to the constant hazard of the exponential distribution. For treatments i = 1, 2, let λi denote the hazard rate for treatment i, Fi the total number of failures, and Ti the total accumulated time on test. Then

λ̂i = Fi/Ti ~ Asy N[λi, λi²/E(Fi)]

Let ∆̂ = λ̂1 − λ̂2 and ∆ = λ1 − λ2. In the notation of the previous section, one has

SE = [(λ̂1²/F1) + (λ̂2²/F2)]^0.5
and

S = [λ1²/E(F1) + λ2²/E(F2)]^0.5   (3)

If patients are accrued uniformly (Poisson arrivals) over calendar time (0, X) and followed until death or until closure at time X + Y (Y being the minimum follow-up time), then the probability of death for a random patient assigned to treatment i is easily obtained as

Pi = 1 − Qi   (4)

where Qi = exp(−λiY)[1 − exp(−λiX)]/(λiX). The expected number of failures on treatment i is

E(Fi) = 0.5ΨXPi   (5)

where Ψ is the accrual rate for the study (half assigned to each treatment). If the allocations are unequal, with proportion γi assigned to treatment i (γ1 + γ2 = 1), simply replace the 0.5 in Eq. (5) by γi. This is useful in planning studies of prognostic factors, or clinical trials where the experimental treatment is very costly compared with the control. After substituting Eq. (3) into Eq. (1), the resulting equation must be solved iteratively (bisection is the method of choice) to identify the accrual period, X, required for given planning values of the accrual rate Ψ, the minimum follow-up period Y, and the values λ1 and λ2 (and hence ∆) under the alternative hypothesis. Similar methods for exponential survival have been published by Bernstein and Lagakos (6), George and Desu (7), Lachin (8), Morgan (9), Rubinstein et al. (10), and Schoenfeld (11). These methods use various transformations and thus yield slightly different though locally equivalent results. Schoenfeld allowed the incorporation of covariates, whereas Bernstein and Lagakos allowed one to incorporate stratification.
IV. APPLICATIONS TO THE LOGRANK TEST AND COX REGRESSION

Two important observations extend the utility of the above sample size formulation to many settings under proportional hazards.

1. Peto and Peto (2) demonstrated the full asymptotic local efficiency of the logrank test when the underlying survival distributions are exponential. This implies that the power and sample size formulas of Section III apply directly to the logrank test (as well as to the likelihood-based test used above). See Appendix I for further discussion of the efficiency of the logrank test. For two treatments, the test for no treatment effect in Cox regression (with no other covariates) is equivalent to the logrank test.

2. Two distributions have proportional hazards if and only if there exists a continuous, strictly monotonic increasing transformation of the time scale that simultaneously converts both to exponential distributions. This means that an investigator can plan any trial that assumes proportional hazards, as long as a planning transformation that converts the outcomes to exponential distributions is prespecified. This can be approximated well if historical control data are available. The only problems are to redefine the λi and to evaluate E(Fi), taking the transformation into account, since accrual may not be uniform on the transformed time scale.

Three articles offer software that can do the calculations: Halperin and Brown (12), Cantor (13), and Henderson et al. (14). These programs (or the papers, as in the original) can also investigate the robustness of the sample size to deviations from the distributional assumptions. Another approach, which is quite robust to failure to correctly identify the transformation, presumes a two-piece exponential model. This is discussed in the next section along with a numerical example.
V. PLANNING A STUDY WITH PIECE-WISE EXPONENTIAL SURVIVAL

A. Parameters to Identify

1. Input Parameters

Ψ = annual planning accrual rate, total (e.g., 210 patients per year, 50% per arm)
Y = minimum follow-up time (e.g., Y = 3 years)
R1 = planning Y-year survival under the control treatment (e.g., R1 = 0.60)
R2 = planning Y-year survival under the experimental treatment (e.g., R2 = 0.70)
Sidedness of test, one or two (e.g., one)
α = type I error (e.g., 0.05)
π = 1 − β, the power of the test (e.g., 0.80)
ρ = post-time-Y : pre-time-Y hazard ratio (e.g., 0.5)

ρ indexes the piecewise exponential model: before the minimum follow-up time Y, the hazard on treatment i is λi; after Y, it is ρλi. Note that λi = −ln(Ri)/Y.
2. Output Parameters

X = accrual period (e.g., 2.33 years)
X + Y = total study duration (e.g., 5.33 years)
ΨX = total accrual required (e.g., 490 patients)
E(F1) + E(F2) = total expected failures (e.g., 196)
Based on note 2 of Section IV, the calculation of E(Fi) for Eqs. (3) and (5) is straightforward. For a randomly selected patient, the time at risk is uniformly distributed over the period from Y to X + Y, and the probability of death, Di(x), conditional on a time at risk Y + x, is the probability of death by time Y, (1 − Ri), plus the probability of surviving to time Y but dying between times Y and x + Y, that is,

Di(x) = (1 − Ri) + Ri[1 − exp(−ρλix)]

The unconditional probability of death for a randomly selected patient on treatment i is found by taking the expectation of Di(x) over the uniform distribution for x from 0 to X. For this piecewise exponential model and Poisson accrual process, the value of Qi to be used in Eqs. (4) and (5) is

Qi = Ri[1 − exp(−ρλiX)]/(ρλiX)   (6)

Note that λi is defined via the equation

Ri = exp(−λiY)   (7)

and hence for exponential data (ρ = 1) the value of Qi agrees with that below Eq. (4).

B. Key Observations
1. The expected number of failures on treatment i in the first Y years of patient risk is 0.5ΨX(1 − Ri), where Ψ is the annual accrual rate, X is the accrual period, and Ri is the planning Y-year survival on treatment i.

2. If one transformed the subset (0, Y) of the time scale onto (0, Y) with a strictly monotonic increasing function, the expected number of failures occurring before patient time Y would be unchanged. This in turn implies that the exponential assumption over the interval (0, Y) can be relaxed to require proportional hazards only. (From Eq. (7), the definition λi = −ln(Ri)/Y used in Eq. (3) depends only on the fixed Y-year survival on treatment i, Ri, and is not affected by the transformation.)

3. All other things being equal, Eq. (3) implies that the larger the expected number of failures, the greater the power of the test. Since the larger the value of ρ, the greater the hazard after time Y, power increases as ρ increases.

4. If hazard rates are low after time Y, the exponentiality assumption post-Y is not important (although proportional hazards may be). It is convenient to
think of ρ as an average hazard ratio (post- to pre-year Y), since it enters the sample size equation only through the expected number of failures. In fact, if one used the above sample size methodology to approximate the accrual duration and minimum follow-up but actually ran the trial until the total number of failures equaled the total expected in the study, E(F1) + E(F2), then the power holds up under proportional hazards, without regard to the piecewise exponential assumption. To further illustrate this point, in the above example the planning hazard ratio (HR) for survival rates of 60% versus 70% at 3 years is 0.698. Based on Shuster (15), if each failure is considered to approximate an independent binomial event, which under the null hypothesis has a 50% probability of falling in each treatment and under the alternative a HR/(1 + HR) = 0.698/1.698 = 0.411 probability of being in the experimental arm versus 0.589 in the control arm, then from the results of Section II the study should accrue until 189 failures have occurred (nearly identical to the 196 expected failures derived from the piecewise exponential model).

5. The variance (the square of the right-hand side of Eq. (3)) decreases more rapidly than 1/X, the reciprocal of the accrual duration, because although the number of patients increases linearly, the longer time at risk increases the probability of death for each entrant. However, for small increments in X, the rate of change is approximately proportional to the derivative of 1/X, namely −1/X². This will be used to apply an inflation factor for sequential monitoring (Section VI).

6. The effect of ρ in the above example, where accrual lasts 2.33 years and minimum follow-up is 3 years, is relatively modest. Under exponentiality (ρ = 1), 448 patients would be required; in the actual case, ρ = 0.5, 490 patients; whereas if ρ = 0.0, 561 patients would be needed. The effect of ρ is much more striking if accrual is slow.
Under the same parametrization, but with an accrual of 60 patients per year instead of 210, the patient requirements are 353, 407, and 561 for ρ = 1, 0.5, and 0.0, respectively.

7. Software for these calculations is available in Shuster (5), on a Windows platform. A Macintosh version exists but must invoke 16-bit architecture. Appendix II contains a SAS macro that can also be used for the calculations. The equal patient allocation of 50% to each arm is nearly optimal in terms of minimizing the required sample size, but the macro allows for either equal or unequal allocation.
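As a quick check of the numbers in the notes above, the planning hazard ratio implied by fixed-term survival rates, and the binomial probability used in the failure-count approach of Shuster (15), can be computed directly. The following Python sketch is illustrative only and is not part of the original chapter:

```python
import math

def planning_hazard_ratio(s_exp, s_ctl, years):
    """Hazard ratio (experimental:control) implied by fixed-term
    survival rates under proportional hazards."""
    # Under exponential-type survival, S(t) = exp(-lambda * t),
    # so lambda = -log(S)/t and the HR is the ratio of hazards.
    lam_exp = -math.log(s_exp) / years
    lam_ctl = -math.log(s_ctl) / years
    return lam_exp / lam_ctl

hr = planning_hazard_ratio(0.70, 0.60, 3)   # 0.698 in the text
p_exp = hr / (1 + hr)                       # 0.411 in the text
print(round(hr, 3), round(p_exp, 3))
```

Each failure is then treated as a binomial event falling in the experimental arm with probability HR/(1 + HR), which is what makes the fixed-failure-count implementation possible.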
VI. SEQUENTIAL MONITORING BY THE O'BRIEN–FLEMING METHOD

In this section, elementary asymptotic considerations of the Poisson processes obtained in exponential survival allow one to use asymptotic Brownian motion to connect the power under full sequential monitoring to the power with no monitoring.
Shuster
First, define a ‘‘time parameter’’ θ, 0 < θ < 1, with θ = 1 being the maximum allowable time of completion, such that ∆̂_θ is asymptotically N(∆, S²/θ), where θ represents the ratio of variances of the estimate of effect size (final to interim analysis) and S² is the variance of the estimate of effect size at the planned final analysis (θ = 1), calculated per Eq. (3). From asymptotic considerations of the Poisson process (with the notation the same as in the section on the exponential, except that the index θ was added to delineate the time scale), θ∆̂_θ is asymptotically a Brownian motion process with drift ∆ and diffusion constant S². From first passage time considerations (see Cox and Miller (16), for example), on the basis of the rejection region for testing ∆ = 0 versus ∆ > 0, the power function for the O'Brien–Fleming test that rejects the null hypothesis if at any time θ

θ∆̂_θ > S Z_{α/2}

is

Π(∆, S, α) = Φ[(∆/S) − Z_{α/2}] + exp[(2∆Z_{α/2})/S] Φ[−(∆/S) − Z_{α/2}]

where Φ is the standard normal CDF and Z_p is the upper 100p percentile of Φ. Note that Π(0, S, α) = α. For no monitoring, the power function is defined by

Π_no(∆, S, α) = Φ[(∆/S) − Z_α]

For example, if investigator 1 ran a study sensitive to an effect size ∆/S = 2.486 and required no monitoring, and investigator 2 ran a slightly larger study sensitive to an effect size of ∆/S = 2.576 but used continuous O'Brien–Fleming bounds for sequential monitoring, the two studies would both have 80% power
Table 1 Sensitivity of No Monitoring (None) vs. Continuous Monitoring by O'Brien–Fleming (O-F)

α        Power    None ∆/S    O-F ∆/S
0.010    0.80     3.168       3.238
0.025    0.80     2.801       2.881
0.050    0.80     2.486       2.576
0.010    0.90     3.608       3.688
0.025    0.90     3.242       3.332
0.050    0.90     2.926       3.026

∆/S represents the detectable effect size.
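The two power functions above are easy to evaluate numerically. The sketch below (illustrative Python, not from the chapter, with the standard normal CDF built from math.erf) reproduces the α = 0.05, 80% power row of Table 1:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z(p):
    """Upper 100p percentile of the standard normal, by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - Phi(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power_of(d, alpha):
    """Power of the continuous O'Brien-Fleming test for effect size d = Delta/S:
    Phi(d - Z_{a/2}) + exp(2*d*Z_{a/2}) * Phi(-d - Z_{a/2})."""
    za2 = z(alpha / 2.0)
    return Phi(d - za2) + math.exp(2.0 * d * za2) * Phi(-d - za2)

def power_none(d, alpha):
    """Power with no monitoring: Phi(d - Z_alpha)."""
    return Phi(d - z(alpha))

print(round(power_none(2.486, 0.05), 3))  # ~0.80 (Table 1, "None")
print(round(power_of(2.576, 0.05), 3))    # ~0.80 (Table 1, "O-F")
```

Note that power_of(0, α) returns α, matching the identity Π(0, S, α) = α in the text.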
Power and Sample Size in Phase III Trials
at α = 0.05. There is almost no penalty for this type of monitoring, even if the study runs to completion. Since, as remarked in Section V, note 5, slightly increasing the accrual duration changes the variance in proportion to the change in the reciprocal of the accrual duration, an approximate measure of the increase in accrual mandated by continuous O'Brien–Fleming monitoring is the square of the ratio of the entries in Table 1. The inflation factors would be 4% (α = 0.01), 6% (α = 0.025), and 7% (α = 0.05). Group sequential monitoring by the O'Brien–Fleming method would require a smaller inflation factor for the maximum sample size, since it falls between no monitoring and continuous monitoring. The software package ‘‘East’’ (17) does not handle nonexponential data but can be used to derive an approximate inflation factor for true group sequential designs; experience has shown that factor to be not much smaller than the one derived for continuous monitoring. In the example above, where 490 patients were needed for a trial conducted without sequential monitoring, an additional 7% would bring the necessary accrual to 524 (34 more entrants) if the trial were to be sequentially monitored by the O'Brien–Fleming method. This represents an accrual duration of at most 2.49 years with monitoring versus 2.33 years without monitoring.
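The inflation factors quoted in the text follow directly from squaring the ratios of the Table 1 entries. A small illustrative Python sketch (not part of the chapter):

```python
# Inflation factors for continuous O'Brien-Fleming monitoring at 80% power,
# computed as the square of the ratio of Table 1 entries
# (O-F detectable effect size over no-monitoring effect size).
table1_80 = {0.010: (3.168, 3.238), 0.025: (2.801, 2.881), 0.050: (2.486, 2.576)}

for alpha, (none, of) in table1_80.items():
    inflation = (of / none) ** 2
    print(alpha, round(100 * (inflation - 1)))  # ~4%, 6%, 7%, as in the text

# Applied to the chapter's example: 490 patients with no monitoring,
# inflated by 7% for continuous O'Brien-Fleming monitoring.
print(round(490 * 1.07))  # 524 entrants, matching the text
```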
VII. EVALUATION OF THE INFORMATION FRACTION

It is of interest to note that for the piece-wise exponential model, one can derive the information fraction as the ratio of the final to interim variance, as in the strict definition (see Eq. (3)). All one needs to compute is the expected number of failures at an interim analysis. Others use the ratio of expected failures (interim to final), whereas still others use the ratio of actual failures to total expected. Although the use of variance ratios per Eq. (3) appears to be quite different from the ratio of expected failures when the hazard ratios are not close to one, for studies planned with survival differences of the order of 15% or less, where the expected total failures is 50 or higher, the two are almost identical. In the numerical example above, where planning accrual is 210 per year, the planning difference is a 10% improvement in 3-year survival from 60% (control) to 70% (experimental), and monitoring is handled by a continuous O'Brien–Fleming approach, it was concluded that 2.49 years of accrual plus 3 years of minimal follow-up (total maximum duration of 5.49 years) were needed. The information fractions, θ, for calendar times 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 5.49 years are, respectively, for Eq. (3) and for expected failure ratios in brackets: 1.0 years, 0.0685 [0.0684]; 1.5 years, 0.1504 [0.1502]; 2.0 years, 0.2611 [0.2607]; 2.5 years, 0.3985 [0.3979]; 3.0 years, 0.5424 [0.5418]; 3.5 years, 0.6704 [0.6698]; 4.0 years, 0.7785 [0.7780]; 5.49 years, 1.000 [1.000]. These numbers are impressively close, despite the fact that under the alternative hypothesis, the planning hazard ratio (experimental to control) is 0.698, hardly a local alternative to the null value of 1.00. As noted above, the most robust approach is to use actual failures and terminate at a maximum of the expected number, which can be shown to be 211 for this continuous O'Brien–Fleming design.
VIII. MULTIPLE TREATMENTS AND STRATIFICATION

The methods can be applied to pair-wise comparisons in multiple treatment trials. If one wishes to correct for multiple comparisons (a controversial issue), then one should use a corrected level of α but keep the original power. Note that the accrual rate would be computed for the pair of treatments being compared. For example, if the study is a three-armed comparison and accrual is estimated to be 300 patients per year, 200 per year would be accrued to each pair-wise comparison. The definition of power applies to each pair-wise comparison.

In most applications, it is my opinion that the planning α level should not be altered for multiple comparisons. This is because the inference about A versus B should be the same, for the same data, whether or not there was a third arm C. In other words, had a hypothetical trial of only A versus B been run and accrued the same data, one could reach a different conclusion if the actual analysis had been corrected for the number of pair-wise comparisons in the original trial.

In general, the use of stratification on a limited scale can increase the precision of the comparison. However, the improvement over a completely randomized design is difficult to quantify. Hence, the nonstratified plan represents a conservative estimate of patient needs when the design and analysis are in fact stratified. If a 2 × 2 factorial study is conducted, it could be analyzed as a stratified logrank test, stratifying for the concomitant treatment. Interventions should be carefully selected to minimize the potential for qualitative interaction, a situation where the superior treatment depends on which of the concomitant treatments is used.
If the assumption of no major interaction in the hazard ratios is reasonable, a study planned as if it were a two-treatment nonstratified study will generally yield a sample size estimate slightly larger (but unquantifiably so) than needed for the stratified analysis, presuming proportional hazards hold within the strata. If a qualitative interaction is indeed anticipated, then the study should be designed as a four-treatment trial for the purposes of patient requirements.
IX. COMPETING LOSSES

In general, competing losses, patients censored for reasons other than being alive at the time of the analysis, are problematic unless they are treatment uninformative (the reason for the loss is presumed to be independent of the treatment assignment and, at least conceptually, unrelated to the patient's prognosis). Although it is possible to build these competing losses into the sample size calculation in a quantified way [see (10)], a conservative approach, preferred by this author, is to use a second ‘‘inflation factor’’ for competing losses. For example, if L = 0.10 (10%) of patients are expected to be lost for competing reasons, the final sample size would be obtained by dividing the initial sample size calculation by (1 − L) = 0.9.
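The loss-inflation adjustment is a one-line calculation; the following Python sketch (illustrative, not part of the chapter; the 490-patient figure is the chapter's earlier example) shows it applied:

```python
import math

def inflate_for_losses(n, loss_rate):
    """Inflate a sample size n for a fraction loss_rate of
    treatment-uninformative competing losses (conservative approach):
    divide by (1 - L) and round up."""
    return math.ceil(n / (1.0 - loss_rate))

# Chapter's example trial: 490 patients before accounting for losses,
# with 10% expected competing losses (illustrative combination).
print(inflate_for_losses(490, 0.10))  # 545
```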
X. PRACTICAL CONCLUSIONS

The methods proposed herein allow a planner to think in terms of fixed-term survival rather than hazard rates or hazard ratios. The use of a fixed number of failures in implementation also allows for simple estimates of the information fraction, as the failures to date divided by the total failures that would occur if the study ran to completion.

Irrespective of the proportional hazards assumption, the actual test (logrank or Cox) is valid for testing the null hypothesis that the survival curves are identical. However, the sample size calculation and power are sensitive to violations of the proportional hazards assumption and especially to ‘‘crossing hazards.’’ If a superior treatment is associated with a high propensity for early deaths, an early significance favoring the wrong treatment may emerge, causing the study to be stopped early and the incorrect conclusion to be reached. For example, if at an early interim analysis one treatment (e.g., bone marrow transplant) is more toxic and appears to have inferior outcome when compared with the other (chemotherapy alone), it might be very tempting but unwise to close the study for lack of efficacy. Although sequential analysis is often an ethical necessity, it is also an ethical dilemma, since statisticians are really being asked to forecast the future with little in the way of reliable information to work with.

For studies where proportional hazards is considered a reasonable assumption, it is recommended that the study be planned as if the piece-wise exponential assumptions hold. If the plan is to monitor the study by the O'Brien–Fleming method, whether group sequential or purely sequential, apply the small inflation factor obtained by squaring the ratio of the entries in the ∆/S columns of Table 1. Next, apply Eq. (3), using the expected values in Eqs. (4), (5), and (6) for the piece-wise model, to obtain the expected number of failures to be seen in the trial.
Conduct the trial as planned until this many failures occur or until the study is halted for significance by crossing an O'Brien–Fleming bound, where the information fraction is calculated as actual failures divided by the maximum number of failures that would occur at the planned final analysis.

One possible recommendation, for survival trials where there is concern about the model assumptions, is to conduct accrual in stages. For example, it
might be prudent to accrue patients onto a trial (A) for a period of 1 year and then, irrespective of any interim results, close the trial temporarily and begin accrual on a new trial (B) for a period of 1 year. A decision as to whether to continue accrual to trial (A) for another year or begin yet another trial (C) for a year could then be made with more mature data than would otherwise be possible. This leap-frog approach would continue in 1-year increments. The process might slow down completion of trial A but also might speed up completion of trials B and C. In addition, for safety purposes, there would be a much smaller disparity between calendar time and information time. In the numerical example initiated in Section V, for the O'Brien–Fleming design, the information fraction at 1 year (about 40% of accrual) would be only 7%, whereas the information fraction at 2 years (about 80% of accrual completed) would be 26% (19% coming from the first year of accrual plus only 7% coming from the second year of accrual). In fact, if one had a 1-year ‘‘time out’’ for accrual but ran the second stage for 1.2 years, analyzing the data 3.0 years after the end of accrual, the study would have the same power as the 2.49-year accrual study (both with continuous O'Brien–Fleming bounds). Completion would occur after 6.2 years instead of 5.49 years, presuming the study ran to completion. Offsetting this difference, trial B would be completed about 0.8 years earlier under the leap-frog approach. The slowdown generated by the leap-frog approach would also allow the analyst to get a better handle on model diagnostics while fewer patients were at risk.
ACKNOWLEDGMENT

Supported in part by grant 29139 from the National Cancer Institute.
APPENDIX I: LOGRANK VS. EXPONENTIAL ESTIMATION—LOCAL OPTIMALITY

For treatment i and day j, let N_ij and F_ij represent the number at risk at the start of the day and the number of failures on that day, respectively. For the logrank test, the contribution to the observed minus expected for day j (to the numerator of the statistic, where the denominator is simply the standard error of the numerator) is

F_1j − N_1j(F_1j + F_2j)/(N_1j + N_2j) = {(F_1j/N_1j) − (F_2j/N_2j)}/(N_1j^−1 + N_2j^−1)

The day j estimate of each hazard, F_ij/N_ij, is weighted inversely proportional to (N_1j^−1 + N_2j^−1), that is, proportional to N_1j N_2j/(N_1j + N_2j). The weight is zero if either group has no patient at risk at the start of day j. For the exponential test, the contribution to the estimate of the hazard for day j is

(N_ij/N_i.)(F_ij/N_ij)

where N_i. is the total days on test for treatment i. Under the null hypothesis, and under exponentiality, both tests weight the information inversely proportional to the variance, and as such, both are locally optimal. Using the approximation that the N_ij are fixed rather than random, a relative efficiency measure can easily be obtained for nonlocal alternatives, and in practice, the relative efficiency for studies planned as above with 30 or more failures will generally be only trivially less than 1.00 (typically 0.99 or higher). Note that there are two things helping the logrank test. First, if the ratio N_1j/N_2j remains fairly constant over time j = 1, 2, and so on, until relatively few are at risk, the weights will be very similar except when both become very small. Second, stratified estimation is robust against modest departures from the ‘‘optimal allocation,’’ which is represented by the exponential weights.

Comment: The mathematical treatment of this chapter is somewhat unconventional in that it uses the difference in hazards for the exponential rather than the hazard ratio or log hazard ratio. This enables a user to directly investigate the relative efficiency of the logrank versus the exponential test. On the negative side, by not using a transformation, the connection to expected failures is not as direct as it would be under the transform to a hazard ratio.

Comment: SAS (Statistical Analysis System) macros for the logrank and stratified logrank tests are included in Appendix III.
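The algebraic identity behind the logrank day-j contribution can be verified numerically. The following Python sketch (illustrative only, with made-up at-risk and failure counts) checks that the two forms agree:

```python
def logrank_day_contribution(n1, f1, n2, f2):
    """Observed-minus-expected contribution for one day:
    F_1j - N_1j * (F_1j + F_2j) / (N_1j + N_2j)."""
    return f1 - n1 * (f1 + f2) / (n1 + n2)

def weighted_hazard_difference(n1, f1, n2, f2):
    """Equivalent form: {(F_1j/N_1j) - (F_2j/N_2j)} / (N_1j^-1 + N_2j^-1),
    i.e., the hazard difference weighted by N_1j*N_2j/(N_1j + N_2j)."""
    return ((f1 / n1) - (f2 / n2)) / (1.0 / n1 + 1.0 / n2)

# Hypothetical day: 100 at risk with 3 failures vs. 80 at risk with 2 failures.
a = logrank_day_contribution(100, 3, 80, 2)
b = weighted_hazard_difference(100, 3, 80, 2)
print(round(a, 6), round(b, 6))  # both ~0.222222
```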
APPENDIX II: SAS MACRO FOR SAMPLE SIZE AND ACCRUAL DURATION

Note that this macro can handle unequal allocation of patients, something useful for planning studies of the prognostic significance of yes/no covariates.

Usage

%ssize(ddsetx,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);
%ssize(a,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);

ddsetx = user-supplied name for the data set containing the planning parameters. See Section V.
alloc = fraction allocated to the control group (use .5 for 1-1 randomized trials)
psi = annual accrual rate
y = minimum follow-up
r1 = control group planned Y-year survival
r2 = experimental group planned Y-year survival
side = 1 (one-sided) or 2 (two-sided)
alpha = size of type I error
pi = power
rho = average post:pre Y-year hazard ratio
lfu = fraction expected to be lost to follow-up

%MACRO ssize(ddsetx,alloc,psi,y,r1,r2,side,alpha,pi,rho,lfu);
options ps=60 ls=75;
OPTIONS NOSOURCE NONOTES;
DATA DDSETX;set &ddsetx;
alloc=&alloc;psi=&psi;y=&y;r1=&r1;r2=&r2;side=&side;
alpha=&alpha;pi=&pi;rho=&rho;lfu=&lfu;
* Hazards implied by the planned Y-year survival rates;
lam1=-log(r1)/y;
lam2=-log(r2)/y;
del=abs(lam1-lam2);
za=probit(1-alpha/side);
zb=probit(pi);
s=del/(za+zb);
* Search for accrual duration x: coarse 1-year steps, then .01-year steps;
x=0;inc=1;
aa:x=x+inc;
q1=r1*(1-exp(-rho*lam1*x))/(rho*lam1*x);
q2=r2*(1-exp(-rho*lam2*x))/(rho*lam2*x);
p1=1-q1;
p2=1-q2;
v1=((lam1**2)/(alloc*psi*x*p1));
v2=((lam2**2)/((1-alloc)*psi*x*p2));
s2=sqrt(v1+v2);
if s2>s and inc>.9 then goto aa;
if inc>.9 then do;inc=.01;x=x-1;goto aa;end;
if s2>s then goto aa;
n=int(.999+psi*x);
na=n/(1-lfu);na=int(na+.999);
ex_fail=psi*x*(alloc*p1+(1-alloc)*p2);
data ddsetx;set ddsetx;
label alloc='Allocated to Control'
rho='Post to Pre y-Year Hazard'
psi='Accrual Rate'
y='Minimum Follow-up (MF)'
r1='Planned Control Survival at MF'
r2='Planned Experimental Survival at MF'
side='Sidedness'
alpha='P-Value'
pi='Power'
lfu='Loss to Follow-up Rate'
x='Accrual Duration'
n='Sample Size, No Losses'
na='Sample Size with Losses'
ex_fail='Expected Failures';
proc print label;
var alloc psi y r1 r2 side alpha pi rho lfu x n na ex_fail;
run;
%mend;

To produce the example in Section V, this program was run:

data ddsetx;
input alloc psi y r1 r2 side alpha pi rho lfu;
cards;
.5 210 3 .6 .7 1 .05 .8 1 .1
.5 210 3 .6 .7 1 .05 .8 .5 .1
.5 210 3 .6 .7 1 .05 .8 .000001 .1
.5 60 3 .6 .7 1 .05 .8 1 .1
.5 60 3 .6 .7 1 .05 .8 .5 .1
.5 60 3 .6 .7 1 .05 .8 .000001 .1
;
%ssize(ddsetx, alloc, psi, y, r1, r2, side, alpha, pi, rho, lfu);

APPENDIX III: SAS MACRO FOR LOGRANK AND STRATIFIED LOGRANK TESTS

Sample usage:

%logrank(ddset, time, event, class)
%logrank(a, x1, x2, treat no) for the unstratified logrank test
%logstr(ddset, time, event, class, stratum)
%logstr(a, x1, x2, treat no, gender) for the stratified logrank test

time, event, class, and stratum are user-supplied names in the user-supplied data set called ddset.
ddset = name of the data set needing analysis
time = time variable in data set ddset (time < 0 not allowed)
event = survival variable in data set ddset; event = 0 (alive) or event = 1 (died)
class = categorical variable in ddset identifying the treatment group
stratum = categorical variable in ddset defining the stratum for the stratified logrank test
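As a cross-check on the %ssize logic, the same accrual-duration search can be transcribed into Python. This sketch is written for this purpose and is not part of the chapter; it reproduces the Section V requirements of 448, 490, and 561 patients for ρ = 1, 0.5, and ρ near 0 (the macro's .000001):

```python
import math
from statistics import NormalDist

def accrual_search(alloc, psi, y, r1, r2, side, alpha, pi, rho):
    """Transcription of the %ssize search: smallest accrual duration x
    (on a 0.01-year grid) whose standard error meets the target."""
    nd = NormalDist()
    lam1 = -math.log(r1) / y          # control hazard
    lam2 = -math.log(r2) / y          # experimental hazard
    za = nd.inv_cdf(1 - alpha / side)
    zb = nd.inv_cdf(pi)
    target = abs(lam1 - lam2) / (za + zb)  # required SE of the effect estimate

    def se(x):
        # q_i: survival probability to analysis for a uniformly accrued patient
        q1 = r1 * (1 - math.exp(-rho * lam1 * x)) / (rho * lam1 * x)
        q2 = r2 * (1 - math.exp(-rho * lam2 * x)) / (rho * lam2 * x)
        v1 = lam1 ** 2 / (alloc * psi * x * (1 - q1))
        v2 = lam2 ** 2 / ((1 - alloc) * psi * x * (1 - q2))
        return math.sqrt(v1 + v2)

    x = 1.0
    while se(x) > target:              # coarse search in 1-year steps
        x += 1.0
    x -= 1.0
    while x == 0.0 or se(x) > target:  # fine search in 0.01-year steps
        x = round(x + 0.01, 2)
    n = math.ceil(psi * x - 1e-9)      # patients accrued in x years
    return x, n

for rho in (1.0, 0.5, 1e-6):
    print(accrual_search(0.5, 210, 3, 0.6, 0.7, 1, 0.05, 0.8, rho)[1])
# Expected, from Section V: 448, 490, 561
```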
REFERENCES

1. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556.
2. Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J R Stat Soc 1972; A135:185–206.
3. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc 1972; B34:187–220.
4. Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton, FL: CRC Press, 1992.
5. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53:457–481.
6. Bernstein D, Lagakos SW. Sample size and power determination for stratified clinical trials. J Stat Comput Simul 1987; 8:65–73.
7. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis 1974; 27:15–29.
8. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Controlled Clin Trials 1981; 2:93–113.
9. Morgan TM. Planning the duration of accrual and follow-up for clinical trials. J Chronic Dis 1985; 38:1009–1018.
10. Rubinstein LJ, Gail MH, Santner TJ. Planning the duration of a comparative clinical trial with loss to follow-up and a period of continued observation. J Chronic Dis 1981; 34:469–479.
11. Schoenfeld DA. Sample size formula for the proportional-hazards regression model. Biometrics 1983; 39:499–503.
12. Halperin J, Brown BW. Designing clinical trials with arbitrary specification of survival functions and for the log rank test or generalized Wilcoxon test. Controlled Clin Trials 1987; 8:177–189.
13. Cantor AB. Power estimation for rank tests using censored data: conditional and unconditional. Controlled Clin Trials 1991; 12:462–473.
14. Henderson WG, Fisher SG, Weber L, Hammermeister KE, Sethi G. Conditional power for arbitrary survival curves to decide whether to extend a clinical trial. Controlled Clin Trials 1991; 12:304–313.
15. Shuster JJ. Fixing the number of events in large comparative trials with low event rates: a binomial approach. Controlled Clin Trials 1993; 14:198–208.
16. Cox DR, Miller HD. The Theory of Stochastic Processes. London: Methuen, 1965.
17. Mehta C. EaST: Early Stopping in Clinical Trials. Cambridge, MA: CyTEL Software Corporation.
8
Multiple Treatment Trials

Stephen L. George
Duke University Medical Center, Durham, North Carolina

I. INTRODUCTION
Randomized clinical trials involving more than two treatments present problems and challenges in their design and analysis that are absent in trials involving only two treatments. These problems are part of the larger issues raised by multiplicities in clinical trials. Multiplicity (1–3) refers to issues concerning multiple outcome variables, subgroups, sequential analysis, covariate adjustments, and multiple treatments. This chapter deals with multiple treatments, although some issues raised here are relevant to other multiplicity topics. When there are only two treatments in a trial, inference is relatively straightforward, since the only direct comparison is between the two treatments. When there are more than two treatments, inference is more complex, and this complexity affects the design, conduct, and analysis of the trial (4). For example, with three treatments, in addition to the global comparison of all three treatments simultaneously, there are three possible paired comparisons among the three treatments and an unlimited number of other comparisons arising from various pooled weightings of the treatment groups. The number of paired and subset comparisons increases rapidly as the number of treatments increases. In this setting, it is vitally important to specify clearly the objectives of the trial and to be certain that these objectives are reflected in an appropriate design and analysis.
In the following sections, several common settings in which more than two treatments are involved are presented. These include multiple independent treatments, multiple experimental treatments compared with a control, factorial designs, and selection designs. In each setting, design strategies are presented along with their implications for sample size and for the conduct and analysis of the trial.
II. MULTIPLICITY AND ERROR RATES

A. Types of Errors
The primary difficulty raised by multiplicity, including multiple treatments, is that error rates may be elevated beyond their putative or nominal level. A simple and well-known example of this phenomenon arises in multiple significance testing. If we conduct N independent tests of significance, each at the nominal α level, the overall probability of finding at least one erroneous ‘‘significant’’ result when there are truly no differences is 1 − (1 − α)^N. For α = 0.05, this error probability becomes quite large even for moderate N. For N = 5, it is 0.23. For N = 10, it is 0.40. Multiple testing with no adjustment of individual error rates dramatically increases the probability of finding spurious, but significant, differences.

In the classic statistical hypothesis testing framework, two types of errors can occur in testing a specified null hypothesis against an alternative non-null hypothesis. Erroneously rejecting a true null hypothesis is a type I error. Erroneously failing to reject a false null hypothesis is a type II error. The error rates for these two types of errors are usually denoted by α and β, respectively. The complement of the type II error rate (1 − β) is referred to as the statistical power of the test procedure. These concepts apply equally to the overall experiment (‘‘experimental’’ error rates) or to the individual comparisons within the experiment. However, as the calculation in the previous paragraph illustrates, if we test all possible pairs of treatments and wish to set the experimental type I error rate to α, the individual type I error rates must each be less than α. But if we reduce the individual type I error rates without increasing the sample size, the individual type II error rates are increased. Thus, proper control of the experimental type I error rate can lead to highly conservative procedures with low power to detect individual differences.
The sample size can be increased to avoid this problem, but this may require an unacceptably large sample size. The purpose of this chapter is to describe techniques of statistical design and analysis that address these issues.
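The familywise error calculation above is easy to reproduce. A short illustrative Python sketch (not from the chapter):

```python
# Probability of at least one spurious "significant" result among
# N independent tests, each run at nominal level alpha, when no
# real differences exist: 1 - (1 - alpha)^N.
def familywise_error(alpha, n_tests):
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 10):
    print(n, familywise_error(0.05, n))  # rises to ~0.23 (N=5) and ~0.40 (N=10)
```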
B. Bonferroni and Related Procedures

One of the older procedures used to control the experimental error rate is the Bonferroni procedure, based on a simple form of one of the Bonferroni inequalities:

P ≤ Σ_{i=1}^{N} p_i

where P is the probability of at least one event occurring out of N possible events and p_i is the probability of the ith event (i = 1, . . . , N). If we want P ≤ α, one approach is to set p_i = α/N. Hence, the simplest Bonferroni procedure, applied to multiple significance tests, is to use a significance level α/N for each individual test, where N is the number of tests and α is the overall type I error rate (5). This procedure is guaranteed to yield an overall error rate no larger than α when the global null hypothesis is true. However, it is a very conservative procedure, particularly when N is large and the tests are correlated. Application to treatment trials involving K treatments in which the tests are limited to all pair-wise comparisons yields a significance level of 2α/K(K − 1) for each individual test. For example, if α = 0.05 and K = 5, the significance level would be 0.005 for each of the 10 possible paired comparisons.

Modifications of the simple Bonferroni procedure have been proposed to mitigate its overly conservative nature. One of these (6) uses a significance level of jα/N for the jth ordered test ( j = 1, . . . , N). That is, ordering the p values p_(1) ≤ ⋅⋅⋅ ≤ p_(N), one rejects the global null hypothesis if p_( j ) ≤ jα/N for any j. This procedure is less conservative than the simple Bonferroni procedure. It has been extended by Hochberg (7), and Holm (8) defined a procedure that rejects the hypothesis H_(i) corresponding to p_(i) if and only if

p_( j ) ≤ α/(N − j + 1)   for all j ≤ i

This procedure is identical to Bonferroni for i = 1 but not as conservative otherwise. Hochberg and Benjamini (9) discuss these procedures and provide some suggested improvements.

To avoid complexity and to focus attention on problems of multiplicity arising from multiple treatments, in the remainder of this chapter it is assumed that the K treatments have a single primary outcome measure, x_i, which is normally distributed with an unknown mean µ_i and common variance σ². Thus, the various scenarios considered here are all expressed in terms of hypotheses involving the µ_i. Complications arising from unequal variances, non-normal distributions, stratification, drop-outs, censoring, and so on are avoided. Such additional
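The simple Bonferroni rule and Holm's step-down modification can be sketched in a few lines. The following Python is illustrative only (the p values are made up) and is not part of the chapter:

```python
def bonferroni_reject(p_values, alpha):
    """Reject H_i whenever p_i <= alpha / N (simple Bonferroni)."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def holm_reject(p_values, alpha):
    """Holm step-down: reject H_(i) iff p_(j) <= alpha/(N - j + 1)
    for all j <= i, working through the ordered p values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (n - rank + 1):
            reject[idx] = True
        else:
            break  # once one ordered test fails, all later ones fail too
    return reject

p = [0.001, 0.20, 0.01, 0.02]           # hypothetical p values
print(sum(bonferroni_reject(p, 0.05)))  # 2 rejections at alpha/N = 0.0125
print(sum(holm_reject(p, 0.05)))        # 3 rejections: Holm is less conservative
```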
considerations are of considerable practical importance in actual trials, but their introduction here would add complexities that would obscure the salient points of emphasis.
III. MULTIPLE INDEPENDENT TREATMENTS

A. Global Null Hypothesis
Perhaps the most common setting in clinical trials with multiple treatments involves the comparison of K independent, or nominal, treatments. That is, there are K different treatments with no implied ordering or other special relationship among the treatments. Typically, K is equal to three or four, but in some trials it might be larger. In this setting, the primary hypothesis test is usually

H_0: µ_1 = µ_2 = ⋅⋅⋅ = µ_K vs. H_1: µ_i ≠ µ_j for some i, j with i ≠ j

That is, the global null hypothesis H_0 is that all of the µ_i are identical. The alternative H_1 is that at least one of the µ_i is not equal to the others. There are two general approaches in this setting:

(i) Hierarchical approach—First construct a global α-level test of H_0 (10). If H_0 is not rejected, conclude in favor of H_0 with no further tests. However, if H_0 is rejected at the first step, proceed with paired comparisons of the µ_i. The primary attraction of this approach is that control of the experimental error rate is simple and direct. A rather unsatisfying result is that in case of failure to reject the global null hypothesis, there is no further examination of the treatment differences. The conclusion is simply that the K treatments are not demonstrably different.

(ii) All possible paired comparisons—The second general approach is to conduct all possible paired comparisons (K(K − 1)/2 in number) but to do so in a way that preserves the experimental error rate α. As discussed above in connection with Bonferroni-type adjustments, this means that each comparison must be carried out at a significance level less than α. Methods for testing all possible pairs of treatments in a study have been investigated for a very long time (11,12). Some of these techniques are Tukey's honestly significant difference and wholly significant difference tests, the Newman–Keuls test, the Duncan test, and the least significant difference test. All provide approaches to analyses that control the overall type I error rate. Details are not given here.

B. Sample Size Implications

For both of the general approaches described above, there is a cost in terms of the required sample size. In the case of two treatments with σ² known, the required (equal) sample size N for each treatment is

N = 2(z_{α/2} + z_β)² / ∆²
where z_x is the 100x percentile of the standard normal distribution and ∆ = (µ_1 − µ_2)/σ, the standardized difference in means that we desire to detect with power 1 − β. In practice, N is rounded up to the nearest integer value. For example, if α = 0.05, β = 0.10, and ∆ = 0.50, then N = 85 patients are required for each treatment. If there are K > 2 treatments, the required sample size depends on the specification of the means under the alternative. Denoting the ordered means by µ_(1) ≤ ⋅⋅⋅ ≤ µ_(K), the least favorable configuration of means for a given maximum difference ∆ = (µ_(K) − µ_(1))/σ occurs when µ_(i) = (µ_(1) + µ_(K))/2 for i ≠ 1, K. This is the configuration yielding the minimum power for a fixed sample size and is thus the configuration used to determine sample size. The exact sample size can be obtained by an iterative solution to an equation involving noncentral F distributions (13), but a close approximation is the following:

N = 2{√(χ²_{1−α}(K − 1) − (K − 2)) + z_β}² / ∆²
The required N for K > 2 treatments is always greater than that required for K = 2 treatments. The relative increase is given in Table 1 for K = 3 to 10.

Table 1 Multiples of Sample Size for Two Treatments Required for K > 2 Treatments

                   Number of treatments (K)
(α, β)          3     4     5     6     7     8     9     10
(0.05, 0.20)  1.21  1.35  1.46  1.56  1.65  1.73  1.80  1.87
(0.05, 0.10)  1.18  1.30  1.40  1.48  1.55  1.62  1.68  1.73
(0.01, 0.10)  1.16  1.27  1.35  1.43  1.50  1.56  1.61  1.67

For example, with α = 0.05 and β = 0.10, we require 18% more patients per treatment for three treatments than for two treatments. Thus, if ∆ = 0.50 as in the earlier example, a two-treatment trial requires 2 × 85 = 170 patients, but a three-treatment trial requires 3 × 1.18 × 85 ≅ 301 patients, rather than the apparent, but erroneous, requirement of 3 × 85 = 255 patients based on the two-treatment situation. More complex situations are covered elsewhere (14–18). The salient point in all situations is that the number of patients per treatment group for K > 2 treatments must be increased rather substantially over the number required for two treatments, and the more treatments, the greater the increase.
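The two-treatment formula and the K = 3 approximation can be checked numerically. The sketch below (illustrative Python, not from the chapter) uses the fact that for K = 3 the χ² quantile has 2 degrees of freedom, so χ²_{1−α}(2) = −2 ln α in closed form:

```python
import math
from statistics import NormalDist

def n_two_arm(alpha, beta, delta):
    """Per-arm sample size for two treatments, sigma known, two-sided test:
    N = 2 * (z_{alpha/2} + z_beta)^2 / delta^2."""
    nd = NormalDist()
    z_a2 = nd.inv_cdf(1 - alpha / 2)
    z_b = nd.inv_cdf(1 - beta)
    return 2 * (z_a2 + z_b) ** 2 / delta ** 2

def n_three_arm(alpha, beta, delta):
    """Approximate per-arm size for K = 3 under the least favorable
    configuration; chi^2_{1-alpha}(2) = -2*ln(alpha) for 2 df."""
    chi2 = -2.0 * math.log(alpha)  # chi^2_{1-alpha}(K - 1) with K = 3
    z_b = NormalDist().inv_cdf(1 - beta)
    # Subtract (K - 2) = 1 inside the square root, per the approximation.
    return 2 * (math.sqrt(chi2 - 1.0) + z_b) ** 2 / delta ** 2

n2 = n_two_arm(0.05, 0.10, 0.50)
n3 = n_three_arm(0.05, 0.10, 0.50)
print(math.ceil(n2))      # 85 per arm, as in the text
print(round(n3 / n2, 2))  # ~1.18, the K = 3 multiple in Table 1
```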
IV. PRESPECIFIED COMPARISONS

In the previous section, it was assumed that comparison of all pairs of treatments was of interest. No specific comparison was assumed to be of more or less interest than any other comparison. In other settings, specific combinations of treatments or specific comparisons may be of primary or even exclusive interest. If this is the case, the number of comparisons can be limited and prespecified to reduce the inherent problems of multiplicity. One common example in a clinical trial setting is the use of K experimental treatments and a control or standard treatment (or perhaps both). If the control treatment mean is denoted μ0 and the experimental treatment means μi (i = 1, . . . , K), the hypotheses of interest may be, for example, μi vs. μ0, Σμi/K vs. μ0, or related hypotheses. The number of such hypotheses may be far less than K(K − 1)/2, the total number of paired comparisons. In the particular case of K experimental treatments compared with a control treatment, the hypotheses of interest are

H0: μi = μ0, i = 1, . . . , K
H1: μi ≠ μ0 for at least one i

This represents exactly K comparisons, one for each experimental treatment with the control. The experimental treatments themselves are not directly compared. The most common test procedure in this case is Dunnett's procedure (19), an example of a "many-one" test statistic. In brief, the procedure involves computing K t-statistics in the usual way:

t_i = (ȳ_i − ȳ_0)√N / (s√2)
where ȳ_i is the mean for the ith treatment group, s is the (pooled) standard deviation, and N is the number of observations in each treatment, assumed equal in this formulation. To preserve the experimental type I error rate, the statistics t_i must each be compared with a critical value larger than the value for K = 1 (i.e.,
Multiple Treatment Trials
Table 2 Critical Values for Comparing K Experimental Treatments to a Single Control Treatment Using Dunnett's Procedure (α = 0.05, two-sided tests)

                       Number of experimental treatments
Degrees of freedom      1     2     3     4     5
        10            2.23  2.57  2.76  2.89  2.99
        20            2.09  2.38  2.54  2.65  2.73
        60            2.00  2.27  2.41  2.51  2.58
       120            1.98  2.24  2.38  2.47  2.55
        ∞             1.96  2.21  2.35  2.44  2.51
one treatment, one control). This critical value is a function of N and the degrees of freedom, ν = (K + 1)(N − 1), in estimating s. Table 2 gives the required critical values for this purpose (11). Other work has extended Dunnett's procedure in various ways. For example, Chen and Simon, in a series of papers (20–22), considered comparing the experimental treatments that are found to differ from the control in the case when there is a prior preference ordering for the treatments based on considerations such as toxicity or cost.
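A minimal sketch of the computation, using hypothetical data (the function name is ours); the critical values themselves would then be taken from Table 2:

```python
import math

def dunnett_t(control, *groups):
    """Compute t_i = (ybar_i - ybar_0) * sqrt(N) / (s * sqrt(2)) for each
    experimental group, with s the pooled standard deviation on
    nu = (K + 1)(N - 1) degrees of freedom and equal group sizes N assumed."""
    samples = [control, *groups]
    N = len(control)
    means = [sum(g) / N for g in samples]
    # pooled sum of squares over all K + 1 groups
    ss = sum(sum((x - m) ** 2 for x in g) for g, m in zip(samples, means))
    s = math.sqrt(ss / (len(samples) * (N - 1)))
    return [(m - means[0]) * math.sqrt(N) / (s * math.sqrt(2)) for m in means[1:]]
```

Each t_i is then compared with the appropriate many-one critical value from Table 2 (e.g., 2.21 for two experimental treatments and large ν) rather than the ordinary value of 1.96.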
V. OTHER DESIGNS
A. Factorial Designs

Another type of design involving multiple treatments is the factorial design (23,24) in which two or more factors, each with two or more levels, are combined. For example, two different drugs, say A and B, may each be either present or absent, resulting in four different treatment groups: A only, B only, A and B, neither A nor B. This is an example of the simplest type of factorial design, the 2 × 2 factorial. If the two factors have a and b levels, the design is an a × b design and the number of treatment groups is K = ab. In general there is no limit to the number of factors and the number of levels of each factor, but in clinical trials the 2 × 2 design is the most common and introduces fewer complexities than higher level designs. Factorial designs are popular because they carry a promise of being able to exploit the relationships among the treatments in ways not possible in the case of K independent treatments. And in a clinical setting in which the factors are different treatment options (drugs, modalities, etc.) that might plausibly be administered jointly, it is natural to study their joint effects and interactions directly.
Unfortunately, as in the case of K independent treatments, it is not possible to get something for nothing. The primary issues are addressed in terms of the 2 × 2 design described earlier, in which the four treatments are defined based on the presence or absence of the treatments A and B. Table 3 gives the four possible treatment group means as μij, where i = 0, 1 and j = 0, 1 depending on the absence or presence of treatments A and B, respectively. The pooled means are simply the means pooled over the relevant categories. For example, μ1• is the mean where treatment A is present, pooled over the treatment B categories (absent, present). The "main effect" of treatment A is defined as μ1• − μ0• and the "simple main effects" of treatment A are μ10 − μ00 and μ11 − μ01, the main effects of treatment A in the absence and in the presence of treatment B. Similar definitions apply for treatment B. The primary advantage of a factorial design occurs when it can be assumed that the simple main effects of a treatment are equal. Then we can design the trial as if it were a two-treatment trial for treatment A and obtain a test for treatment B seemingly at no cost. The problem arises when the simple main effects are not equal. In this situation the effect of one treatment depends on whether or not the second treatment is present; this is called an interaction between the treatments. The model can be written

μij = μ00 + αx + βy + γxy

where x = 0, 1 and y = 0, 1 depending on the absence or presence of treatments A and B, respectively, and α, β, and γ are parameters defining the effects of A and B and their interaction, respectively. If γ = 0, there is no interaction. Thus, for example, the simple main effects of A are α and α + γ in the absence or presence of treatment B. If γ > 0, a positive interaction, the effect of treatment A is heightened when given with treatment B. If γ < 0, a negative interaction, the effect of A is decreased.
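The decomposition is purely arithmetic. A small sketch (the function name and cell means are illustrative, not from the chapter) recovers α, β, and γ from the four cell means:

```python
def factorial_effects(mu00, mu10, mu01, mu11):
    """Recover the parameters of mu_ij = mu00 + alpha*x + beta*y + gamma*x*y
    from the four cell means of a 2 x 2 factorial
    (x: treatment A present, y: treatment B present)."""
    alpha = mu10 - mu00                   # simple main effect of A without B
    beta = mu01 - mu00                    # simple main effect of B without A
    gamma = mu11 - mu10 - mu01 + mu00     # interaction
    return alpha, beta, gamma
```

For example, `factorial_effects(1.0, 2.0, 1.5, 3.5)` returns `(1.0, 0.5, 1.0)`: the simple main effect of A is α = 1.0 without B but α + γ = 2.0 with B present (3.5 − 1.5).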
Nonzero interactions are the source of potential difficulties in factorial designs. Even moderate interaction effects can have a profound impact on statistical power. For example, to detect an interaction
Table 3 Treatment Group Means in 2 × 2 Factorial Design

                       Treatment B
Treatment A     Absent    Present    Pooled
Absent           μ00       μ01        μ0•
Present          μ10       μ11        μ1•
Pooled           μ•0       μ•1        μ••
effect of the same magnitude as a treatment main effect requires four times the number of patients required to detect the main effect where no interaction is assumed (23). Smaller interactions will require more patients for reliable detection. Stated another way, a negative interaction will produce smaller overall treatment effects and reduce the power of the statistical tests. Thus, in practice, one must give careful consideration to the plausible size of an interaction and design the study accordingly. It is generally unreasonable to expect to detect a small or moderate interaction but reasonable to assume that one might exist. Simon and Freedman (24) provide a Bayesian approach (see also Green [25] in this volume).

B. Selection Designs

An additional design involving multiple treatments and occasionally used in clinical trials is a selection design (26). In this setting the purpose is to identify the "best" treatment, that is, the one with the largest or smallest value of some parameter. In the normal means case, we have an ordering of the means μ(1) ≤ μ(2) ≤ ⋅⋅⋅ ≤ μ(K), and the purpose is to identify ("select") the treatment associated with μ(K) with a high probability of correct selection P* whenever μ(K) − μ(K−1) ≥ δ*. The required sample sizes in this setting are surprisingly small for most values of δ*/σ. The most highly publicized design of this type applied to clinical trials is the randomized phase II design (27,28). Here, several treatments that might have been tested in a sequence of traditional phase II designs, one for each treatment, are instead considered in a single study with random assignment to the available treatments. At the end of the trial, the treatment with the highest response rate is selected for further testing in a phase III trial. These selection designs do not allow explicit comparisons among the treatments in the usual sense.
The smaller sample sizes required by these designs are obtained by changing the objectives of the trial. The treatment selected may be imperceptibly better than some or all of the competing treatments. Thus, such designs should be carefully applied only in very select circumstances. One example is when the treatments differ only slightly (e.g., in dose or intensity) so that the loss in selecting the wrong treatment is not excessive. If one wishes to compare the treatments in the usual way, a selection design is not appropriate. One of the designs discussed earlier should be used (see also Liu [29] in this volume).
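To see how small these sample sizes are, consider the simplest case of K = 2 normal means with known common σ: picking the arm with the larger sample mean is the correct selection with probability Φ(δ*√N / (σ√2)), which is easily inverted for N. The sketch below is illustrative (the function name is ours, not from the chapter):

```python
from math import ceil
from statistics import NormalDist

def selection_n_two_arms(p_star, delta_star_over_sigma):
    """Patients per arm so that, with two arms, the truly better arm has the
    larger sample mean with probability p_star when the standardized gap
    mu(2) - mu(1) is at least delta_star_over_sigma * sigma."""
    z = NormalDist().inv_cdf(p_star)          # z_{P*}
    return ceil(2 * (z / delta_star_over_sigma) ** 2)
```

With P* = 0.90 and δ*/σ = 0.5, only 14 patients per arm are needed, compared with the 85 per arm required in the earlier comparative example with the same standardized difference.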
VI. SUMMARY

Multiple treatment trials are more difficult to design and analyze than trials with only two treatments. Careful attention to detail and proper characterization of the objectives of the trial can minimize these difficulties. However, it is important
to recognize at the outset that such trials will of necessity be larger than two-treatment trials. In many settings the advantages may outweigh the costs, but there are always costs. The considerations in this chapter can help in an assessment of when the advantages outweigh the costs.
REFERENCES

1. Tukey JW. Some thoughts on clinical trials, especially problems of multiplicity. Science 1977; 198:679–684.
2. Koch GG, Gansky SA. Statistical considerations for multiplicity in confirmatory protocols. Drug Inform J 1996; 30:523–534.
3. Senn S. Statistical Issues in Drug Development. New York: Wiley, 1997.
4. Green S, Benedetti J, Crowley J. Clinical Trials in Oncology. New York: Chapman and Hall, 1997.
5. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. Br Med J 1995; 310:170.
6. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73:751–754.
7. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75:800–802.
8. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist 1979; 6:65–70.
9. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med 1990; 9:811–818.
10. Bauer P. Multiple testing in clinical trials. Stat Med 1991; 10:871–890.
11. Miller RG. Simultaneous Statistical Inference. New York: Springer-Verlag, 1981.
12. Klockars AJ, Sax G. Multiple Comparisons. London: Sage Publications, 1986.
13. Desu MM, Raghavarao D. Sample Size Methodology. San Diego: Academic Press, 1990.
14. George SL. The required size and length of a clinical trial. In: Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials. Oxford: Oxford University Press, 1984, pp. 287–310.
15. Makuch RW, Simon RM. Sample size requirements for comparing time-to-failure among k treatment groups. J Chron Dis 1982; 35:861–867.
16. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Control Clin Trials 1995; 16:119–130.
17. Phillips A. Sample size estimation when comparing more than two treatment groups. Drug Inform J 1998; 32:193–199.
18. Ahnn S, Anderson SJ. Sample size determination for comparing more than two survival distributions. Stat Med 1995; 14:2273–2282.
19. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 1955; 50:1096–1121.
20. Chen TT, Simon R. A multiple-step selection procedure with sequential protection of preferred treatments. Biometrics 1993; 49:753–761.
21. Chen TT, Simon RM. Extension of one-sided test to multiple treatment trials. Control Clin Trials 1994; 15:124–134.
22. Chen TT, Simon R. A multiple decision procedure in clinical trials. Stat Med 1994; 13:431–446.
23. Peterson B, George SL. Sample size requirements and length of study for testing interaction in a 2 × k factorial design when time-to-failure is the outcome. Control Clin Trials 1993; 14:511–522.
24. Simon R, Freedman LS. Bayesian design and analysis of two × two factorial clinical trials. Biometrics 1997; 53:456–464.
25. Green S. Factorial designs with time-to-event end points. In: Crowley J, ed. Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001.
26. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statistical Methodology. New York: John Wiley & Sons, 1977.
27. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treatment Rep 1985; 69:1375–1381.
28. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival. Biometrics 1993; 49:391–398.
29. Liu PY. Phase II selection designs. In: Crowley J, ed. Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001.
9
Factorial Designs with Time-to-Event End Points

Stephanie Green
Fred Hutchinson Cancer Research Center, Seattle, Washington
FACTORIAL DESIGN

The frequent use of the standard two-arm randomized clinical trial is due in part to its relative simplicity of design and interpretation. Conclusions are straightforward: Either the two arms are shown to be different or they are not. Complexities arise with more than two arms; with four arms there are six possible pairwise comparisons, 19 ways of pooling and comparing two groups, 24 ways of ordering the arms, plus the global test of equality of all four arms. Some subset of these comparisons must be identified as of interest; each comparison has power, level, and magnitude considerations; the problems of multiple testing must be addressed; and conclusions can be difficult, particularly if the comparisons specified to be of interest turn out to be the wrong ones. Factorial designs are sometimes considered when two or more treatments, each of which has two or more dose levels (possibly including level 0, i.e., no treatment), are of interest alone or in combination. A factorial design assigns patients equally to each possible combination of levels of each treatment. If treatment i, i = 1, . . . , K, has l_i levels, the result is an l1 × l2 × ⋅⋅⋅ × lK factorial. Generally, the aim is to study the effect of the levels of each treatment separately by pooling across all other treatments. The assumption often is made that each treatment has the same effect regardless of assignment to the other treatments (no interaction). There has been a fair amount of recent interest in factorial designs. Byar (1) suggested potential benefit in use of factorials for studies with low event rates,
such as screening studies. A theoretical discussion of factorials in the context of the proportional hazards model is presented by Slud (2). Other recent contributions to the topic include those by Simon and Freedman (3), who discussed Bayesian design and analysis of 2 × 2 factorials (allowing for some uncertainty in the assumption of no interaction); by Hung (4), who discussed testing first for interaction when outcomes are normally distributed and interactions occur only if there are effects of both treatment arms; and by Akritas and Brunner (5), who proposed a nonparametric approach to analysis of factorial designs with censored data (making no assumptions about interaction). To illustrate the issues in factorial designs, a simulation of a 2 × 2 factorial trial of control treatment O (control arm) vs. O plus treatment A (arm A) vs. O plus treatment B (arm B) vs. O plus A and B (arm AB) was performed. The simulated trial had 125 patients per arm accrued over 3 years with 3 additional years of follow-up. Survival was exponentially distributed on each arm, and median survival was 1.5 years on the control arm. The sample size was sufficient for a one-sided 0.05 level test of A vs. not-A to have power 0.9 with no effect of B, no interaction, and an A/O hazard ratio of 1/1.33. Various cases were considered using the usual proportional hazards model λ = λ0 exp(αzA + βzB + γzAzB): neither A nor B effective (α and β = 0), A effective (α = −ln(1.33)) with no effect of B, A effective and B detrimental (β = ln(1.5)), and both A and B effective (α and β both −ln(1.33)). Each of these was considered with no interaction (γ = 0), favorable interaction (AB hazard improved compared to expected, γ = −ln(1.33)), and unfavorable interaction (worse, γ = ln(1.33)). Table 1 summarizes the cases considered. The multiple comparisons problem is one of the issues that must be considered in factorial designs.
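The size of the problem is easy to quantify. The sketch below assumes the tests are independent, which is only approximately true for the pooled tests in a factorial, and the function name is ours:

```python
def experimentwise_level(alpha, n_tests):
    """Probability of at least one false-positive result among n_tests
    independent tests, each performed at level alpha, when all nulls hold."""
    return 1 - (1 - alpha) ** n_tests
```

Two one-sided 0.05-level tests give an experiment-wide level of about 0.0975 under independence, close to the 0.11 observed for approach 1 in the simulation results below.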
If tests of each treatment are performed at level α (typical for factorial designs; see Gail et al. [6]), then the experiment-wide level (the probability that at least one comparison will be significant under the null hypothesis) is greater than α. There is disagreement on the issue of whether all primary questions should each be tested at level α or whether the experiment-wide level across all primary questions should be α, but clearly if the probability of at least one false-positive result is high, a single positive result from the experiment will be difficult to interpret and may well be dismissed by many as inconclusive. Starting with global testing followed by pairwise tests only if the global test is significant is a common approach to limit the probability of false-positive results; a Bonferroni approach (each of T primary tests performed at level α/T) is also an option. Power issues must also be considered. From the point of view of individual tests, power calculations are straightforward under the assumption of no interaction—calculate power according to the number of patients in the combined groups (also typical; again see Gail et al. [6]). A concern even in this ideal case may be the joint power for both A and B. If power to detect a specified effect
Table 1 Arm Medians (O, A, B, AB) Used in the Simulation

Case                                      Interaction     O     A     B     AB
1—Null                                    None           1.5   1.5   1.5   1.5
                                          Unfavorable    1.5   1.5   1.5   1.13
                                          Favorable      1.5   1.5   1.5   2
2—Effect of A, no effect of B             None           1.5   2     1.5   2
                                          Unfavorable    1.5   2     1.5   1.5
                                          Favorable      1.5   2     1.5   2.67
3—Effect of A, effect of B                None           1.5   2     2     2.67
                                          Unfavorable    1.5   2     2     2
                                          Favorable      1.5   2     2     3.55
4—Effect of A, detrimental effect of B    None           1.5   2     1     1.33
                                          Unfavorable    1.5   2     1     1
                                          Favorable      1.5   2     1     1.77

Note: For each case the best arm is the one with the largest median.
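The entries in Table 1 follow directly from the exponential model: the median on each arm is ln 2 divided by the hazard λ0 exp(αzA + βzB + γzAzB). The sketch below is illustrative (the function name is ours) and takes the 1/1.33 hazard ratio as exactly 3/4, so that an effective treatment multiplies the median by 4/3:

```python
import math

def arm_medians(alpha, beta, gamma, control_median=1.5):
    """Median survival on each arm of the 2 x 2 design under the
    proportional hazards model lambda = lambda0 * exp(a*zA + b*zB + g*zA*zB)
    with exponential survival (median = ln 2 / hazard)."""
    lam0 = math.log(2) / control_median
    arms = {"O": (0, 0), "A": (1, 0), "B": (0, 1), "AB": (1, 1)}
    return {arm: round(math.log(2) /
                       (lam0 * math.exp(alpha * zA + beta * zB + gamma * zA * zB)), 2)
            for arm, (zA, zB) in arms.items()}
```

For case 4 with no interaction (α = −ln(4/3), β = ln(1.5), γ = 0) this returns medians of 1.5, 2.0, 1.0, and 1.33 for O, A, B, and AB, matching the table.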
of A is 1 − β and power to detect a specified effect of B is also 1 − β, the joint power to detect the effects of both is closer to 1 − 2β. From the point of view of choosing the best arm, power considerations are considerably more complicated. The "best" arm must be specified for the possible true configurations, the procedures for designating the preferred arm at the end of the trial (which generally is the point of a clinical trial) must be specified, and the probabilities of choosing the best arm under alternatives of interest must be calculated.

Several scenarios were considered in the simulation. The first approach is to analyze assuming there are no interactions and to do only two one-sided tests, A vs. not-A and B vs. not-B. If both A is better than not-A and B is better than not-B, then AB is assumed to be the best arm. The second approach is to test first for interaction (two-sided) using the model λ = λ0 exp(αzA + βzB + γzAzB). If the interaction term is not significant, then base conclusions on the tests of A vs. not-A and B vs. not-B. If it is significant, then base conclusions on tests of the three terms in the model and on appropriate subset tests. The treatment of choice is as follows:

Arm O if
1. γ is not significant, A vs. not-A is not significant, and B vs. not-B is not significant, or
2. γ is significant and negative (favorable interaction), α and β are not significant in the three-parameter model, and the test of O vs. AB is not significant, or
3. γ is significant and positive (unfavorable interaction) and α and β are not significant in the three-parameter model.

Arm AB if
1. γ is not significant, and A vs. not-A and B vs. not-B are both significant, or
2. γ is significant and favorable and α and β are both significant in the three-parameter model, or
3. γ is significant and favorable, α is significant and β is not significant in the three-parameter model, and the test of A vs. AB is significant, or
4. γ is significant and favorable, β is significant and α is not significant in the three-parameter model, and the test of B vs. AB is significant, or
5. γ is significant and favorable, α is not significant and β is not significant in the three-parameter model, and the test of O vs. AB is significant.

Arm A if
1. γ is not significant, B vs. not-B is not significant, and A vs. not-A is significant, or
2. γ is significant and favorable, α is significant and β is not significant in the three-parameter model, and the test of A vs. AB is not significant, or
3. γ is significant and unfavorable, α is significant and β is not significant in the three-parameter model, or
4. γ is significant and unfavorable, α and β are significant in the three-parameter model, and the test of A vs. B is significant in favor of A.

Arm B if results are similar to A above, but with the results for A and B reversed.

Arm A or Arm B if γ is significant and unfavorable, α and β are significant in the three-parameter model, and the test of A vs. B is not significant. (Try putting this into the statistical considerations of a protocol.)

The third approach is to control the overall level of the experiment by first doing an overall test of differences among the four arms and to proceed with the
second approach above only if this test is significant. If the overall test is not significant, then arm O is concluded to be the treatment of choice. The possible outcomes of a trial of O vs. A vs. B vs. AB are to recommend one of O, A, B, AB, or A or B but not AB. Tables 2–4 show the simulated probabilities of making each of the conclusions in the 12 cases of Table 1, for the approach of ignoring interaction, for the approach of testing for interaction, and for the approach of doing a global test before testing for interaction. The global test was done at the 0.05 level. Tests of A vs. not-A, B vs. not-B, O vs. A, O vs. B, O vs. AB, and model terms α and β were one-sided; other tests were two-sided. One-sided tests were done at the 0.05 level, two-sided at 0.1. In each table the probability of drawing the correct conclusion corresponds to selecting the best arm shown in Table 1.

Tables 2–4 illustrate several points. In the best case of using approach 1 when in fact there is no interaction, the experiment level is 0.11 and power when both A and B are effective is 0.79, about as anticipated, and possibly insufficiently conservative. Apart from that, approach 1 is best if there is no interaction. The probability of choosing the correct arm is reduced if approach 2 (testing first for interaction) is used instead of approach 1 in all four cases with no interaction. If there is an interaction, approach 2 may or may not be superior. If the interaction masks the effectiveness of the best regimen, it is better to test for interaction (e.g., case 4 with an unfavorable interaction, where the difference between A and not-A is diminished due to the interaction). If the interaction enhances the effectiveness of the best arm, testing is detrimental (e.g., case 4
Table 2 Simulated Probability of Conclusion with Approach 1: No Test of Interaction

                              Conclusion
Case, interaction         O      A      B      AB     A or B
1, none                 0.890  0.055  0.053  0.002    0
1, unfavorable          0.999  0.001  0      0        0
1, favorable            0.311  0.243  0.259  0.187    0
2, none                 0.078  0.867  0.007  0.049    0
2, unfavorable          0.562  0.437  0      0.001    0
2, favorable            0.002  0.627  0.002  0.369    0
3, none                 0.010  0.104  0.095  0.791    0
3, unfavorable          0.316  0.244  0.231  0.208    0
3, favorable            0      0.009  0.006  0.985    0
4, none                 0.078  0.922  0      0        0
4, unfavorable          0.578  0.422  0      0        0
4, favorable            0.002  0.998  0      0        0
Table 3 Simulated Probability of Conclusion with Approach 2: Test of Interaction

                              Conclusion                          Test for interaction,
Case, interaction         O      A      B      AB     A or B      probability of rejection
1, none                 0.865  0.060  0.062  0.005   0.008        0.116
1, unfavorable          0.914  0.036  0.033  0       0.017        0.467
1, favorable            0.309  0.128  0.138  0.424   0            0.463
2, none                 0.086  0.810  0.006  0.078   0.019        0.114
2, unfavorable          0.349  0.601  0.001  0.001   0.048        0.446
2, favorable            0.003  0.384  0.002  0.612   0            0.426
3, none                 0.009  0.089  0.089  0.752   0.061        0.122
3, unfavorable          0.185  0.172  0.167  0.123   0.353        0.434
3, favorable            0      0.004  0.003  0.990   0.002        0.418
4, none                 0.117  0.883  0      0       0            0.110
4, unfavorable          0.341  0.659  0      0       0            0.472
4, favorable            0.198  0.756  0      0.046   0            0.441
Table 4 Simulated Probability of Conclusion with Approach 3: Global Test Followed by Approach 2

                              Conclusion                          Global test,
Case, interaction         O      A      B      AB     A or B      probability of rejection
1, none                 0.972  0.011  0.010  0.003   0.004        0.052
1, unfavorable          0.926  0.032  0.026  0       0.015        0.578
1, favorable            0.503  0.049  0.057  0.390   0            0.558
2, none                 0.329  0.578  0.001  0.074   0.018        0.684
2, unfavorable          0.528  0.432  0      0       0.039        0.541
2, favorable            0.014  0.374  0.001  0.611   0            0.987
3, none                 0.068  0.069  0.063  0.741   0.059        0.932
3, unfavorable          0.466  0.067  0.072  0.109   0.286        0.535
3, favorable            0      0.004  0.003  0.990   0.002        1.00
4, none                 0.117  0.882  0      0       0            0.997
4, unfavorable          0.341  0.659  0      0       0            1.00
4, favorable            0.198  0.756  0      0.046   0            0.999
with favorable interaction, where the difference between A and not-A is larger due to the interaction whereas B is still clearly ineffective). In all cases the power for detecting interactions is poor. Even using 0.1 level tests, the interactions were detected at most 47% of the time in these simulations. Approach 3 does restrict the overall level (the probability of not choosing O when there is no positive effect of A or B or AB), but this is at the expense of a reduced probability of choosing the correct arm when the four arms are not sufficiently different for the overall test to have high power. Unfavorable interactions are particularly devastating to a study. The probability of identifying the correct regimen is poor for all methods if the correct arm is not the control arm. Approach 1, assuming there is no interaction, is particularly poor. Unfavorable interactions and detrimental effects happen: Study 8300 (similar to case 4 with an unfavorable interaction) is an unfortunate example (7). In this study in limited non-small cell lung cancer, the roles of both chemotherapy and prophylactic radiation to the brain were of interest. All patients received radiation to the chest and were randomized to receive prophylactic brain irradiation (PBI) plus chemotherapy vs. PBI vs. chemotherapy vs. no additional treatment. PBI was to be tested by combining across the chemotherapy arms and chemotherapy was to be tested by combining across the PBI arms. Investigators chose a Bonferroni approach to limit type I error: The trial design specified level 0.025 for two tests, a test of whether PBI was superior to no PBI, and a test of whether chemotherapy was superior to no chemotherapy. No other tests were specified. It was assumed that PBI and chemotherapy would not affect each other. Unfortunately, PBI was found to be detrimental to patient survival. The worst arm was PBI plus chemotherapy, followed by PBI, then no additional treatment, then chemotherapy alone.
Using the design criteria one would conclude that neither PBI nor chemotherapy should be used. With this outcome, however, it was clear that the comparison of no further treatment vs. chemotherapy was critical—but the study had seriously inadequate power for this test, and no definitive conclusion could be made concerning chemotherapy. Once you admit your K × J factorial study is not one K-arm study and one J-arm study (which happen to be in the same patients) but rather a K × J-arm study with small numbers of patients per arm, the difficulties become more evident. This is not "changing the rules" of the design; it is acknowledging the reality of most clinical settings. Perhaps in studies where A and B have unrelated mechanisms of action and are being used to affect different outcomes, assumptions of no interaction may not be unreasonable. However, in general A cannot be assumed to behave the same way in the presence of B as in the absence of B. Potential drug interactions, overlapping toxicities, differences in compliance, and so on all make it more reasonable to assume there will be differences—and with small sample sizes per group it is unlikely these will be detected.
OTHER APPROACHES TO MULTIARM STUDIES

Various approaches to multiarm studies are available. If the example study could be formulated as O vs. A, B, and AB, the problem of comparing a control vs. multiple experimental arms would apply. There is a long history of articles on this problem, for instance Dunnett (8), Marcus et al. (9), and Tang and Lin (10), focusing on appropriate global tests or on appropriate tests for subhypotheses. Liu and Dahlberg (11) discuss design and provide sample size estimates (based on the least favorable alternative for the global test) for K-arm trials with time-to-event end points. The procedure investigated (a K-sample logrank test is performed at level α, followed by α level pairwise tests if the global test is significant) has good power for detecting the difference between a control arm and the best treatment. These authors also point out the problems when power is considered in the broader sense of drawing the correct conclusions. Properties are good for this approach when each experimental arm is similar either to the control arm or to the best arm, but not when survivals are more evenly spread out among the control and other arms. Designs for ordered alternatives are another possibility (say, for this example, there are theoretical reasons to hypothesize superiority of A over B, resulting in the alternative O < B < A < AB). Liu et al. (12) propose a modified logrank test for ordered alternatives,
T = Σ_{i=1}^{K−1} L(i) / [ Σ_{i=1}^{K−1} var(L(i)) + 2 Σ_{i<j} cov(L(i), L(j)) ]^{1/2}
where L(i) is the numerator of the one-sided logrank test between the pooled groups 1, . . . , i and the pooled groups i + 1, . . . , K; this test is used as the global test before pairwise comparisons in this setting. Similar comments apply as to the more general case above, with the additional problem that the test will not work well if the ordering is misspecified. A related approach includes a preference ordering, say by expense of the regimens (which at least has a good chance of being specified correctly), and application of a "bubble sort" analysis (e.g., take the most costly arm only if it is significantly better than the rest, the second most costly only if it is significantly better than the less costly arms and the most costly is not significantly better, and so on). This approach is discussed in Chen and Simon (13). Any model assumption can result in problems when the assumptions are not correct. As with testing for interactions, testing other assumptions can be either beneficial or detrimental, with no way of ascertaining beforehand which is the case. If assumptions are tested, procedures must be specified for when the assumptions are shown not to be met, which changes the properties of the experiment and complicates sample size considerations. Southwest Oncology Group
study S8738 (14) provides an example of incorrect assumptions. This trial randomized patients to low-dose cisplatin (CDDP) vs. high-dose CDDP vs. high-dose CDDP plus mitomycin-C (with the obvious hypothesized ordering). The trial was closed approximately halfway through the planned accrual because survival on high-dose CDDP was convincingly shown not to be superior to standard-dose CDDP by the hypothesized 25% (in fact, it appeared to be worse). A beneficial effect of adding mitomycin-C to high-dose CDDP could not be ruled out at the time, but this comparison became meaningless in view of the standard-dose vs. high-dose comparison.

CONCLUSION

The motivation for simplifying assumptions in multiarm trials is clear. The sample size required to have adequate power for multiple plausible alternatives, while at the same time limiting the overall level of the experiment, is large. If power for specific pairwise comparisons is important for any outcome, then the required sample size is larger. An even larger sample size is needed if detection of interaction is of interest. To detect an interaction of the same magnitude as the main effects in a 2 × 2 trial, four times the sample size is required (15), thereby eliminating what most view as the primary advantage of factorial designs. Likely not all simplifying assumptions are wrong, but disappointing experience tells us that too many are too often wrong. Unfortunately, the small sample sizes resulting from oversimplification lead to unacceptable chances of inconclusive results (and a tremendous waste of resources). The correct balance between conservative assumptions and possible efficiencies is rarely clear. In the case of factorial designs, combining treatment arms seems to be a neat trick—multiple answers for the price of one—until you start considering how to protect against the possibility that the assumptions allowing the trick are incorrect.
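The factor of 4 can be verified from contrast variances: with n patients per cell and outcome variance σ², the main-effect contrast of the four cell means has variance σ²/n while the interaction contrast has variance 4σ²/n, so matching precision for an interaction of the same magnitude requires four times as many patients. A sketch of the arithmetic (the function name and numbers are illustrative):

```python
def contrast_variance(weights, n_per_cell, sigma2=1.0):
    """Variance of a linear contrast of the four cell means of a 2 x 2 design,
    each cell mean based on n_per_cell observations with variance sigma2."""
    return sigma2 / n_per_cell * sum(w * w for w in weights)

# Cells ordered (00, 10, 01, 11); main effect of A vs. interaction contrast.
main_effect = contrast_variance([-0.5, 0.5, -0.5, 0.5], n_per_cell=100)
interaction = contrast_variance([1, -1, -1, 1], n_per_cell=100)
```

The ratio interaction / main_effect is 4 regardless of n and σ².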
REFERENCES

1. Byar DP. Factorial and reciprocal control designs. Stat Med 1990; 9:55–64.
2. Slud E. Analysis of factorial survival experiments. Biometrics 1994; 50:25–38.
3. Simon R, Freedman L. Bayesian design and analysis of two × two factorial clinical trials. Biometrics 1997; 53:456–464.
4. Hung H. Two-stage tests for studying monotherapy and combination therapy in two-by-two factorial trials. Stat Med 1993; 12:645–660.
5. Akritas M, Brunner E. Nonparametric methods for factorial designs with censored data. J Am Stat Assoc 1997; 92:568–576.
6. Gail M, You W-C, Chang Y-S, Zhang L, Blot W, Brown L, Groves F, Heinrich J, Hu J, Jin M-L, Li J-Y, Liu W-D, Ma J-L, Mark S, Rabkin C, Fraumeni J, Xu G-W. Factorial trial of three interventions to reduce the progression of precancerous gastric lesions in Shandong, China: design issues and initial data. Control Clin Trials 1998; 19:352–369.
7. Miller T, Crowley J, Mira J, Schwartz J, Hutchins L, Baker L, Natale R, Chase E, Livingston R. A randomized trial of chemotherapy and radiotherapy for stage III non-small cell lung cancer. Cancer Ther 1998; 1:229–236.
8. Dunnett C. A multiple comparisons procedure for comparing several treatments with a control. J Am Stat Assoc 1955; 50:1096–1121.
9. Marcus R, Peritz E, Gabriel K. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63:655–660.
10. Tang D-I, Lin S. An approximate likelihood ratio test for comparing several treatments to a control. J Am Stat Assoc 1997; 92:1155–1162.
11. Liu P-Y, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Control Clin Trials 1995; 16:119–130.
12. Liu P-Y, Tsai W-Y, Wolf M. Design and analysis for survival data under order restrictions with a modified logrank test. Stat Med 1998; 17:1469–1479.
13. Chen T, Simon R. Extension of one-sided test to multiple treatment trials. Control Clin Trials 1994; 15:124–134.
14. Gandara D, Crowley J, Livingston R, Perez E, Taylor C, Weiss G, Neefe J, Hutchins L, Roach R, Grunberg S, Braun T, Natale R, Balcerzak S. Evaluation of cisplatin in metastatic non-small cell lung cancer: a phase III study of the Southwest Oncology Group. J Clin Oncol 1993; 11:873–878.
15. Peterson B, George S. Sample size requirements and length of study for testing interaction in a 2 × K factorial design when time to failure is the outcome. Control Clin Trials 1993; 14:511–522.
10
Therapeutic Equivalence Trials

Richard Simon
National Cancer Institute, National Institutes of Health, Bethesda, Maryland
I. INTRODUCTION
The objective of a therapeutic equivalence trial is generally to demonstrate that a new treatment is equivalent to a standard therapy with regard to a specified clinical end point. The new treatment may be less invasive and less debilitating or it may be more convenient. Consequently, if it is equivalent to the standard with regard to the primary efficacy end point, it would be attractive to patients. Usually, however, one is willing to exchange only very small reductions in efficacy for the advantages in secondary end points. Therapeutic equivalence trials are contrasted to bioequivalence trials where the objective is to demonstrate equivalence of serum concentrations of the active moiety. In some therapeutic equivalence trials called active control trials, investigators would like to demonstrate that the new treatment is effective compared with no treatment, but because use of a no-treatment arm is not feasible, they attempt to demonstrate therapeutic equivalence to a standard treatment. In this chapter we review some of the problems with therapeutic equivalence trials and provide recommendations for the design and analysis of such trials.
II. PROBLEMS WITH THERAPEUTIC EQUIVALENCE TRIALS

One inherent problem is that it is impossible to demonstrate equivalence. If the outcomes for the two treatments are very different, then one can conclude that the two treatments are not therapeutically equivalent. In the absence of demonstrating lack of equivalence, however, one can only conclude that the results are consistent with only small differences. In conventional trials, rejection of the null hypothesis is usually established with substantial statistical reliability, and rejection of the null hypothesis leads to change in the treatment of future patients. The implications of failure to reject the null hypothesis are often more difficult to interpret. Failure to reject the null hypothesis in conventional trials generally does not lead to change in medical practice, however, and hence the ambiguity associated with its interpretation is in some sense of less concern. For therapeutic equivalence trials the situation is quite different. Failure to demonstrate nonequivalence is the ambiguous outcome, and it is the outcome that leads to change in the treatment of future patients. This is not always the case, but it is for a large class of therapeutic equivalence trials in which a standard effective treatment is compared with a shorter, lower-dose, or less invasive regimen. Failure to demonstrate nonequivalence is often interpreted as a demonstration of therapeutic equivalence and grounds for adoption of the new regimen. Many problems with therapeutic equivalence trials are associated with reasons why failure to demonstrate nonequivalence should not be interpreted as demonstration of equivalence. Someone once said that everything looks like a nail to a person whose only tool is a hammer. Overreliance on statistical significance testing is one of the problems with the conduct of therapeutic equivalence trials. Failure to reject the null hypothesis may be a result of inadequate sample size and is not grounds for concluding equivalence.
Unfortunately, closeness of sample means or sample distributions and nonsignificant p values are convincing to a large part of the medical audience. The sample size for clinical trials is often determined on practical grounds, based on patient availability over a limited time period or on the funding available. Consequently, many trials are undersized. This is a particular problem for therapeutic equivalence trials because it leads to failure to demonstrate nonequivalence as a result of inadequate statistical power. A related problem with therapeutic equivalence trials is that large sample sizes are often needed. For example, consider a cancer trial evaluating tumor resection as an alternative to amputation of the organ containing the tumor, in a setting where amputation is the standard therapy known to be curative in a large number of cases. Tumor resection may have clear advantages with regard to
quality of life, but many patients would be interested in these advantages only if they were assured that the chance for cure they might give up would be very small. Hence, the appropriate trial should focus on distinguishing the null hypothesis that the new treatment is no worse than the standard, i.e., Δ ≤ 0, from the hypothesis that the standard treatment is superior by at least some very small amount δ, i.e., Δ ≥ δ. Consequently, this trial would have to be quite large. Some therapeutic equivalence trials compare a treatment regimen E to a control C when the advantage of C over placebo P or no treatment is small. Such trials must also be very large because they must demonstrate that the difference in efficacy between E and C is no greater than a fraction of the difference between C and P. Another difficulty with the therapeutic equivalence trial is that there is no internal validation of the assumption that the control C is actually effective for the patient population at hand. It is not enough for E to be therapeutically equivalent to C; we want equivalence coupled with the effectiveness of E and C relative to P. A related problem is the difficulty in selecting the difference δ to be distinguished from the null hypothesis. In general, the difference δ should represent the largest difference in efficacy of the standard treatment C that a patient is willing to give up for the secondary benefits of the experimental treatment E. The difference δ must be no greater than the efficacy of C relative to P and will in general be a fraction of this quantity, which we denote δ_C. Estimation of δ_C requires review of the clinical trials that established the effectiveness of C relative to P. δ_C should not be taken as the maximum likelihood estimate (mle) of the treatment effect from such trials, because there is substantial probability that the true treatment effect in those trials was less than the point mle.
We discuss later in this chapter quantitative methods for utilizing information that demonstrates the effectiveness of C relative to P in planning a therapeutic equivalence trial.
III. DESIGN AND ANALYSIS

Of the methods that have been proposed for the design and analysis of therapeutic equivalence trials, the one based on confidence intervals has important advantages (1–3). As indicated above, a significance test of the null hypothesis can provide a very misleading interpretation of the results of a therapeutic equivalence trial. Testing an alternative hypothesis has been proposed as an alternative method (4). For an experiment that is too small to be informative, a test of an alternative hypothesis will be less likely to be interpreted as supporting therapeutic equivalence, but this approach will still not indicate the basic inadequacy of the experiment. A confidence interval for the difference in efficacy between E
and C will indicate the range of values that are consistent with the data. A confidence interval is more informative than a hypothesis test and a statement of statistical power. Statistical power is relevant for planning the study, but it is not a good parameter for interpreting the results of a study because it ignores the data actually obtained. A confidence interval incorporates both the size of the study and the results obtained. The well-known study by Freiman et al. (5) tabulated the power of 71 trials reported as negative. They found that 50 of these 71 trials had power less than 0.9 for detecting a 50% treatment effect. If one computes approximate confidence intervals from those trials, however, one finds that 16 of these 50 trials with inadequate statistical power have confidence limits that exclude 50% treatment effects and hence are definitively negative. Freiman et al. focused attention on the statistical power of trials that claimed to be "negative." This was useful, but calculation of confidence intervals for treatment differences is a much more relevant and informative means of analysis. To encourage physicians to use such confidence intervals, Simon (2) showed how to calculate approximate confidence intervals for commonly encountered end points. If one decides to use a confidence interval as the method of analysis, the questions of one-sided versus two-sided intervals and of the confidence coefficient arise. Therapeutic equivalence trials are by nature asymmetric with regard to E and C. We are generally interested in whether E is about the same as or substantially worse than C. Often, there is information that makes it less likely that E will be superior to C. In any case, clinical decision making may be the same whether E is equivalent to C or superior to C. Consequently, one-sided confidence intervals are often justified.
Since for some therapeutic equivalence trials it is possible that E is superior to C, two-sided confidence intervals may also be desirable. I have recommended symmetric two-sided 90% confidence intervals in many cases. This provides the same upper limit for the C − E difference as a one-sided 95% confidence interval and also provides a lower limit for evaluating whether E is actually superior to C. Another alternative is the 1½-sided confidence interval, in which the lower limit is extended; for example, a 1½-sided 95% confidence interval would have 3⅓% area above the upper limit and 1⅔% area below the lower limit. Let Δ̂ denote the mle of the difference in treatment effects C − E. We will assume that a positive value favors C and that Δ̂ is approximately normal with mean Δ and standard deviation σ. An upper 1 − α level confidence limit for Δ is approximately

Δ_up = Δ̂ + z_{1−α} σ   (1)
In planning the trial a value δ must be specified, where δ represents the largest true C − E difference consistent with therapeutic equivalence. Once the data are obtained, if Δ_up < δ, then one concludes that E is therapeutically equivalent to
C. Whether or not this condition is achieved, the confidence interval provides the range of relative effectiveness consistent with the data. There are several approaches to planning the size of the study using the confidence interval as the basis for analysis. All the methods are based, however, on the fact that the σ that occurs in Eq. (1) is a function of the sample size. In the case of survival comparisons with proportional hazards, σ is a function of the number of events observed. One approach is to set σ so that under the null hypothesis, the probability that Δ_up < δ is a specified value 1 − β. If σ is independent of the value of Δ, this leads to the familiar condition for sample size planning with normal distributions that

δ/σ = z_{1−α} + z_{1−β}   (2)
For the two-sample normal case, σ = √(2σ_0²/n), where σ_0 is the standard deviation per observation and n is the sample size per treatment group. For the two-sample normal case, this approach provides the same sample size as does the hypothesis testing framework, with proper definition of α and β. For the two-sample binomial or the two-sample time-to-event case, the correspondence is not exact because of the dependence of σ on Δ. The correspondence is approximately the same, however. For example, Eq. (2) can be used for sample size planning in the two-sample time-to-event case with the approximation σ = √(4/total events). An alternative approach to sample size planning is to use a symmetric two-sided confidence interval and require that

Pr_{Δ=0}[Δ̂ + z_{1−α}σ > δ] = β and Pr_{Δ=δ}[Δ̂ − z_{1−α}σ < 0] = β   (3)
In the case where σ is independent of Δ, satisfying condition (2) automatically satisfies both parts of condition (3). A more stringent approach to sample size planning is to require that the width of the two-sided confidence interval be of size δ. This ensures that the confidence interval will always exclude either 0 or δ. It requires substantially more patients, however. Interim analysis using confidence intervals and group sequential methods is described by Durrleman and Simon (6).
IV. ANALYSIS TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS

In an important sense, none of the above approaches represents a satisfactory statistical framework for the design and analysis of therapeutic equivalence trials. These approaches depend on the specification of a minimal difference δ in efficacy that one is willing to tolerate. None of the approaches deals with how δ is determined. Fleming (7) and Gould (8,9) have noted that the design and interpretation of equivalence trials must utilize information about previous trials of the active control. Fleming proposed that the new treatment be considered effective if an upper confidence limit for the amount by which the new treatment may be inferior to the active control does not exceed a reliable estimate of the improvement of the active control over placebo or no treatment. Gould provided a method for creating a synthetic placebo control group based on previous trials comparing the active control to placebo. Simon presented a general Bayesian approach to the utilization of information from previous trials in the design and analysis of an equivalence trial (10). Two major objectives can be distinguished. The first is to determine whether the experimental treatment is effective relative to P. This requires explicit use of prior information about outcomes of trials comparing P to the active control. Meaningful interpretation of active control trials is impossible without consideration of such information. Establishing whether or not the experimental treatment is effective relative to P is a first requirement. The second objective is to determine whether any medically important portion of the treatment effect for the active control is lost with the experimental treatment. In some cases this objective is unrealistic because the size of the treatment effect (relative to P) for the active control is imprecisely determined. We use the following model: y = α + βx + γz + ε, where y denotes the response of a patient, x = 0 for placebo or the experimental treatment and 1 for the control treatment, z = 0 for placebo or the control treatment and 1 for the experimental treatment, and ε is normally distributed experimental error.
Hence the expected response for C is α + β, the expected response for E is α + γ, and the expected response for P is α. The likelihood function for the data (D) from the active controlled trial can be expressed as π(D|α, β, γ) ∝ π(y_c|α, β)π(y_e|α, γ), where the first factor is the likelihood of the data for the control group and the second factor is the likelihood of the data for the experimental group. We use the notation π(·) informally to denote either the probability density of observable data, the prior probability density of a parameter, or the posterior density of a parameter. The first factor is N(α + β, σ²) and the second factor is N(α + γ, σ²), where σ is the standard error for the observed means. We assume that σ² is known, although it will generally be estimated. For the large sample sizes appropriate for active control trials, the additional variability caused by uncertainty in σ² should be very small. This assumption enables us to obtain simple analytical results, but a more exact treatment is possible using posterior distribution sampling methods. The posterior distribution of Θ = (α, β, γ) has density proportional to π(D|α, β, γ)π(Θ). We shall assume that the parameters have independent normal prior densities π(α) ~ N(µ_α, σ_α²), π(β) ~ N(µ_β, σ_β²), and π(γ) ~ N(µ_γ, σ_γ²). Hence,
the posterior distribution of Θ is π(Θ|D) ∝ π(y_c|α, β)π(y_e|α, γ)π(α)π(β)π(γ). The posterior distribution can be shown to be multivariate normal. The covariance matrix is

Σ = (σ²/K) ( (1 + r_β)(1 + r_γ)   −(1 + r_γ)                 −(1 + r_β)               )
           ( −(1 + r_γ)           r_γ + (1 + r_α)(1 + r_γ)   1                        )
           ( −(1 + r_β)           1                          r_β + (1 + r_α)(1 + r_β) )   (4)

where r_α = σ²/σ_α², r_β = σ²/σ_β², r_γ = σ²/σ_γ², and K = r_α(1 + r_β)(1 + r_γ) + r_β(1 + r_γ) + (1 + r_β)r_γ. The mean vector η = (η_α, η_β, η_γ) of the posterior distribution is

η_α = [r_α(1 + r_β)(1 + r_γ)µ_α + r_β(1 + r_γ)(y_c − µ_β) + r_γ(1 + r_β)(y_e − µ_γ)] / K

η_β = [r_β(r_γ + (1 + r_α)(1 + r_γ))µ_β + r_α(1 + r_γ)(y_c − µ_α) + r_γ(y_c − y_e + µ_γ)] / K

η_γ = [r_γ(r_β + (1 + r_α)(1 + r_β))µ_γ + r_α(1 + r_β)(y_e − µ_α) + r_β(y_e − y_c + µ_β)] / K   (5)
This indicates that the posterior mean of α is a weighted average of three estimates of α. The first estimate is the prior mean µ_α. The second estimate is the observed y_c minus the prior mean for β; this makes intuitive sense since the expectation of y_c is α + β. The third estimate in the weighted average is the observed y_e minus the prior mean for γ; the expectation of y_e is α + γ. The sum of the weights is K. The other posterior means are similarly interpreted. The marginal posterior distribution of γ is normal with mean η_γ and variance the (3, 3) element of Σ. The parameter γ represents the contrast of experimental treatment versus placebo. One can thus easily compute the posterior probability that γ > 0, which would be a Bayesian analog of a statistical significance test of the null hypothesis that the experimental regimen is no more effective than placebo (if negative values of the parameter represent effectiveness). The posterior distribution of γ − kβ is univariate normal with mean η_γ − kη_β and variance Σ_33 + k²Σ_22 − 2kΣ_23. Consequently, one can also easily compute the posterior probability that γ − kβ ≤ 0. For k = 0.5, if β < 0 this represents the probability that the experimental regimen is at least half as effective as the active control. Since there may be positive probability that β > 0, it is more appropriate to compute the joint probability that β < 0 and γ − kβ ≤ 0 to represent the probability that the experimental regimen is at least 100k% as effective as the active control.
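These posterior probability statements are one-line computations once η and Σ are in hand. A minimal sketch (the function names are mine); the numbers used below to exercise the formulas are borrowed from the final column of Table 4 in the example of Section VI.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf

def prob_effective(eta, Sigma):
    """Posterior Pr(gamma < 0): experimental treatment effective relative to
    placebo (negative parameter values represent effectiveness)."""
    return Phi((0.0 - eta[2]) / sqrt(Sigma[2][2]))

def prob_fraction_k(eta, Sigma, k=0.5):
    """Posterior Pr(gamma - k*beta <= 0): gamma - k*beta is normal with mean
    eta_gamma - k*eta_beta and variance S_33 + k^2 S_22 - 2k S_23."""
    mean = eta[2] - k * eta[1]
    var = Sigma[2][2] + k**2 * Sigma[1][1] - 2 * k * Sigma[1][2]
    return Phi((0.0 - mean) / sqrt(var))
```

With η_γ = −0.056 and √Σ_33 = 0.066, for example, prob_effective gives about 0.80, the value quoted later in the chapter. The joint probability involving β < 0 would require the bivariate normal; the marginal version above is a close approximation when β < 0 is nearly certain.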
In the special case where noninformative prior distributions are adopted for α and γ, one obtains

Σ = σ_β² ( 1 + r_β      −1   −(1 + r_β) )
         ( −1           1    1          )
         ( −(1 + r_β)   1    1 + 2r_β   )   (6)
In this case the posterior distribution of β is N(µ_β, σ_β²), the same as the prior distribution; the posterior distribution of γ is N(µ_β + y_e − y_c, σ_β² + 2σ²); and the posterior distribution of α is N(y_c − µ_β, σ_β² + σ²). It can be seen that the clinical trial comparing C to E contains information about α if an informative prior distribution is used for β. One may permit correlation among the prior distributions. Let S denote the covariance matrix for the multinormal prior distribution of (α, β, γ). Then Σ⁻¹ = M + S⁻¹, where
M = (1/σ²) ( 2  1  1 )
           ( 1  1  0 )
           ( 1  0  1 )   (7)

and the posterior mean vector is the solution of Σ⁻¹η = (1/σ²)(y_•, y_c, y_e)′ + S⁻¹µ′, where µ = (µ_α, µ_β, µ_γ) and y_• = y_c + y_e. The above results can be applied to binary outcome data by approximating the log odds of failure by a normal distribution. The approach can also be extended in an approximate manner to the proportional hazards model. Let the hazard be written as λ(t) = λ_0(t) exp(βx + γz), where λ_0(t) denotes the baseline hazard function and the indicator variables x and z are as described above. The data will be taken as the maximum likelihood estimate of the log hazard ratio for E relative to C for the active control study and will be denoted by y. Thus, for large samples y is approximately normally distributed with mean γ − β and variance σ² = 1/d_C + 1/d_E, where the d's are the numbers of events observed on C and E, respectively. Using normal priors for β and γ as above, the same reasoning results in the posterior distribution of the parameters (β, γ) being approximately normal with mean η = (η_β, η_γ) and covariance matrix Σ = (λ_ij)⁻¹, with

λ_11 = 1/σ² + 1/σ_β²
λ_22 = 1/σ² + 1/σ_γ²
λ_12 = −1/σ²
and mean vector determined by

Λη = ( µ_β/σ_β² − y/σ² )
     ( y/σ² + µ_γ/σ_γ² )
If a noninformative prior is used for γ, then λ_22 = −λ_12, and we obtain that the posterior distribution of β is N(µ_β, σ_β²), the same as the prior distribution. In this case the posterior distribution of γ is N(µ_β + y, σ_β² + σ²). The posterior covariance of β and γ is σ_β². Hence, the posterior probability that the experimental treatment is effective relative to placebo is Φ(−(µ_β + y)/√(σ_β² + σ²)).
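All of these updates share the "posterior precision = data precision + prior precision" form, which makes them easy to check numerically. The sketch below (function name mine) implements the three-parameter update Σ⁻¹ = M + S⁻¹ with M from Eq. (7), assuming independent normal priors; with nearly flat priors on α and γ it reproduces the closed-form special case quoted above (the numeric values in the test are hypothetical).

```python
import numpy as np

def posterior_3param(yc, ye, sigma2, mu, prior_sd):
    """Posterior of (alpha, beta, gamma) for y_c ~ N(alpha+beta, sigma2),
    y_e ~ N(alpha+gamma, sigma2), with independent normal priors N(mu, prior_sd^2).
    Implements Sigma^{-1} = M + S^{-1}, with M as in Eq. (7)."""
    M = np.array([[2.0, 1, 1], [1, 1, 0], [1, 0, 1]]) / sigma2
    S_inv = np.diag(1.0 / np.square(prior_sd))
    Lam = M + S_inv                       # posterior precision
    rhs = np.array([yc + ye, yc, ye]) / sigma2 + S_inv @ np.asarray(mu)
    Sigma = np.linalg.inv(Lam)            # posterior covariance
    return Sigma @ rhs, Sigma             # posterior mean, covariance
```

With near-flat priors on α and γ, the returned posterior gives mean µ_β + y_e − y_c and variance σ_β² + 2σ² for γ, exactly as stated in the text.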
V. PLANNING TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS
A minimal objective of the active controlled trial is to determine whether or not E is effective relative to P. Hence, we might require that if γ = β, then it should be very probable that the trial will result in data y = (y_e, y_c) such that Pr(γ < 0|y) > 0.95, where γ < 0 represents effectiveness of the experimental treatment. Thus, we want

Pr[η_γ/√Σ_33 < −1.645] ≥ ξ   (8)
where η_γ and Σ_33 are the posterior mean and variance of γ, the probability is calculated assuming γ = β and that β is distributed according to its prior distribution, and ξ is some appropriately large value such as 0.90. The posterior mean η_γ is a linear combination of the data and is thus itself normally distributed, with mean and variance denoted by ρ_γ and ζ_γ, respectively. Thus, Eq. (8) can be written

(−1.645√Σ_33 − ρ_γ)/√ζ_γ = z_ξ   (9)
where z_ξ is the 100ξth percentile of the standard normal distribution. When γ = β,

ρ_γ = Σ_31(2µ_β/σ² + µ_α/σ_α²) + Σ_32(µ_β/σ² + µ_β/σ_β²) + Σ_33(µ_β/σ² + µ_γ/σ_γ²)   (10)

and

ζ_γ = Σ_31(2 + µ_α/σ_α²) + Σ_32(1 + µ_β/σ_β²) + Σ_33(1 + µ_γ/σ_γ²)   (11)
Hence, one can determine the value of σ² that satisfies Eq. (9). σ² represents the variance of the means y_e and y_c and hence is inversely proportional to the sample size per treatment arm in the active controlled trial. In the special case where noninformative prior distributions are adopted for α and γ, that is, σ_α² = σ_γ² → ∞, the above results simplify: the mean of the predictive distribution is ρ_γ = µ_β with predictive variance ζ_γ = 2σ². Using these results in Eq. (9) and simplifying yields

[−1.645√(1 + 2σ²/σ_β²) − µ_β/σ_β] / √(2σ²/σ_β²) = z_ξ   (12)
The trial may be sized by finding the value of σ² that satisfies Eq. (12). It is of interest that µ_β/σ_β is the "z value" for the evaluation of the active control versus placebo. The required sample size for the active control trial is very sensitive to that z value. For example, suppose that µ_β/σ_β = 3. This represents substantial evidence that the active control is indeed effective relative to placebo. In this case, for ξ = 0.8 one requires the ratio r = σ²/σ_β² = 0.4 for Eq. (12) to be satisfied. Since σ_β² is known, and since σ² represents the variance of the mean response per treatment arm in the active controlled trial, the sample size per arm can be determined. Alternatively, if there is less substantial evidence for the effectiveness of the active control, for example µ_β/σ_β = 2, then one requires the ratio r = σ²/σ_β² = 0.05 to satisfy Eq. (12). This represents eight times the sample size required for the case when µ_β/σ_β = 3. When the evidence for the effectiveness of the active control is marginal, the active control design is neither feasible nor appropriate. For the binary response approximation described in Section IV, we have approximately σ² = 1/(npq), where n is the sample size per treatment group in the active control trial. If there is one previous randomized trial of active control versus placebo on which to base the prior distribution of β, then we have approximately σ_β² = 2/(n_0pq), where n_0 denotes the average sample size per treatment group in that trial. Consequently, σ²/σ_β² = n_0/(2n). If µ_β/σ_β = 3, then n_0/(2n) = 0.4, that is, n = 1.25n_0, and the sample size required for the active control trial is 25% larger than that required for the trial demonstrating the effectiveness of the active control. On the other hand, if µ_β/σ_β = 2, then n_0/(2n) = 0.05, that is, n = 10n_0. Planning the trial to demonstrate that the new regimen is effective compared with placebo seems a minimal requirement.
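Equation (12) has no closed-form solution for the ratio r = σ²/σ_β², but a few lines of bisection recover the values quoted above (a sketch of my own; the chapter does not prescribe a particular root-finding method; ξ = 0.8 as in the text).

```python
from math import sqrt
from statistics import NormalDist

def ratio_for_eq12(z_active, xi=0.8, alpha=0.05):
    """Solve Eq. (12) by bisection for r = sigma^2/sigma_beta^2.
    z_active is mu_beta/sigma_beta in absolute value, the 'z value' for the
    evidence that the active control is effective relative to placebo."""
    z = NormalDist().inv_cdf
    z_alpha, z_xi = z(1 - alpha), z(xi)

    def f(r):
        # left-hand side of Eq. (12) minus z_xi; decreasing in r near the root
        return (-z_alpha * sqrt(1 + 2 * r) + z_active) / sqrt(2 * r) - z_xi

    lo, hi = 1e-9, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

ratio_for_eq12(3.0) returns about 0.42 and ratio_for_eq12(2.0) about 0.05, matching the text's rounded values of 0.4 and 0.05 and the roughly eightfold difference in required sample size between the two cases.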
As indicated above, even establishing that objective may not be feasible unless the data demonstrating the effectiveness of the active control are definitive. One can be more ambitious and plan the trial to ensure with high probability that the results will support the conclusion that the new treatment is at least 100k% as effective as the active control when
in fact the new treatment is equivalent to the active control. That is, we would require that Pr(γ < kβ|y) > 0.95. To achieve this, one obtains instead of Eq. (9) the requirement

[−1.645√((1 − k)² + 2σ²/σ_β²) − (1 − k)µ_β/σ_β] / √(2σ²/σ_β²) = z_ξ   (13)
VI. EXAMPLE

As an example of the analysis of therapeutic equivalence trials, we consider two recently reported clinical trials of bolus t-PA (tissue plasminogen activator) for lysis of coronary artery thrombosis. Both trials, GUSTO III and COBALT, compared t-PA administered in two boluses separated by 30 min to standard t-PA administered in an accelerated infusion over 90 min (11,12). Heparin was administered intravenously in all cases. The GUSTO III trial used a recombinant mutant version of t-PA for the bolus group. Infusion t-PA was considered the standard treatment, but bolus administration is more convenient. Thirty-day mortality results for the COBALT and GUSTO III trials are shown in Tables 1 and 2. In COBALT, the 30-day mortality for bolus was higher than that for infusion, but the difference was not statistically significant. The investigators concluded that "double-bolus alteplase was not shown to be equivalent, according to the prespecified criteria, to accelerated infusion with regard to 30-day mortality. There was also a slightly higher rate of intracranial hemorrhage with the double-bolus method. Therefore, accelerated infusion of alteplase over a period of 90 minutes remains the preferred regimen." The results of GUSTO III were similar to those of COBALT. The 30-day mortality for the bolus arm was slightly, but not statistically significantly, higher than for the infusion arm. In contrast to the COBALT result, the investigators implied that the two regimens were equivalent, although they indicated that the trial was not sized for demonstrating therapeutic equivalence since they expected the bolus regimen to be superior.
Table 1  COBALT

                           n (planned)   n (actual)   30-day mortality (%)
t-PA + IV heparin          4029          3584         7.53
Bolus t-PA + IV heparin    4029          3595         7.98
Table 2  GUSTO III

                               n        30-day mortality (%)
t-PA + IV heparin              4,921    7.24
Bolus reteplase + IV heparin   10,138   7.47
Using the logit approximation, the log odds ratio of 30-day mortality for the infusion regimen relative to the bolus regimen was −0.0621 with a standard error of 0.088 for COBALT and −0.0341 with a standard error of 0.067 for GUSTO III. A weighted average of these two results gives a log odds ratio of −0.044 with standard error 0.053. The negative value reflects an odds ratio of 0.957, slightly favoring the infusion regimen. A 95% two-sided confidence interval for the log odds ratio is (−0.148, 0.06), which corresponds to a confidence interval for the odds ratio of (0.862, 1.062). The lower limit corresponds to approximately 14% lower odds of 30-day mortality for the standard infusion regimen compared with the bolus regimen. The two arms of GUSTO I using infusion t-PA gave an average 30-day mortality rate of 6.65% based on 20,672 patients (13) (Table 3). The other two arms, involving streptokinase (SK), gave an average 30-day mortality of 7.30% based on 20,173 patients. The odds ratio for infusion t-PA relative to SK is 0.9046 and the log odds ratio is −0.10027 with a standard error of 0.039. The Z value for this comparison is 2.57, and an approximate 95% confidence interval for the odds ratio is (0.838, 0.976). Since the point estimate of the odds ratio for 30-day mortality for infusion t-PA versus SK is 0.9046, there is about a 50% chance that the reduction in risk is less than 10%.
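These summaries can be reconstructed, to rounding, from the published event rates. A sketch using the standard empirical log odds ratio and inverse-variance pooling (function names mine; small discrepancies in the third decimal reflect rounding of the published percentages):

```python
from math import log, sqrt, exp

def log_odds_ratio(p1, n1, p2, n2):
    """Log odds ratio (group 1 relative to group 2) with the usual
    1/a + 1/b + 1/c + 1/d standard error, from proportions and sample sizes."""
    a, b = p1 * n1, (1 - p1) * n1   # events / non-events, group 1
    c, d = p2 * n2, (1 - p2) * n2
    return log((a / b) / (c / d)), sqrt(1/a + 1/b + 1/c + 1/d)

def pool(estimates):
    """Inverse-variance (precision-weighted) average of (estimate, se) pairs."""
    w = [1 / se**2 for _, se in estimates]
    est = sum(wi * e for wi, (e, _) in zip(w, estimates)) / sum(w)
    return est, sqrt(1 / sum(w))

# Infusion arm first, bolus arm second, so negative values favor infusion
cobalt = log_odds_ratio(0.0753, 3584, 0.0798, 3595)   # COBALT, Table 1
gusto3 = log_odds_ratio(0.0724, 4921, 0.0747, 10138)  # GUSTO III, Table 2
pooled, pooled_se = pool([cobalt, gusto3])            # about -0.044 and 0.053
```

exp(pooled) is then about 0.957, the pooled odds ratio slightly favoring infusion.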
We incorporated the COBALT and GUSTO III data in a two-step manner. First, we summarized the result of COBALT using the empirical logit transform, represented by y_c = −2.507, y_e = −2.445, and σ = 0.0625. The standard deviation was computed as the average of the standard deviations for the two treatment arms. Using these data, we computed the posterior distributions of the parameters. These results are shown in Table 4. We then summarized the results
Table 3  GUSTO I

                           Sample size   30-day mortality (%)
t-PA + IV heparin          10,344        6.3
SK + IV heparin            10,377        7.4
SK + SC heparin            9,796         7.2
t-PA + SK + IV heparin     10,328        7.0
of GUSTO III in a similar manner as y_c = −2.5512, y_e = −2.517, and σ = 0.0472. In this study the sample sizes for the two arms are quite different, and it would be more accurate to generalize the Bayesian approach to account for this; we have, however, approximated using an average standard deviation. For this second step of the analysis we used the posterior distribution obtained from the COBALT data as the prior distribution for incorporating the GUSTO III data. It should be noted that there are substantial correlations among the parameters in the posterior distribution obtained from the COBALT data, and hence the generalized formula (7) was used. The last column of Table 4 shows the approximate posterior distributions obtained after incorporating both the COBALT and GUSTO III data. From the mean and standard deviation of the posterior distribution of γ we can compute that the posterior probability that γ < 0, that is, that bolus t-PA is more effective than SK, is 0.80. Hence, these data provide only suggestive, not definitive, evidence that bolus t-PA is even more effective than SK. We also computed the posterior probability that both β < 0 and γ < 0.5β. This can be interpreted as the probability that infusion t-PA is more effective than SK and that bolus t-PA is at least 50% as effective as infusion t-PA. This probability was 0.54. Hence there appears to be little evidence from these trials that bolus t-PA is at least 50% as effective as infusion t-PA relative to SK.
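The first step of this two-step analysis can be reproduced directly from the precision update Σ⁻¹ = M + S⁻¹ of Section IV. The sketch below recovers, to rounding, the "After COBALT" column of Table 4:

```python
import numpy as np

# COBALT summarized on the empirical logit scale
yc, ye, sigma = -2.507, -2.445, 0.0625
sigma2 = sigma**2

# Priors: nearly flat for alpha and gamma; GUSTO I-based for beta
mu = np.array([0.0, -0.10, 0.0])
prior_sd = np.array([10.0, 0.039, 10.0])

# Posterior precision = data information (Eq. 7) + prior precision
M = np.array([[2.0, 1, 1], [1, 1, 0], [1, 0, 1]]) / sigma2
S_inv = np.diag(1.0 / prior_sd**2)
Lam = M + S_inv
rhs = np.array([yc + ye, yc, ye]) / sigma2 + S_inv @ mu

Sigma = np.linalg.inv(Lam)
eta = Sigma @ rhs             # posterior means of (alpha, beta, gamma)
sd = np.sqrt(np.diag(Sigma))  # posterior standard deviations
```

Running the same update again with the GUSTO III summaries, using the full posterior covariance from this step as the prior S, yields the final column of Table 4.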
Table 4 Distribution of Parameters

Parameter       Prior           After COBALT       After COBALT and GUSTO III
α: mean ± SD    0 ± 10          −2.41 ± 0.074      −2.44 ± 0.054
β: mean ± SD    −0.10 ± 0.039   −0.10 ± 0.039      −0.10 ± 0.039
γ: mean ± SD    0 ± 10          −0.038 ± 0.0966    −0.056 ± 0.066
ρ_αβ            0               −0.53              −0.72
ρ_αγ            0               −0.76              −0.82
ρ_βγ            0               0.40               0.59
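The marginal posterior probability quoted in the text can be reproduced directly from the Table 4 summaries; a minimal sketch:

```python
from statistics import NormalDist

# Marginal posterior for gamma after incorporating both COBALT and GUSTO III
# (last column of Table 4): mean -0.056, SD 0.066.
p_gamma_lt_0 = NormalDist(-0.056, 0.066).cdf(0.0)  # P(bolus t-PA more effective than SK)
```

This reproduces the value 0.80 quoted in the text. The joint probability that β < 0 and γ < 0.5β (quoted as 0.54) additionally requires the posterior correlation ρ_βγ = 0.59 and a bivariate normal calculation, which is not shown here.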
Simon
One can obtain from Eq. (13) the size of the clinical trial needed to establish that a regimen is at least 50% as effective as infusion t-PA relative to SK. We used Eq. (13) with Z = −2.57 from GUSTO I. With k = 0.5 we found that a ratio R = σ²/σ_β² of 0.059 is required to make the right-hand side equal 0.84, corresponding to 80% power. This means that the sample size required for the planned equivalence trial should be 1/0.059, or about 17 times the size of GUSTO I. Even to perform an equivalence trial for establishing indirectly that a regimen is more effective than SK (k = 0), one obtains that a ratio R of 0.235 is required for 80% power. This corresponds to a sample size 4.25 times as large as for GUSTO I. One can conclude that infusion t-PA was not sufficiently better than SK, and the difference was not strongly enough established in GUSTO I, to make therapeutic equivalence trials practical.
VII. CONCLUSION

In this chapter we have attempted to clarify the serious limitations of therapeutic equivalence trials. We have also tried to indicate that standard methods for the planning and analysis of such trials are problematic and potentially misleading, and we have described a new approach to planning and analysis of such trials. This new approach is based on the premise that a therapeutic equivalence trial is not interpretable unless one provides the quantitative evidence that the control treatment is effective. The method is presented in a Bayesian context but has a frequentist interpretation if flat priors are used for the α and γ parameters. An important implication of the new approach is that reliable therapeutic equivalence trials are not practical unless the evidence of the effectiveness of the control treatment is overwhelming. Unless this is the case, the sample size needed for the equivalence trial is many times larger than the sample size needed to establish the effectiveness of the control treatment. Conventional methods for planning therapeutic equivalence trials often miss this point because they take the maximum likelihood estimate of the effectiveness of the control treatment as if it were a value known with certainty. This ignores the fact that the degree of effectiveness of the control treatment is only imprecisely known unless the effect is overwhelmingly significant. For example, if the effect is of borderline significance, then the confidence interval for the size of the effect almost includes zero. Consequently, many planned therapeutic equivalence trials, even large multicenter trials, cannot demonstrate clinically relevant objectives. The methods described here for the planning of such trials will hopefully help organizations to avoid such misdirected efforts.
A corollary to these considerations is that superiority trials, rather than therapeutic equivalence trials with marginally effective control treatments, are strongly preferable whenever possible.
REFERENCES

1. Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treatment Rep 1978; 62:1037–1040.
2. Simon R. Confidence intervals for reporting results from clinical trials. Ann Intern Med 1986; 105:429–435.
3. Simon R. Why confidence intervals are useful tools in clinical therapeutics. J Biopharm Stat 1993; 3:243–248.
4. Blackwelder W. Proving the null hypothesis in clinical trials. Control Clin Trials 1982; 3:345–353.
5. Freiman JA, Chalmers TC, Smith HJ. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978; 299:690–694.
6. Durrleman S, Simon R. Planning and monitoring of equivalence studies. Biometrics 1990; 46:329–336.
7. Fleming T. Evaluation of active control trials in AIDS. J Acquir Immune Defic Syndr 1990; 3:S82–S87.
8. Gould A. Another view of active-controlled trials. Control Clin Trials 1991; 12:474–485.
9. Gould L. Sample sizes for event rate equivalence trials using prior information. Stat Med 1993; 12:2001–2023.
10. Simon R. Bayesian design and analysis of active control clinical trials. Biometrics 1999; 55:484–487.
11. The GUSTO III Investigators. A comparison of reteplase with alteplase for acute myocardial infarction. N Engl J Med 1997; 337:1118–1123.
12. The COBALT Investigators. A comparison of continuous infusion of alteplase with double bolus administration for acute myocardial infarction. N Engl J Med 1997; 337:1124–1130.
13. The GUSTO Investigators. An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. N Engl J Med 1993; 329:673–682.
11
Early Stopping of Cancer Clinical Trials

James J. Dignam
National Surgical Adjuvant Breast and Bowel Project and University of Pittsburgh, Pittsburgh, Pennsylvania, and University of Chicago, Chicago, Illinois
John Bryant and H. Samuel Wieand National Surgical Adjuvant Breast and Bowel Project and University of Pittsburgh, Pittsburgh, Pennsylvania
I. INTRODUCTION
Most cancer clinical trials use formal statistical monitoring rules to serve as guidelines for possible early termination. Such rules provide for the possibility of early stopping in response to positive trends that are sufficiently strong to establish the treatment differences the clinical trial was designed to detect. At the same time, they guard against prematurely terminating a trial on the basis of initial positive results that may not be maintained with additional follow-up. We may also consider stopping a trial before its scheduled end point if current trends in the data indicate that eventual positive findings are unlikely. For example, consider a trial comparing a new treatment to an established control regimen. Early termination for negative results may be called for if the data to date are sufficient to rule out the possibility of improvements in efficacy that are large enough to be clinically relevant. Alternatively, it may have become clear that study accrual, drug compliance, follow-up compliance, or other factors have rendered the study incapable of discovering a difference, whether or not one exists. In this chapter we discuss methods for early stopping of cancer clinical trials. We focus in particular on situations where evidence suggests that differences in efficacy between treatments will not be demonstrated, as this aspect of trial monitoring has received less attention. For concreteness, we restrict our attention to randomized clinical trials designed to compare two treatments, using survival (or slightly more generally, time to some event) as the primary criterion. However, the methods we discuss may be extended to other trial designs and efficacy criteria. In most applications, one treatment represents an established regimen for the disease and patient population in question, whereas the second is an experimental regimen to be tested by randomized comparison with this control. In this chapter, we first describe group sequential approaches to trial monitoring and outline a general framework for designing group sequential monitoring rules. We then discuss the application of asymmetric monitoring boundaries to clinical trials in situations where it is appropriate to plan for the possibility of early termination in the face of negative results. Next, we consider various approaches to assessing futility in ongoing trials, including predictive methods such as stochastic curtailment. We then briefly examine Bayesian methods for trial monitoring and early stopping. National Surgical Adjuvant Breast and Bowel Project (NSABP) Protocol B-14 is presented as a detailed example illustrating the use of stochastic curtailment calculations and Bayesian methods. We also give a second example to illustrate the use of a common asymmetric monitoring plan adapted for use in Southwest Oncology Group (SWOG) Protocol SWOG-8738. This approach is compared with a slight modification of a monitoring rule proposed by Wieand et al. We conclude with a discussion of considerations relevant to the choice of a monitoring plan.
II. GROUP SEQUENTIAL MONITORING RULES

The most common statistical monitoring rules are based on group sequential procedures. Consider a clinical trial designed to compare two treatments, A and B, using survival as the primary end point. The relative effectiveness of the two treatments can be summarized by the parameter δ = ln(λ_B(t)/λ_A(t)), the logarithm of the ratio of the hazard rates λ_B(t) and λ_A(t). We assume that this ratio is independent of time t. Thus, the hypothesis that the two treatments are equivalent is H_0: δ = 0, whereas values of δ > 0 indicate the superiority of A to B and values of δ < 0 indicate the superiority of B to A. Suppose patients are accrued and assigned at random to receive either treatment A or B. In a group sequential test of H_0, information is allowed to accumulate over time; at specified intervals an interim analysis is performed, and a decision is made whether to continue with the accumulation of additional information or to stop and make some decision based on the information collected to date. The accumulation of information is usually quantified by the total number of deaths, and the comparison of treatments is based on the logrank statistic.

A large number of group sequential procedures have been proposed in this setting. Most fall into a common general framework that we now describe: For k = 1, 2, . . . , K − 1, an interim analysis is scheduled to take place after m_k total deaths have occurred on both treatment arms, and a final analysis is scheduled to occur after the m_Kth death. Let L_k denote the logrank statistic computed at the kth analysis, let V_k denote its variance, and let Z_k represent the corresponding standardized statistic Z_k = L_k/√V_k. For each k = 1, 2, . . . , K, the real line R¹ is partitioned into a continuation region C_k and a stopping region S_k = R¹ − C_k; if Z_k ∈ C_k, we continue to the (k + 1)st analysis, but if Z_k ∈ S_k, we stop after the kth analysis. The stopping region for the Kth analysis is the entire real line, S_K = R¹. Define t_k = m_k/m_K, so that t_k represents the fraction of the total information available at the kth analysis, and let W_k = Z_k · √t_k, k = 1, 2, . . . , K. Under appropriate conditions (roughly, sequential entry, randomized treatment assignment, loss to follow-up independent of entry time and treatment assignment), the W_k behave asymptotically like Brownian motion: Defining Δt_k = t_k − t_{k−1} and η = δ · √m_K/2, the increments W_k − W_{k−1} are approximately uncorrelated normal random variables with means η · Δt_k and variances Δt_k (1–3). This result permits the extension of sequential methods based on evolving sums of independent normal variates to the survival setting. In particular, the recursive integration scheme of Armitage et al. (4) may be used to compute the density

f_k(w; η) = dPr{τ ≥ k, W_k ≤ w; η}/dw

where τ represents the number of the terminal analysis: Letting φ{·} represent the standard normal density,

f_1(w; η) = φ{(w − η · t_1)/√t_1}/√t_1

and for k = 2, 3, . . . , K

f_k(w; η) = ∫_{C_{k−1}} f_{k−1}(y; η) · [φ{(w − y − η · Δt_k)/√Δt_k}/√Δt_k] dy    (1)
From this result all operating characteristics of the group sequential procedure (size, power, stopping probabilities, etc.) may be obtained. In cases where a two-sided symmetric test of the hypothesis H_0: δ = 0 is appropriate, the continuation regions are of the form C_k = {Z_k: −b_k ≤ Z_k ≤ b_k}, k = 1, 2, . . . , K − 1. If Z_k < −b_k at the kth analysis, we reject H_0 in favor of H_A: δ < 0, whereas if Z_k > b_k, H_0 is rejected in favor of H_A: δ > 0. If testing continues to the Kth and final analysis, a similar decision rule is applied except that if −b_K ≤ Z_K ≤ b_K, we accept H_0 rather than continuing to an additional analysis. The b_k are chosen to maintain a desired experiment-wise type I error rate Pr{Reject H_0 | H_0} = α, and the maximum duration m_K of the trial is selected to achieve power 1 − β against a specified alternative H_A: δ = δ_A, by determining the value of η that yields Pr{Reject H_0 | η} = 1 − β, and then setting

m_K = 4 · η²/δ_A²    (2)
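For orientation, Eq. (2) is easy to evaluate. The sketch below uses the fixed-sample (single-analysis) drift η = z_{1−α/2} + z_{1−β}, which is an assumption for illustration: a genuine group sequential design requires the slightly larger η calibrated iteratively through Eq. (1). The hazard ratio of 1.5 is likewise illustrative.

```python
import math
from statistics import NormalDist

def max_deaths(delta_A, alpha=0.05, beta=0.10):
    """Eq. (2), m_K = 4*eta^2/delta_A^2, with the fixed-sample drift
    eta = z_{1-alpha/2} + z_{1-beta} (an approximation; a group sequential
    design needs the iteratively calibrated, slightly larger eta)."""
    z = NormalDist()
    eta = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    return math.ceil(4 * eta ** 2 / delta_A ** 2)

# Two-sided 0.05 test with 90% power against a hazard ratio of 1.5
m_K = max_deaths(math.log(1.5))
```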
Early stopping rules proposed by Haybittle (5), Pocock (6,7), O'Brien and Fleming (8), Wang and Tsiatis (9), and Fleming et al. (10) all fit into this general framework. In the method of Haybittle, a constant large critical value is used for analyses k = 1, 2, . . . , K − 1, and the final analysis is performed using a critical value corresponding to the desired overall type I error level. For a moderate number of analyses (say, K = 5) and a large critical value (z = 3.0 was suggested if one wishes to obtain an overall 0.05 level procedure), the method can be shown to achieve nearly the desired type I error rate, despite no adjustment to the final test boundary value. To obtain the final critical value that would yield precisely the desired overall level, Eq. (1) can be used. The Pocock bounds are obtained by constraining the z-critical values to be identical for each k: b_k = constant, k = 1, 2, . . . , K. For the O'Brien–Fleming procedure, the W-critical values are constant, so that b_k = constant/√t_k. Wang and Tsiatis (9) boundaries have the form b_k = constant · t_k^(Δ−1/2), where Δ is a specified constant. Fleming et al. (10) boundaries retain the desirable property of the O'Brien–Fleming procedure that the nominal level of the Kth analysis is nearly equal to α but avoid the extremely conservative nature of that procedure for small k when K > 3. Since most phase III trials compare a new therapy A to an accepted standard B, it may often be appropriate to consider one-sided hypothesis tests of H_0: δ = 0 versus H_A: δ > 0 and to make use of asymmetric continuation regions of the form C_k = {Z_k: a_k ≤ Z_k ≤ b_k} or, equivalently, C_k = {W_k: A_k ≤ W_k ≤ B_k}, where A_k = a_k · √t_k, B_k = b_k · √t_k.
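The boundary shapes just described can all be generated from the Wang–Tsiatis form b_k = constant · t_k^(Δ−1/2). A short sketch follows; the constants 2.41 (Pocock) and 2.04 (O'Brien–Fleming) are the approximate published two-sided 0.05-level calibrations for K = 5, quoted here only for illustration — in general the constant is found by iterating Eq. (1).

```python
# Wang–Tsiatis family b_k = c * t_k**(Delta - 1/2) at equally spaced
# information times t_k = k/K.  Delta = 1/2 gives constant z-critical values
# (Pocock); Delta = 0 gives constant W-critical values, i.e. b_k = c/sqrt(t_k)
# (O'Brien–Fleming).

def wang_tsiatis_bounds(c, delta, K):
    return [c * (k / K) ** (delta - 0.5) for k in range(1, K + 1)]

pocock = wang_tsiatis_bounds(2.41, 0.5, 5)  # flat z-boundaries
obf = wang_tsiatis_bounds(2.04, 0.0, 5)     # very steep early, 2.04 at the final look
```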
Crossing the upper boundary results in rejection of H_0 in favor of H_A, whereas crossing the lower boundary results in trial termination in recognition that the new therapy is unlikely to be materially better than the accepted standard or that H_0 is unlikely to be rejected with further follow-up. The design of such asymmetric monitoring plans presents no significant new computational difficulties. After restricting the choice of a_k and b_k, k = 1, 2, . . . , K, to some desired class of boundaries, Eq. (1) is used (generally in an iterative fashion) to fix both the size of the monitoring procedure and its power against a suitable alternative or set of alternatives. Equation (2) is used to determine the maximum duration of the trial in terms of observed deaths. In this context, DeMets and Ware (11) proposed the use of asymmetric Pocock boundaries: The lower boundary points a_k are fixed at some specified value independent of k (the range −2.0 ≤ a_k ≤ −0.5 is tabled), and then a constant value for the b_k is determined by setting the type I error rate to α. A second suggestion was to use a test with boundaries that are motivated by their similarity to those of a sequential probability ratio test (12). This procedure is most easily expressed in terms of its W-critical values, which are linear in information time:

A_k = −(Z′_L/η) + (η/2) · t_k,  B_k = (Z′_U/η) + (η/2) · t_k,  k = 1, 2, . . . , K    (3)
Here Z′_L = ln((1 − α)/β), and Z′_U and η are chosen to satisfy type I and type II error requirements by iterative use of Eq. (1). The maximum number of observed deaths is given by Eq. (2), as before. In a subsequent publication (13), DeMets and Ware recommended that the Wald-like lower boundary be retained but the upper boundary be replaced by an O'Brien–Fleming boundary B_k ≡ B, k = 1, 2, . . . , K. Although iteration is still required to determine η and B, the value of B is reasonably close to the symmetric O'Brien–Fleming bound at level 2α. Whitehead and Stratton (14) indicate how the sequential triangular test (15,16) may be adapted to the group sequential setting in which K analyses will be carried out at equally spaced intervals of information time, t_k = k/K, k = 1, 2, . . . , K. Suppose first that it is desired to achieve type I error rate α and power 1 − α against the alternative δ = δ_A. The W-critical values are

A_k = −Q + (3η/4) · t_k,  B_k = Q + (η/4) · t_k,  k = 1, 2, . . . , K

where η satisfies η² + (2.332/√K) · η − 8 · ln(1/(2α)) = 0 and Q = 2 · ln(1/(2α))/η − 0.583/√K. The maximum number of observed deaths required to achieve this is given by Eq. (2). If instead one wishes to achieve a power of 1 − β ≠ 1 − α against the alternative δ = δ_A, the operating characteristic curve of the fixed sample size test satisfying Pr{Reject H_0 | δ = 0} = α, Pr{Reject H_0 | δ = δ_A} = 1 − β may be used to determine an alternative δ′_A such that Pr{Reject H_0 | δ = δ′_A} = 1 − α. Then δ′_A should be used in place of δ_A in Eq. (2). The adjustment factor of 0.583/√K in the formula for Q is an approximate correction to exact results that hold in the case of a purely sequential procedure. Slightly more accurate results may be obtained by iteratively determining η and Q to satisfy type I and type II error constraints via Eq. (1). The triangular test approximately minimizes the expected number of events at termination under the alternative δ = δ′_A/2.
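The closed-form triangular-test constants above are simple to compute; a sketch:

```python
import math

# Whitehead–Stratton triangular test constants for K equally spaced looks at
# one-sided level alpha (power 1 - alpha at delta = delta_A), using the
# closed-form expressions quoted above:
#   eta^2 + (2.332/sqrt(K)) * eta - 8 * ln(1/(2*alpha)) = 0
#   Q = 2 * ln(1/(2*alpha)) / eta - 0.583 / sqrt(K)

def triangular_constants(alpha, K):
    b = 2.332 / math.sqrt(K)
    c = 8.0 * math.log(1.0 / (2.0 * alpha))
    eta = (-b + math.sqrt(b * b + 4.0 * c)) / 2.0  # positive root of the quadratic
    Q = 2.0 * math.log(1.0 / (2.0 * alpha)) / eta - 0.583 / math.sqrt(K)
    return eta, Q

def triangular_boundaries(alpha, K):
    eta, Q = triangular_constants(alpha, K)
    A = [-Q + (3.0 * eta / 4.0) * k / K for k in range(1, K + 1)]  # lower (futility) W-boundary
    B = [Q + (eta / 4.0) * k / K for k in range(1, K + 1)]         # upper (efficacy) W-boundary
    return A, B

eta, Q = triangular_constants(0.05, 5)
A, B = triangular_boundaries(0.05, 5)
```

Note that the quadratic forces Q = η/4 exactly, so A_K = B_K = η/2: the two boundaries meet at the final analysis and the continuation region is a closed triangle, which is what gives the test its name.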
Jennison (17) considered group sequential tests that minimize the expected number of events under various alternatives and presented parametric families of tests that are nearly optimal in this sense. These are specified in terms of spending functions, similar to Lan and DeMets (18). The power boundaries of Wang and Tsiatis (9) may be adapted for use in testing hypotheses of the form H_0: δ = 0 versus H_A: δ = δ_A (19,20). W-critical values are of the form

A_k = −Q · t_k^Δ + η · t_k,  B_k = (η − Q) · t_k^Δ,  k = 1, 2, . . . , K

where the constant Δ is specified by the trial designer; η and Q are determined iteratively to satisfy type I and type II error constraints using Eq. (1). The maximum number of required deaths is given by Eq. (2). Δ = 0 corresponds essentially to a design using an upper O'Brien–Fleming bound to test H_0: δ = 0 and a lower O'Brien–Fleming bound to test H_A: δ = δ_A. Δ = 1/2 results in Pocock-like boundaries. In general, larger values of Δ correspond to a greater willingness to
terminate at an earlier stage. Emerson and Fleming (19) compared the efficiencies of one-sided symmetric designs having power boundaries to the results of Jennison (17) and concluded that the restriction to boundaries of this form results in negligible loss of efficiency. Pampallona and Tsiatis (20) provide a comparison of the operating characteristics of asymmetric one-sided designs based on power boundaries with the designs proposed by DeMets and Ware (11,13). Both Emerson and Fleming (19) and Pampallona and Tsiatis (20) also consider two-sided group sequential procedures that allow for the possibility of early stopping in favor of the null hypothesis. These procedures are similar in spirit to the double triangular test (14). Wieand, Schroeder, and O'Fallon (21) proposed a method for early termination of trials when there appears to be no benefit after a substantial portion of the total events has been observed, which is tantamount to adopting asymmetric boundaries after sufficient information has been obtained to guarantee high power against alternatives of interest. The method is an extension of earlier work by Ellenberg and Eisenberger (22) and Wieand and Therneau (23) and was first considered for multistage trials in advanced disease, where patient outcomes are poor and there is likely to be substantial information regarding treatment efficacy while accrual is still underway. In its simplest form, the proposed rule calls for performing an interim analysis when one half of the required events have taken place. At that time, if the event rate on the experimental arm exceeds that on the control arm, then termination of the trial should be considered.
It can be shown that the adoption of this rule has essentially no effect on the size of a nominal 0.05-level test of equality of hazards and results in a loss of power of at most 0.02 for any alternative hypothesis indicating a treatment benefit, compared with a fixed sample size test of that same alternative at the scheduled definitive analysis (21). Similarly, if this rule is superimposed on symmetric two-sided boundaries by replacing the lower boundary a_k with 0 for any information time t_k greater than or equal to one half, and the result is viewed as an asymmetric group sequential procedure testing a one-sided hypothesis, there is almost no change in the operating characteristics. In this implementation, the stopping rule calls for early termination if at any scheduled interim analysis at or after the halfway point the experimental treatment is observed to be no more efficacious than the control.
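These operating characteristics are easy to check by simulation under the Brownian motion approximation of Section II. The sketch below superimposes the simplest one-look version of the rule (stop if Z ≤ 0 at t = 1/2) on a one-sided fixed-sample 0.05-level test with 90% power; the design is illustrative, not one taken from the text.

```python
import math
import random

def simulate(eta, n_sims=200_000, seed=1):
    """Monte Carlo sketch under the Brownian motion approximation:
    W(1/2) ~ N(eta/2, 1/2), plus an independent N(eta/2, 1/2) increment to
    W(1).  The final test rejects when Z(1) = W(1) > 1.645 (one-sided 0.05);
    the one-look futility rule stops when Z(1/2) <= 0, i.e. W(1/2) <= 0."""
    rng = random.Random(seed)
    reject_fixed = reject_with_rule = 0
    for _ in range(n_sims):
        w_half = rng.gauss(eta * 0.5, math.sqrt(0.5))
        w_final = w_half + rng.gauss(eta * 0.5, math.sqrt(0.5))
        if w_final > 1.645:
            reject_fixed += 1
            if w_half > 0:  # trial survived the futility look
                reject_with_rule += 1
    return reject_fixed / n_sims, reject_with_rule / n_sims

size_fixed, size_rule = simulate(eta=0.0)               # type I error: both near 0.05
power_fixed, power_rule = simulate(eta=1.645 + 1.2816)  # power at the 90%-power alternative
```

The simulated size is essentially unchanged and the power loss is well under 0.02, consistent with the results cited for the full rule.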
III. CONDITIONAL POWER METHODS

A. Stochastic Curtailment
A commonly applied predictive approach to early stopping makes use of the concept of stochastic curtailment (24–27). The stochastic curtailment approach requires a computation of conditional power functions, defined as
γ = Pr(Z(1) ∈ R | D, H)    (4)

where Z(1) represents a test statistic to be computed at the end of the trial, R is the rejection region of this test, D represents the current data, and H denotes either the null hypothesis H_0 or an alternative hypothesis H_a. If this conditional probability is sufficiently large under H_0, one may decide to stop and immediately reject H_0. On the other hand, if under a "realistic" alternative hypothesis H_a this probability is sufficiently small or, equivalently, if 1 − γ = Pr(Z(1) ∉ R | D, H_a) is sufficiently large, we may decide that continuation is futile because H_0 ultimately will not be rejected regardless of further observations. This is the case of interest when considering early stopping for negative results. In an example presented later, we condition on various alternatives in favor of the treatment to assess the potential for a trial to reverse from early interim analysis results unexpectedly favoring the control group. In Section II it was noted that the normalized logrank statistics asymptotically behave like Brownian motion. This provides an easy way to compute conditional power over a range of alternatives (26,27):

C(t) = 1 − Φ((Z_α − Z(t)√t − η(1 − t))/√(1 − t))    (5)
In Eq. (5), Φ(·) is the standard normal distribution function, t is the fraction of the total events for definitive analysis that have occurred to date (so-called information time; this was defined for prespecified increments as t_k = m_k/m_K in Sect. II), Z(t) is the current standard normal variate associated with the logrank test, Z_α is the critical value against which the final test statistic Z(1) is to be compared, and η is defined in Section II.

B. Predictive Power and Current Data Methods

Stochastic curtailment has been criticized on the basis that it requires conditioning on the current data and, at the same time, an alternative hypothesis that may be unlikely to have given rise to those data. In any case, since H_a must be specified, the method always depends on information unknown at the time of the decision. This criticism has motivated methods that take an unconditional predictive approach in assessing the consequences of continuing the trial (28–30). These so-called predictive power procedures use weighted averages of conditional power over values of the alternative, specified through a distribution:

Pr(Z(1) ∈ R | D) = ∫ Pr(Z(1) ∈ R | D, H) Pr(H | D) dH    (6)
A Bayesian formulation is a natural setting for this approach. If a noninformative prior distribution is used for the parameter of interest (e.g., H expressed as a difference of means, difference or ratio of proportions, or hazard ratio), then the posterior distribution in this weighted average of conditional power depends only on the current data. Alternatively, the current (observed) alternative could be used in the conditional power formulation in Eq. (5) to project the power resulting from further follow-up according to the pattern of observations thus far (26,29,30). When an informative prior distribution is used, we obtain a fully Bayesian approach, described in the following section.
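The conditional power computation of Eq. (5) can be sketched in a few lines, usable either with a hypothesized drift η or with the current estimate plugged in, as in the current-data approach just described. The numerical inputs below are illustrative, not taken from any trial in this chapter.

```python
from statistics import NormalDist

def conditional_power(z_t, t, eta, z_alpha=1.96):
    """Eq. (5): probability of crossing z_alpha at the final analysis, given
    the standardized logrank statistic z_t at information fraction t and
    assumed drift eta."""
    num = z_alpha - z_t * t ** 0.5 - eta * (1.0 - t)
    return 1.0 - NormalDist().cdf(num / (1.0 - t) ** 0.5)

# Halfway through a trial designed with drift eta = 3.24: with no observed
# difference (Z = 0) the chance of a significant final result is only about
# 0.3, while a strong interim trend (Z = 2) makes it much larger.
cp_null_trend = conditional_power(z_t=0.0, t=0.5, eta=3.24)
cp_strong_trend = conditional_power(z_t=2.0, t=0.5, eta=3.24)
```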
IV. A BAYESIAN APPROACH TO ASSESS EARLY TERMINATION FOR NO BENEFIT

Recently, interest has grown in the application of Bayesian statistical methodology to problems in clinical trial monitoring and early stopping (31–35). Although Bayesian analyses entail the difficult and sometimes controversial task of specifying prior information, if the goal of any clinical trial is ultimately to influence clinical practice, its results must be sufficiently strong to prove compelling to a community of clinical researchers whose prior opinions and experiences are diverse. Thus, in situations where early termination is considered, an analysis of the robustness of conclusions over a range of priors thought to resemble the a priori beliefs of reasonable members of the clinical research community should provide insight into the impact that trial results might be expected to exert on clinical practice. This is often an overlooked aspect of trial monitoring, as early stopping can result in diminished impact of the findings and continued controversy and delay while results are debated and large expensive trials are replicated. Bayesian calculations for clinical trial monitoring can be motivated by adopting the log hazard ratio δ defined in Section II as a summary measure of relative treatment efficacy (32). We denote the partial maximum likelihood estimate of the log hazard ratio by δ̂. δ has an approximately normal likelihood with mean δ̂ and variance 4/m, where m is the total number of events currently observed. We assume a normal prior distribution with specified mean δ_p and variance σ_p². The values of δ_p and σ_p² may be determined to reflect an individual's prior level of enthusiasm regarding the efficacy of a proposed regimen, and these parameters may be altered to reflect varying degrees of enthusiasm. In this spirit, the notion of "skeptical" and "optimistic" prior distributions is discussed by numerous authors (31–33,36).
It is suggested that a skeptical member of the clinical community may adopt a prior for δ that is centered at 0, reflecting the unfortunate fact that relatively few regimens tested lead to material improvements in outcome. Nevertheless, the trial designers will have specified a planning alternative for δ, say δ = δ_A, that they must believe is both clinically meaningful and relatively probable. If the skeptic is reasonably open-minded, he or she would be willing to admit some probability that this effect could be achieved, perhaps on the order of 5%. Using these considerations, a skeptical prior with mean δ_p = 0 and standard deviation σ_p = δ_A/1.645 is specified. Using similar logic, one might be inclined to consider the trial organizers as being among the most optimistic of its proponents, but even they would be reasonably compelled to admit as much as a 5% chance that the proposed regimen will have no effect, i.e., that δ ≤ 0. It therefore may be reasonable to model an optimist's prior by setting δ_p = δ_A and σ_p = δ_A/1.645. For a given prior distribution and the observed likelihood, a posterior density can be computed for δ, and the current weight of evidence for benefit can thus be assessed directly by observing the probability that the effect is in some specified range of interest, say δ > 0, indicating a benefit, or δ ≥ δ_ALT > 0, corresponding to some clinically relevant effect size δ_ALT. Following well-known results from Bayesian inference using the normal distribution, the posterior distribution for δ has mean and variance given by

δ_post = (n_0 · δ_p + m · δ̂)/(n_0 + m) and σ²_post = 4/(n_0 + m)

where n_0 = 4/σ_p². This quantity is thought of as the prior "sample size," since the information in the prior distribution is equivalent to that in a hypothetical trial yielding a log hazard ratio estimate of δ_p based on this number of events. From the posterior distribution one can also formulate a predictive distribution to assess the consequences of continuing the trial for some fixed additional number of failures. As before, let m be the number of events observed thus far and let n be the number of additional events to be observed. Denote by δ̂_n the log relative risk that maximizes that portion of the partial likelihood corresponding to failures m + 1, m + 2, . . . , m + n. The predictive distribution of δ̂_n is normal with the same mean as the posterior distribution and variance σ²_pred = 4/(n_0 + m) + 4/n.
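The posterior update is a one-line computation. The sketch below applies it to illustrative numbers resembling the B-14 third interim analysis discussed in Section V — δ̂ = ln(0.586) on m = 88 events, with a skeptical prior and the sign convention that δ > 0 indicates benefit; this pairing of inputs is our illustration, not a calculation reported in the text.

```python
import math
from statistics import NormalDist

def posterior(delta_p, sigma_p, delta_hat, m):
    """Normal-normal posterior for the log hazard ratio, as above:
    n0 = 4/sigma_p**2 is the prior 'sample size', the posterior mean is a
    precision-weighted average, and the posterior variance is 4/(n0 + m)."""
    n0 = 4.0 / sigma_p ** 2
    mean = (n0 * delta_p + m * delta_hat) / (n0 + m)
    var = 4.0 / (n0 + m)
    return mean, var

# Skeptical prior: centered at 0 with sigma_p = delta_A/1.645, where the
# planning alternative delta_A = ln(1/0.6) is a 40% reduction in event rate.
delta_A = math.log(1.0 / 0.6)
mean, var = posterior(0.0, delta_A / 1.645, math.log(0.586), 88)
p_benefit = 1.0 - NormalDist(mean, math.sqrt(var)).cdf(0.0)  # P(delta > 0)
```

Even under the skeptical prior the posterior probability of a benefit is only about 2%, indicating how little support the interim data lend to continued treatment.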
V. EXAMPLES
A. A Trial Stopped Early for No Benefit

In 1982, NSABP initiated Protocol B-14, a double-blind comparison of 5 years of tamoxifen (10 mg b.i.d.) with placebo in patients having estrogen receptor-positive tumors and no axillary node involvement. The first report of findings in 1989 indicated improved disease-free survival (DFS, defined as time to either breast cancer recurrence, contralateral breast cancer or other new primary cancer, or death from any cause; 83% vs. 77% event free at 4 years, p < 0.00001). Subsequent follow-up through 10 years has confirmed this benefit, with 69% of patients receiving tamoxifen remaining event free compared with 57% of placebo patients, and has also shown a significant survival advantage (at 10 years, 80% tamoxifen vs. 76% placebo, p = 0.02) (37).
In April 1987 a second randomization was initiated. Patients who had received tamoxifen and were event free through 5 years were rerandomized to either continue tamoxifen for an additional 5 years or receive placebo. Between April 1987 and December 1993, 1172 patients were rerandomized. To provide for a 0.05-level one-sided test with a power of at least 0.85 under the assumed alternative of a 40% reduction in the DFS failure rate, a total of 115 events would be required before definitive analysis. Four interim analyses were scheduled at approximately equal increments of information time. Stopping boundaries were obtained using the method of Fleming et al. (10) at the two-sided 0.10 level. Confidential interim end point analyses were to be compiled by the study statistician and presented to the independent Data Monitoring Committee of the NSABP. At the first interim analysis, based on all data received as of September 30, 1993, more events had occurred in the tamoxifen group (28 events) than among those receiving placebo (18 events). There had been six deaths on the placebo arm and nine among tamoxifen patients. By the second interim analysis (data received as of September 30, 1994), there were 24 events on the placebo arm and 43 on the tamoxifen arm (relative risk = 0.57, nominal 2p = 0.03), with 10 deaths on the placebo arm and 19 among patients receiving tamoxifen. Although there was concern regarding the possibility of a less favorable outcome for patients continuing tamoxifen, we recommended that the trial be continued to the next scheduled interim analysis because the early stopping criterion was not achieved (2α = 0.0030) and follow-up for most patients was relatively short (mean, 3.75 years). At that time, we computed the conditional probability of rejecting the null hypothesis at the scheduled final analysis (115 events), given the current data and a range of alternative hypotheses [Eq. (5)].
Results suggested that even under extremely optimistic assumptions concerning the true state of nature, the null hypothesis could almost certainly not be rejected: Even under the assumption of a 67% reduction in failures, the conditional probability of eventual rejection was less than 5%. We also considered an ‘‘extended trial,’’ repeating conditional power calculations as if we had intended to observe a total of 229 events before final analysis (this number of events would allow for the detection of a 30% reduction in event rate with a power of 85%). Results indicated that if the trial was continued and the underlying relative risk was actually strongly in favor of tamoxifen, then the null hypothesis could possibly be rejected (Fig. 1). At the third interim analysis (data received as of June 30, 1995), there were 32 events on the placebo arm and 56 on the treatment arm (relative risk ⫽ 0.59). Four-year DFS was 92% for patients on placebo and 86% for patients on tamoxifen. The boundary for early termination (2α ⫽ 0.0035) was not crossed (2p ⫽ 0.015). However, calculations showed that even if the remaining 27 events (of 115) all occurred on the placebo arm, the logrank statistic would not approach
Stopping Clinical Trials
199
Figure 1 Conditional probability of finding a significant benefit for tamoxifen in the NSABP B-14 trial if definitive analysis was deferred to the 229th event. Probabilities are conditional on the results of the second and third interim analysis and are graphed as a function of the assumed placebo/tamoxifen relative risk. The solid line is based on Eq. (5) (27); the dashed line is based on a binomial calculation following from a Poisson occurrence assumption. Reprinted from (41) with permission from Elsevier Science.
significance. The imbalance in deaths also persisted (13 placebo arm, 23 tamoxifen, 2p = 0.11). For the extended trial allowing follow-up to 229 events, Figure 1 shows that conditional power was now about 15% under the planning alternative of a 40% reduction in relative risk and was only 50% under the more unlikely assumption of a twofold benefit for continuing tamoxifen. At this time, we also considered the early stopping rule proposed by Wieand et al. (21) discussed earlier. To illustrate the consequences of superimposing this rule on the established monitoring boundaries of this trial, suppose the lower boundaries at the third, fourth, and final analysis were replaced with zeros. Then the (upper) level of significance is reduced from 0.0501 to 0.0496 and the power under the alternative of a 40% reduction in event rate is reduced from 0.8613 to 0.8596. By this interim analysis, considerably more events had occurred on the treatment arm than on the control arm. Had such a conservative "futility" rule as that described
200
Dignam et al.
above been incorporated into the monitoring plan, it would have suggested termination by this time. As discussed, the approaches taken in considering the early termination of the B-14 study were frequentist. We subsequently also applied Bayesian methods for comparative purposes and to attempt to address the broader question of consensus in clinical trials, as the closure of the B-14 study had prompted some criticism from the cancer research community (38–40). Figure 2 shows the log hazard ratio likelihood for the B-14 data at the third interim analysis, having mean ln(0.586) = −0.534 and standard deviation 0.213. Also shown is an "optimistic" prior distribution centered at δ_p = δ_A = 0.511,
Figure 2 Prior distribution, likelihood, and posterior distribution of the logged placebo/tamoxifen hazard ratio after the third interim analysis of NSABP B-14. An "optimistic" normal prior distribution is assumed, under which the most probable treatment effect is a 40% reduction in risk, with only a 5% prior probability that the treatment provides no benefit. The resulting posterior distribution contains about 13% probability mass to the right of ln(hazard ratio) δ = 0. Reprinted from (41) with permission from Elsevier Science.
corresponding to a 40% reduction in failures for patients continuing on tamoxifen relative to those stopping at 5 years. The prior standard deviation is σ_p = δ_A/1.645 = 0.311. The resulting posterior distribution has mean −0.199 and standard deviation 0.176 (also shown). From this distribution one can determine that the posterior probability that δ > 0 is 1 − Φ(0.199/0.176) = 0.13, where Φ(⋅) is the standard normal distribution function. To the degree that this prior distribution represents that of a clinical researcher who was initially very optimistic, these calculations suggest that even in the face of the negative trial findings as of the third interim analysis, this individual would still assign a small but nonnegligible probability to the possibility that continued tamoxifen has some benefit. On the other hand, this individual would now assign essentially no probability (≈3 × 10⁻⁵) to the possibility that the benefit is as large as a 40% reduction in risk. For the prior distribution specified above and the observations at the third interim analysis, we also computed the predictive distribution. If the trial were to be extended to allow a total of 229 events, or 141 events beyond the third interim analysis, we obtain μ_pred = −0.199 and σ_pred = 0.243. The predictive probability of a significant treatment comparison following the 229th event is determined as follows: If δ̂_{m+n} denotes the estimated log relative risk based on all the data, then the ultimate result will be significant at the 0.05 level if δ̂_{m+n} > 1.645 ⋅ √(4/229) = 0.217. Since approximately δ̂_{m+n} ≈ (88 δ̂_m + 141 δ̂_n)/229, this event requires that δ̂_n > 0.686. The predictive probability of this occurrence is 1 − Φ({0.686 + 0.199}/0.243) ≈ 0.0001. Subsequent follow-up of Protocol B-14 continues to support the findings that prompted early closure of this study.
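The conjugate normal calculations above are easily reproduced. The sketch below combines the stated prior (mean 0.511, standard deviation 0.511/1.645) with the likelihood at the third interim analysis (mean −0.534, standard deviation 0.213) and recovers the posterior mean −0.199, standard deviation 0.176, and posterior probability of benefit of about 0.13:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_posterior(prior_mean, prior_sd, lik_mean, lik_sd):
    """Conjugate normal update: the posterior mean is the
    precision-weighted average of the prior and likelihood means."""
    w0, w1 = 1.0 / prior_sd**2, 1.0 / lik_sd**2
    post_var = 1.0 / (w0 + w1)
    return post_var * (w0 * prior_mean + w1 * lik_mean), sqrt(post_var)

# Optimistic prior: mean delta_A = 0.511, sd = 0.511 / 1.645.
# Likelihood at the third interim: mean ln(0.586) = -0.534, sd 0.213.
m, s = normal_posterior(0.511, 0.511 / 1.645, -0.534, 0.213)
p_benefit = 1.0 - norm_cdf((0.0 - m) / s)  # posterior P(delta > 0)
print(round(m, 3), round(s, 3), round(p_benefit, 2))
```

The same update with a skeptical prior (mass concentrated near δ = 0) would, of course, push the posterior probability of benefit even lower.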
By 1 year subsequent to publication (data through December 31, 1996), 135 total events had occurred, 85 among patients who had received continued tamoxifen and 50 among patients rerandomized to placebo (relative risk = 0.60, nominal 2p = 0.002). There were 36 deaths among tamoxifen patients and 17 among control patients (nominal 2p = 0.01). A more extensive discussion of this case study has been published elsewhere (41).

B. Effect of Two Easily Applied Rules in the Adjuvant and Advanced Disease Setting

The trial presented in the preceding example was not designed with an asymmetric rule for stopping in the face of negative results. It was partly for this reason that the data monitoring committee and investigators considered several methods of analysis before reaching a decision to stop the trial. Although there will sometimes be special circumstances that require analyses not specified a priori, it is preferable to determine in advance whether the considerations for stopping are asymmetric in nature and, if so, to include an appropriate asymmetric stopping rule in the initial protocol design.
Computer packages (East, Cytel Software Corp., Cambridge, MA; and PEST3, MPS Research Unit, University of Reading, Reading, UK) are available to help with the design of studies using any of the rules discussed in Section II (see Emerson [42] for a review). Alternatively, one may modify a "standard" symmetric rule (e.g., O'Brien–Fleming boundaries) by retaining the upper boundary for early stopping due to positive results but replacing the lower boundary to achieve a more appropriate rule for stopping due to negative results. It is often the case that this will alter the operating characteristics of the original plan so little that no additional iterative computations are required. To illustrate this, suppose one had designed a trial to test the hypothesis H_0: δ = 0 versus δ > 0 to have 90% power versus the alternative H_A: δ = ln(1.5) with a one-sided α of 0.025, using a Fleming et al. (10) rule with three interim looks. Such a design would require interim looks when there had been 66, 132, and 198 events, with final analysis at 264 events. From Table 1a of Fleming et al. (10), one such rule would be to stop and conclude that the treatment was beneficial if the standardized logrank statistic Z exceeded 2.81 at the first look, 2.74 at the second look, or 2.67 at the third look. The null hypothesis would be rejected at the end of the trial if Z exceeded 2.02. If a symmetric lower boundary were considered inappropriate, one might choose to replace it by simply testing the alternate hypothesis H_A: δ = ln(1.5) versus δ < ln(1.5) at some small significance level (e.g., α = 0.005) at each interim look (this suggestion is adapted from the monitoring rules in SWOG Protocol SWOG-8738, an advanced disease lung cancer trial).
In the framework of Section II, this rule is asymptotically equivalent to stopping if the standardized Z is < −0.93 at the first look, or < −0.25 at the second look, or < 0.28 at the third look (a fact that one does not need to know to use the rule, since the alternative hypothesis can be tested directly using standard statistical software, e.g., SAS Proc PHREG [43]). It follows that if the experimental treatment adds no additional benefit to the standard regimen, there would be a 0.18 chance of stopping at the first look, a 0.25 chance of stopping at the second look, and a 0.22 chance of stopping at the third look (Table 1). Adding this rule does not significantly change the experiment-wise type I error rate (α = 0.0247) and would only lower the power to detect a treatment effect of δ_A = ln(1.5) from 0.902 to 0.899. Following Wieand et al. (21), an alternative but equally simple way to modify the symmetric Fleming et al. boundaries would be simply to replace the lower Z-critical values with zeroes at each interim analysis at or beyond the halfway point. Using this approach, if the experimental arm offered no additional benefit to that of the standard regimen, the probability of stopping at the first look would be very small (0.0025), but the probability of stopping at the second look would be 0.4975, and at the third look would be 0.10 (Table 1). Again no special program is needed to implement this rule, and its use has a negligible effect on the original operating characteristics of the group sequential procedure (α = 0.0248, power = 0.900).
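The three lower critical values quoted above (−0.93, −0.25, and 0.28) can be recovered directly. Testing H_A: δ = ln(1.5) against δ < ln(1.5) at one-sided level 0.005 rejects when (Z − δ_A V)/√V < −2.576; with the usual approximation V ≈ (number of events)/4, the implied thresholds on the standardized statistic Z/√V are:

```python
from math import log, sqrt

delta_A = log(1.5)   # reference alternative on the log relative risk scale
z_alpha = 2.576      # upper 0.005 quantile of the standard normal

for events in (66, 132, 198):
    v = events / 4.0  # approximate information at this look
    # Stop for negative results if the standardized Z falls below this:
    threshold = delta_A * sqrt(v) - z_alpha
    print(events, round(threshold, 2))
```

Printing the rounded thresholds reproduces −0.93, −0.25, and 0.28 at the three looks.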
Table 1 Probability of Stopping at Each of Three Early Looks Using Two Easily Applied Rules

                 Probability of Stopping          Probability of Stopping
                 If Treatments Are Equivalent     Under Alternative δ = ln(1.5)
No. of events    SWOG         WSO                 SWOG         WSO
66               0.18         0.0025              0.005        0.000
132              0.25         0.4975              0.004        0.010
198              0.22         0.10                0.003        0.001

WSO, Wieand, Schroeder, O'Fallon (21). SWOG, Southwest Oncology Group.
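The SWOG-rule entries of Table 1 can be checked by Monte Carlo simulation, treating the logrank score Z as a Gaussian process with independent increments of mean δ·ΔV and variance ΔV (the standard large-sample approximation); the event counts, boundary values, and V ≈ events/4 are taken from the example above:

```python
import random
from math import log, sqrt

random.seed(1)

def stop_probs(delta, looks=(66, 132, 198),
               lower=(-0.93, -0.25, 0.28), upper=(2.81, 2.74, 2.67),
               nsim=100_000):
    """Estimate the probability of stopping for negative results at each
    interim look.  The logrank score Z is simulated with E[Z] = delta * V
    and Var[Z] = V, where V is approximated by (number of events) / 4;
    boundaries apply to the standardized statistic Z / sqrt(V)."""
    counts = [0, 0, 0]
    for _ in range(nsim):
        z, v_prev = 0.0, 0.0
        for k, events in enumerate(looks):
            v = events / 4.0
            z += random.gauss(delta * (v - v_prev), sqrt(v - v_prev))
            v_prev = v
            z_std = z / sqrt(v)
            if z_std > upper[k]:   # early stop for benefit
                break
            if z_std < lower[k]:   # early stop for negative results
                counts[k] += 1
                break
    return [c / nsim for c in counts]

p_null = stop_probs(0.0)        # treatments equivalent
p_alt = stop_probs(log(1.5))    # alternative delta = ln(1.5)
print(p_null)
print(p_alt)
```

With this seed the estimates fall close to the tabulated SWOG values: roughly 0.18, 0.25, and 0.22 under equivalence, and below 0.01 at every look under the alternative.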
The decision of which rule to use will depend on several factors, including the likelihood that patients will still be receiving the treatment at the time of the early looks and whether it is likely that the experimental treatment would be used outside the setting of the clinical trial before its results are presented. To illustrate this, we consider two scenarios.

Scenario 1: The treatment is being tested in an advanced disease trial where the median survival with conventional therapy has been 6 months and the alternative of interest is to see if the experimental treatment results in at least a 9-month median survival. Under the assumption of constant hazards, this is equivalent to the hypothesis δ_A = ln(1.5). Suppose one would expect the accrual rate to such a study to be 150 patients per year. If one designed the study to accrue 326 patients, which would take 26 months, one would need to follow them for slightly less than 4.5 additional months to observe 264 deaths if the experimental regimen offers no additional benefit to the standard regimen, or an additional 7.5 months if δ = δ_A = ln(1.5). If the experimental treatment offers no additional benefit, one would expect 146 patients to have been entered when 66 deaths have occurred, 227 patients to have been entered when 132 deaths have occurred, and 299 patients to have been entered when 198 deaths have occurred (Table 2). Early stopping after 66 deaths have occurred would prevent 180 patients from being entered into the trial, and stopping after 132 deaths would prevent 99 patients from being entered. Thus, the potential benefit of stopping in the face of negative results would be to prevent a substantial number of patients from receiving the apparently ineffective experimental regimen, in addition to allowing early reporting of the results (the savings in time for reporting the results would be approximately 19, 12, and 7 months according to whether the trial stopped at the first, second, or third look, respectively).
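The accrual and event projections in Scenario 1 follow from uniform accrual and exponential survival. The sketch below (assumed model, not taken from the chapter) reproduces the 146, 227, and 299 patients quoted above by solving for the calendar time at which the expected death count reaches each look:

```python
from math import exp, log

def expected_deaths(t, accrual_rate, accrual_months, hazard):
    """Expected number of deaths by month t, assuming uniform accrual
    over [0, accrual_months] and exponential survival."""
    a, T, lam = accrual_rate, accrual_months, hazard
    if t <= T:
        return a * (t - (1.0 - exp(-lam * t)) / lam)
    return a * (T - (exp(-lam * (t - T)) - exp(-lam * t)) / lam)

def time_of_event(target, accrual_rate, accrual_months, hazard):
    """Solve expected_deaths(t) = target by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_deaths(mid, accrual_rate, accrual_months, hazard) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Scenario 1: 6-month median survival, 150 patients/year, 326 patients.
rate = 150 / 12.0     # patients accrued per month
T = 326 / rate        # about 26 months of accrual
lam = log(2) / 6.0    # exponential hazard for a 6-month median

for d in (66, 132, 198, 264):
    t = time_of_event(d, rate, T, lam)
    accrued = min(rate * t, 326)
    print(d, round(t, 1), round(accrued))
```

The final line of output also recovers the roughly 30.5 months (26 months of accrual plus slightly under 4.5 months of follow-up) to reach 264 deaths under the null.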
Scenario 2: The treatment is being tested in an adjuvant trial where the expected hazard rate is 0.0277 deaths/person-year, corresponding to a 5-year survival rate of slightly more than 87%. If one is now looking for an alternative δ_A
Table 2 Effect of Early Stopping on Accrual and Reporting Time

                 No. of Patients          No. of Patients          Time until Final
                 Accrued                  to be Accrued            Analysis (mo)
No. of           Advanced    Adjuvant     Advanced    Adjuvant     Advanced    Adjuvant
Events           Disease     Disease      Disease     Disease      Disease     Disease
                 Trial       Trial        Trial       Trial        Trial       Trial
66               146         1975         180         625          19          36
132              227         2600         99          0            12          24
198              299         2600         27          0            7           12
264              326         2600         0           0            0           0
= ln(1.5) (which would roughly correspond to increasing the 5-year survival rate to 91%) and if the accrual rate was approximately 800 patients per year, a reasonable plan would be to accrue 2600 patients, which would take approximately 39 months, and to analyze the data when 264 deaths have occurred, which should occur approximately 66 months after initiation of the trial if the experimental regimen offers no additional benefit to the standard regimen (75 months after initiation if δ = δ_A = ln(1.5)). With this accrual and event rate, 1975 of the expected 2600 patients will have been entered by the time 66 events have occurred if the experimental regimen offers no additional benefit to the standard regimen (Table 2). The second and third looks would occur approximately 3 and 15 months after the termination of accrual, so early stopping after these analyses would have no effect on the number of patients entering the trial, although it could permit early reporting of the results. The savings in time for reporting the results would be approximately 36, 24, and 12 months according to whether the trial stopped at the first, second, or third look, respectively. If there is little likelihood that the therapy will be used in future patients unless it can be shown to be efficacious in the current trial, there may be little advantage to reporting early negative results, and one might choose not to consider early stopping for negative results at any of these looks.
VI. SUMMARY

Statistical monitoring procedures are used in cancer clinical trials to ensure the early availability of efficacious treatments while at the same time preventing spurious early termination of trials for apparent benefit that may later diminish.
Properly designed, these procedures can also provide support for stopping a trial early when results do not appear promising, conserving resources and affording patients the opportunity to pursue other treatment options and avoid regimens that may have known and unknown risks while offering little benefit. By weighing these considerations against each other in the specific study situation at hand, a satisfactory monitoring procedure can be chosen. The group sequential monitoring rules described in this chapter differ with respect to their operating characteristics, and care should be taken to select a monitoring policy that is consistent with the goals and structure of a particular clinical trial. For example, among symmetric rules, the Pocock procedure is associated with a relatively large maximum number of events (m_K) required for final analysis. But the probability of early stopping under alternatives of significant treatment effect is relatively high, so that the expected number of events required to trigger reporting of results is reduced under such alternatives. In contrast, for the O'Brien–Fleming procedure, m_K is only very slightly greater than the number of events that would be required if no interim analyses were to be performed. The price paid for this is some loss of efficiency (in terms of expected number of events required for final analysis) under alternatives of significant treatment effect. In phase III cancer trials (particularly in the adjuvant setting), it is often the case that both accrual and treatment of patients are completed before a significant number of clinical events (e.g., deaths or treatment failures) have occurred, and more emphasis has been placed on the use of interim analysis policies to prevent the premature disclosure of early results than on their use to improve efficiency by allowing the possibility of early reporting.
In such circumstances, it has often been considered to be most important to minimize the maximum number of required events, leading to the rather widespread use of the O’Brien–Fleming method and similar methods such as those of Haybittle and Fleming et al. Other considerations (e.g., the need to perform secondary subset analyses, the possibility of attenuation of treatment effect over time) also argue for the accumulation of a significant number of events before definitive analysis and therefore favor policies that are rather conservative in terms of early stopping. The power boundaries of Wang and Tsiatis provide a convenient way to explore trade-offs between maximum event size and expected number of events to final analysis by considering a variety of values of the tuning parameter ∆. In the absence of a prespecified stopping rule that is sensitive to the possibility of early stopping for no benefit or negative results, such evidence of a negative effect for an experimental therapy at the time of an interim analysis may be analyzed with the help of conditional power calculations, predictive power, or fully Bayesian methods. These methods are quite practical, have had a careful mathematical development, and have well-studied operating characteristics. We discussed these approaches in Sections III and IV and applied several of them to data from the NSABP Protocol B-14, a study that was closed in the face of
negative results for an experimental schedule of extended tamoxifen (10 years vs. the standard 5 years). It is preferable to include a plan for stopping in the face of negative results at the time the study protocol is developed. In particular, it is important to know what effect the plan will have on the power and significance level of the overall design. The mathematics required to create an appropriate group sequential design that incorporates asymmetric monitoring boundaries is presented in Section II, with examples of several asymmetric designs in current use. Many factors enter into the choice of a design, including the anticipated morbidity of the experimental regimen, the severity of the disease being studied, the expected accrual rate of the study, and the likely effect of early release of results on other studies. The example in Section V.B showed that a rule applied in an advanced disease setting might prevent the accrual of a fraction of patients to an ineffective experimental regimen, whereas the same rule applied to an adjuvant trial is likely only to affect the timing of the presentation of results, as accrual may be completed before the rule is applied. When asymmetric monitoring boundaries are required, our preference has been to use simple approaches such as the Wieand et al. modification of symmetric boundaries or the use of an upper boundary of the O'Brien–Fleming or Fleming et al. type, coupled with a lower boundary derived by testing the alternative hypothesis H_A: δ = δ_A using the O'Brien–Fleming or Fleming et al. rules. In the latter case, one may or may not require the procedure to be closed (i.e., require the upper and lower boundaries to join at the Kth analysis). If closure is required, the use of O'Brien–Fleming boundaries leads to the method of Pampallona and Tsiatis (20) with ∆ = 0. Our experience is that these approaches are easily explained to (and accepted by) our clinical colleagues.
We advocate that trial designers give serious thought to the suitability of asymmetric monitoring rules. If such an approach is reasonable, the methods in Section II allow the statistician to develop a design that seems most appropriate for his or her situation. If at an interim look one is faced with negative results and has not designed the trial to consider this possibility, we recommend one of the approaches described in Sections III and IV. Even when one has had the foresight to include an asymmetric design, one may gain further insight regarding unexpected results by applying some of these methods. Of course, when one deviates from the original design, the initial power and significance level computations of the study are altered. After we completed this chapter, we became aware of a new volume by Jennison and Turnbull (44). They were kind enough to provide us with an advance copy as one of us (Bryant) was about to teach a Group Sequential Monitoring course at the University of Pittsburgh. Their work contains a comprehensive coverage of many of the methods and issues discussed in our chapter, and we heartily
recommend the book to individuals who wish to expand their knowledge regarding early stopping procedures.
REFERENCES

1. Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika 1981; 68:311–315.
2. Tsiatis AA. Repeated significance testing for a general class of statistics used in censored survival analysis. J Am Stat Assoc 1982; 77:855–861.
3. Gail MH, DeMets DL, Slud EV. Simulation studies on increments of the two-sample logrank score test for survival time data, with application to group sequential boundaries. In: Crowley J, Johnson RA, eds. Survival Analysis. Hayward, CA: Institute of Mathematical Statistics Lecture Notes—Monograph Series, Vol. 2, 1982, pp. 287–301.
4. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc Ser A 1969; 132:235–244.
5. Haybittle JL. Repeated assessment of results in clinical trials in cancer treatments. Br J Radiol 1971; 44:793–797.
6. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64:191–199.
7. Pocock SJ. Interim analyses for randomized clinical trials: the group sequential approach. Biometrics 1982; 38:153–162.
8. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556.
9. Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987; 43:193–199.
10. Fleming TR, Harrington DP, O'Brien PC. Designs for group sequential tests. Control Clin Trials 1984; 5:348–361.
11. DeMets DL, Ware JH. Group sequential methods for clinical trials with a one-sided hypothesis. Biometrika 1980; 67:651–660.
12. Wald A. Sequential Analysis. New York: John Wiley and Sons, 1947.
13. DeMets DL, Ware JH. Asymmetric group sequential boundaries for monitoring clinical trials. Biometrika 1982; 69:661–663.
14. Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation regions. Biometrics 1983; 39:227–236.
15. Whitehead J, Jones D. The analysis of sequential clinical trials. Biometrika 1979; 66:443–452.
16. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Chichester: Ellis Horwood Ltd., 1983.
17. Jennison C. Efficient group sequential tests with unpredictable group sizes. Biometrika 1987; 74:155–165.
18. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70:659–663.
19. Emerson SS, Fleming TR. Symmetric group sequential test designs. Biometrics 1989; 45:905–923.
20. Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. J Stat Plan Infer 1994; 42:19–35.
21. Wieand S, Schroeder G, O'Fallon JR. Stopping when the experimental regimen does not appear to help. Stat Med 1994; 13:1453–1458.
22. Ellenberg SS, Eisenberger MA. An efficient design for phase III studies of combination chemotherapies. Cancer Treatment Rep 1985; 69:1147–1154.
23. Wieand S, Therneau T. A two-stage design for randomized trials with binary outcomes. Control Clin Trials 1987; 8:20–28.
24. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical trials. Commun Stat Sequent Anal 1982; 1:207–219.
25. Halperin M, Lan KKG, Ware JH, Johnson NJ, DeMets DL. An aid to data monitoring in long-term clinical trials. Control Clin Trials 1982; 3:311–323.
26. Lan KKG, Wittes J. The B-value: a tool for monitoring data. Biometrics 1988; 44:579–585.
27. Davis BR, Hardy RJ. Upper bounds for type I and type II error rates in conditional power calculations. Commun Stat Theory Meth 1991; 19:3571–3584.
28. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Control Clin Trials 1986; 7:8–17.
29. Choi SC, Pepple PA. Monitoring clinical trials based on predictive probability of significance. Biometrics 1989; 45:317–323.
30. Jennison C, Turnbull BW. Statistical approaches to interim monitoring of medical trials: a review and commentary. Stat Sci 1990; 5:299–317.
31. Kass R, Greenhouse J. Comment on "Investigating therapies of potentially great benefit: ECMO" by J. Ware. Stat Sci 1989; 4:310–317.
32. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials [with discussion]. J R Stat Soc Ser A 1994; 157:357–416.
33. Freedman LS, Spiegelhalter DJ, Parmar MKB. The what, why and how of Bayesian clinical trials monitoring. Stat Med 1994; 13:1371–1383.
34. Greenhouse J, Wasserman L. Robust Bayesian methods for monitoring clinical trials. Stat Med 1995; 14:1379–1391.
35. Berry DA, Stangl DK, eds. Bayesian Biostatistics. New York: Marcel Dekker, 1996.
36. Parmar MKB, Ungerleider RS, Simon R. Assessing whether to perform a confirmatory randomized clinical trial. J Natl Cancer Inst 1996; 88:1645–1651.
37. Fisher B, Dignam J, Bryant J, et al. Five versus more than five years of tamoxifen therapy for breast cancer patients with negative lymph nodes and estrogen receptor-positive tumors. J Natl Cancer Inst 1996; 88:1529–1542.
38. Swain SM. Tamoxifen: the long and short of it. J Natl Cancer Inst 1996; 88:1510–1512.
39. Peto R. Five years of tamoxifen—or more? J Natl Cancer Inst 1996; 88:1791–1793.
40. Current Trials Working Party of the Cancer Research Campaign Breast Cancer Trials Group. Preliminary results from the Cancer Research Campaign trial evaluating tamoxifen duration in women aged fifty years or older with breast cancer. J Natl Cancer Inst 1996; 88:1834–1839.
41. Dignam J, Bryant J, Wieand HS, Fisher B, Wolmark N. Early stopping of a clinical trial when there is evidence of no treatment benefit: Protocol B-14 of the National Surgical Adjuvant Breast and Bowel Project. Control Clin Trials 1998; 19:575–588.
42. Emerson SS. Statistical packages for group sequential methods. Am Stat 1996; 50:183–192.
43. SAS Technical Report P-217. SAS/STAT Software: the PHREG Procedure, Version 6. Cary, NC: SAS Institute Inc., 1991. 63 pp.
44. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman & Hall/CRC, 2000.
12
Use of the Triangular Test in Sequential Clinical Trials

John Whitehead
The University of Reading, Reading, England
I. INTRODUCTION
A clinical trial is described as sequential if its design includes one or more interim analyses that could lead to a resolution of the primary therapeutic question. Thus it can be distinguished from a fixed-sample trial, in which there are no interim analyses and the necessary sample size is calculated in advance. In a sequential trial, the sample size is unknown in advance and is determined in part by the nature of the emerging data. Although a fixed-sample trial may not, in the event, achieve its target sample size, that will occur for practical or logistical reasons rather than as a consequence of the nature of the accumulating data. Trials with purely administrative looks or with interim assessments of safety only are not generally considered to be sequential, as the primary therapeutic question is not repeatedly addressed. Sequential clinical trials are becoming increasingly common in clinical research because they offer the ethical and economic advantages of avoiding continuation in the face of mounting evidence against a treatment and of requiring relatively small sample sizes when the advantage of a treatment becomes quickly and clearly apparent. Most sequential designs currently being used are derived from either the boundaries approach or the α-spending approach, and this chapter concentrates on the most frequently implemented member of the former class: the triangular test.
The triangular test is an efficient form of sequential procedure that uses as small a sample size as possible while still maintaining the required precision of the testing procedure. It is an asymmetrical procedure in the sense that overwhelming evidence is required to reach an early conclusion that an experimental treatment is effective, whereas the trial will be stopped for lack of effect as soon as it is apparent that continuation is futile. These features are made more precise in subsequent sections. The following section is an introduction to the clinical trial context in which the triangular test can most easily be applied. Section III consists of a detailed account of a trial in renal cancer that used the triangular test. The history and mathematical properties of the method are given in Section IV, and rivals and variations to the approach are described in Section V. Section VI is a survey of recent applications of the triangular test.
II. COMPARATIVE CLINICAL TRIALS

Throughout this chapter it is assumed that patients are being randomized between one experimental treatment E and one control treatment C and that the primary therapeutic question concerns a single patient response. The symbol θ will be used to denote the advantage of E over C in the patient population as a whole, whereas Z will denote the observed advantage apparent from the current data. The amount of information about θ contained in Z will be denoted by V. The quantity θ is an unknown population parameter, whereas Z and V are observable sample statistics. To clarify the meaning of each of the quantities above, two examples are given. Suppose that in a trial of cancer therapy, the primary patient response is the survival time of the patient from randomization to death. Then θ, the advantage of E over C, might be expressed as minus the log of the ratio of the hazard on E to that on C. The log is taken so that when hazards are equivalent, θ is equal to log(1) = 0, thus expressing zero advantage; the minus sign means that a reduction in hazard on E will show as a positive value of θ. The statistic Z is the logrank statistic, which can be thought of as the observed number of deaths on C minus the number expected under the null hypothesis of no advantage of E over C. The control treatment is focused on, so that a positive value of Z indicates an advantage of E. The statistic V will be the null variance of Z, which is approximately equal to one quarter of the number of deaths. The full formulae for Z and V are given in Section 3.4 of Whitehead (1). For a second example, take a comparison of the types of mattress on which a patient lies during surgery, in which the primary patient response is the incidence of pressure sores. Denote by p_E and p_C the probabilities of suffering a
pressure sore on the experimental and control (standard) mattresses, respectively. Then θ will be taken to be the log-odds ratio:

θ = log[(1 − p_E)/p_E] − log[(1 − p_C)/p_C].

Let S_E, S_C and F_E, F_C denote the numbers of successes (no pressure sores) and failures (pressure sores) on E and C, respectively. Then

Z = S_E − n_E S/n  and  V = n_E n_C S F/n³

where n_E and n_C denote the total number of patients on E and C, respectively, and S = S_E + S_C, F = F_E + F_C, and n = n_E + n_C. The traditional χ² statistic for a 2 × 2 contingency table, usually expressed as Σ(O − E)²/E, can be shown to be equal to Z²/V. At each interim analysis of the trial, Z and V are computed from the available data. The statistic Z is constructed so that its expected value is θV and its variance is V; consequently, a positive value of Z is encouraging. In a sequential analysis, the value of Z is plotted against V at each interim, resulting in an expected linear path of gradient θ, with random variation about it quantified by the variance V. This is illustrated in Figure 1.
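The identity χ² = Z²/V for the binary example is easy to verify numerically. In the sketch below the 2 × 2 counts are invented for illustration:

```python
def z_and_v(se, fe, sc, fc):
    """Efficient score Z and information V for the log-odds ratio
    in a 2x2 table: Z = S_E - n_E*S/n, V = n_E*n_C*S*F/n^3."""
    ne, nc = se + fe, sc + fc
    n, s, f = ne + nc, se + sc, fe + fc
    z = se - ne * s / n
    v = ne * nc * s * f / n**3
    return z, v

def pearson_chi2(se, fe, sc, fc):
    """Ordinary Pearson chi-squared statistic for the same table."""
    ne, nc = se + fe, sc + fc
    n, s, f = ne + nc, se + sc, fe + fc
    chi2 = 0.0
    for obs, row, col in ((se, ne, s), (fe, ne, f), (sc, nc, s), (fc, nc, f)):
        e = row * col / n  # expected count under independence
        chi2 += (obs - e)**2 / e
    return chi2

# Invented counts: 90/100 successes on E, 80/100 successes on C.
z, v = z_and_v(90, 10, 80, 20)
print(z, v, z**2 / v, pearson_chi2(90, 10, 80, 20))
```

For these counts Z = 5 and V = 6.375, and Z²/V agrees with the Pearson statistic to machine precision.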
Figure 1 Maintaining a plot of Z against V at interim analyses.
Sequential designs deriving from the boundaries approach are defined by superimposing boundaries on the Z-V plane. The triangular test is illustrated in Figure 2. A rising path of Z against V indicates growing evidence of an advantage of E over C, and the upper boundary is placed so that once the path crosses it, the trial can be stopped with a conclusion that E is significantly better than C. The trial is also stopped if the lower boundary is crossed, and a region corresponding to significant disadvantage of E over C is indicated as the solid portion of the lower boundary in Figure 2. In large samples the maximum likelihood estimate θ̂ of θ and its standard error SE(θ̂) are approximately equal to Z/V and 1/√V, respectively. One commonly used variation on the general scheme above is to use θ̂{SE(θ̂)}⁻² in place of Z and {SE(θ̂)}⁻² in place of V.

Figure 2 Boundaries of the triangular test. ——, reject the null hypothesis; – – –, do not reject the null hypothesis.

Sequential designs are constructed according to the same sort of power requirement as governs a conventional sample size calculation. Crossing the upper boundary represents significant evidence that E is better than C. It is arranged that when θ = 0, this occurs with probability α/2, and for θ equal to some reference improvement θR > 0, it occurs with probability (1 − β). The part of the lower boundary corresponding to significant evidence that E is worse than C is reached with probability α/2 when θ = 0; consequently, if either rejection region is reached, the final two-sided p value will be less than α.

Once the trial has stopped, an analysis must be performed. If a conventional analysis is applied, then the p value found is invalid, the point estimate of θ is biased, and confidence intervals for θ are too narrow. A variety of techniques is now available to overcome these problems and to produce acceptable analyses based on the form of sequential design used. Full details of both design and analysis methods are given in Whitehead (1).
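The operating characteristics of straight-line boundaries can be explored by simulation. The following is a rough sketch, not part of the chapter: it treats the sample path of Z against V as Brownian motion with drift θ, as described above, and estimates the probability of crossing the upper boundary. The boundary constants are those of the renal carcinoma trial of Section III; the step size, number of paths, and seed are arbitrary choices.

```python
import random

def prob_upper_crossing(theta, a, b, c_upper, c_lower,
                        dv=0.5, v_max=200.0, n_paths=4000, seed=2):
    """Estimate P(cross upper boundary Z = a + c_upper*V before the
    lower boundary Z = -b + c_lower*V), treating Z as Brownian motion
    with drift theta on the V scale."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        z = v = 0.0
        while v < v_max:
            v += dv
            z += theta * dv + rng.gauss(0.0, dv ** 0.5)  # drift + noise over dv
            if z >= a + c_upper * v:
                hits += 1
                break
            if z <= -b + c_lower * v:
                break
    return hits / n_paths

# Boundaries of the renal carcinoma design (Section III). At the reference
# improvement theta_R = 0.345 the estimate should be near the design power
# of 0.90 (slightly below it, because monitoring here is discrete).
power = prob_upper_crossing(0.345, 14.28, 14.28, 0.105, 0.315)
```

Because the two boundaries converge, every simulated path terminates; the discrete steps make crossing a little less likely than under continuous monitoring, which is exactly the phenomenon the Christmas tree correction of Section III addresses.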
III. A CLINICAL TRIAL OF ALPHA-INTERFERON IN METASTATIC RENAL CARCINOMA The MRC Renal Cancer Collaborators (2) described a multicenter, randomized, controlled trial in patients with metastatic renal carcinoma. Standard treatment in this indication is hormonal therapy with medroxyprogesterone acetate, and the patients randomized to the control group (C) received this therapy. The treatment under investigation was biological therapy with alpha-interferon, and patients receiving this treatment formed the experimental group (E). Patients were assigned to treatment using the method of minimization, stratifying by center and by whether the patient had had a nephrectomy and by single or multiple metastases. The primary treatment comparison concerned survival time from randomization to death. Treatment with alpha-interferon is both expensive and toxic. It was believed to be appropriate to continue the trial only as long as the emerging results were consistent with an outcome in favor of E. It was certainly not believed necessary to have a high power of establishing a significant disadvantage of E relative to C to dissuade clinicians from using E if the trial was negative. Following these considerations, a triangular design was chosen. Interim analyses were planned for 12 months after the start of the trial and at 6-month intervals thereafter. It was anticipated that recruitment would proceed at the rate of 125 patients per year. At each interim, the survival patterns of groups E and C were compared using a logrank test, stratified by whether the patient had had a nephrectomy before randomization. The stratification was not accounted for in the design, which was based on overall target and anticipated survival patterns on E and C. Denote the hazard functions of patients on E and C by hE(t) and hC(t), respectively, and the survivor functions by SE(t) and SC(t). 
For survival data, the parameter θ measuring the advantage of E over C was defined in Section II to be minus the log of the ratio of the hazard on E to that on C. Mathematically,

θ = −log{hE(t)/hC(t)}, for all t > 0.   (1)

The assumption that this quantity is constant over all t is known as the proportional hazards assumption. An equivalent expression for θ is

θ = −log{−log SE(t)} + log{−log SC(t)}, for all t > 0.   (2)
Based on results from Selli et al. (3), a 2-year survival rate of 0.2 was anticipated in the control group, that is, SC(2) = 0.2. A power of (1 − β) = 0.90 was set for achieving significance at the 5% level (two-sided alternative) if alpha-interferon increased this 2-year survival rate to SE(2) = 0.32. Substituting in Eq. (2) gives a reference improvement of

θR = −log{−log(0.32)} + log{−log(0.2)} = 0.345,

corresponding to a hazard ratio of hE(t)/hC(t) = 0.708. The triangular design satisfying this power requirement has upper and lower boundaries

Z = 14.28 + 0.105V and Z = −14.28 + 0.315V,

respectively. If the lower boundary is crossed with V ≤ 17.3, then E will be declared to be significantly inferior to C. Table 1 shows the properties of the design.

Table 1 Properties of the Triangular Test Used in the Renal Carcinoma Trial

                                Probability of finding   No. of deaths        Duration of
                                E significantly          at termination       trial (mo)
   θ       SC(2)   SE(2)        Better      Worse        Median (90th %ile)   Median (90th %ile)
 −0.345     0.2    0.103        0.000       0.293         82 (124)            17 (22)
 −0.173     0.2    0.148        0.000       0.105        109 (176)            21 (29)
  0         0.2    0.200        0.025       0.025        163 (284)            28 (41)
  0.173     0.2    0.258        0.362       0.004        246 (381)            38 (52)
  0.345     0.2    0.320        0.900       0.000        201 (341)            34 (49)

If θ = −0.345, that is, E is worse than C by a magnitude equal to the target improvement (on the log-hazards scale), then the power of obtaining a significant result is only 0.3. This emphasises the asymmetrical nature of the design and represents the scientific loss due to reducing expected sample size. (Although a scientific loss, it is of course an ethical gain.) The equivalent fixed sample size design would require 353 deaths and would last for 49 months. It can be seen that a substantial reduction in trial duration is likely to be achieved by use of the triangular test. In Table 1, the probabilities of finding E significantly better or worse, and the medians and 90th percentiles of the number of deaths, depend only on the values of the (minus) log-hazard ratio θ indicated. On the other hand, the medians and 90th percentiles of duration depend on the pretrial estimate of SC(2) as 0.2 being correct, on an exponential form of survival pattern in both groups, and on a steady entry rate of 125 patients per year. Recruitment was to continue until a boundary was reached or until 600 patients had entered the trial. Details of the trial design were published by Fayers et al. (4).

Recruitment to the trial began in February 1992. The recruitment rate averaged 60 per year throughout the trial, less than half the anticipated rate. A total of six interim analyses were performed, and these are summarized in Table 2. Each row first gives the date of the interim analysis, the number of patients recruited (n), and the number of known deaths (d). Then the logrank statistic (Z) and its null variance (V) are given separately for each stratum by nephrectomy and combined by summation. The Q statistic is Cochran's test statistic for heterogeneity between strata and is given by

Q = Σ(Z²/V) − (ΣZ)²/(ΣV).

This formula is familiar from meta-analysis (see ref. 5).
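The design quantities quoted above follow directly from Eq. (2); a minimal arithmetic check:

```python
import math

# Reproducing the design calculation in the text from Eq. (2):
# theta_R = -log(-log SE(2)) + log(-log SC(2)); hazard ratio = exp(-theta_R).
s_c2, s_e2 = 0.2, 0.32  # anticipated 2-year survival rates on C and E
theta_r = -math.log(-math.log(s_e2)) + math.log(-math.log(s_c2))
hazard_ratio = math.exp(-theta_r)
# theta_r is approximately 0.345 and hazard_ratio approximately 0.708,
# as quoted in the text.
```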
If this were a fixed sample study, Q would follow the χ² distribution on one degree of freedom; here caution is required in interpretation due to the repeated interim analyses.
Table 2 Interim Analyses for the Renal Carcinoma Trial

                           Nephrectomy       No nephrectomy     Combined
Date          n     d      Z       V         Z       V          Z       V        Q
29 Oct 93     69    20    −1.62    2.44      2.69    1.99       1.07    4.42     4.47
22 Sept 94   122    37    −1.84    4.79      4.21    3.75       2.38    8.54     4.78
20 Feb 95    158    67    −2.01    9.38      3.61    6.33       1.60   15.71     2.33
14 Feb 96    222   130     2.03   16.73      7.17   14.63       9.21   31.35     1.06
10 Feb 97    293   190     6.68   23.99      9.69   22.00      16.37   45.99     0.30
1 Oct 97     335   236     9.27   29.98     10.55   26.68      19.83   56.66     0.10
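The tabulated Q values can be reproduced from the stratum-level statistics. A sketch using the first interim analysis (small discrepancies arise because the tabulated Z and V are themselves rounded):

```python
# Cochran's heterogeneity statistic from the text:
# Q = sum(Z_i^2 / V_i) - (sum Z_i)^2 / (sum V_i).
def cochran_q(zs, vs):
    return sum(z * z / v for z, v in zip(zs, vs)) - sum(zs) ** 2 / sum(vs)

# First interim (29 Oct 93): nephrectomy and no-nephrectomy strata (Table 2).
q = cochran_q([-1.62, 2.69], [2.44, 1.99])  # close to the tabulated 4.47
```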
The combined values of Z and V are plotted against one another in Figure 3. This figure displays a feature of the stopping rule, which has not yet been discussed. The outer triangular boundaries are calculated to achieve the required error probabilities when monitoring of the trial is continuous. Because the interim analyses are discrete, it is possible that the hypothetical sample path arising from continuous monitoring might cross the boundaries undetected between interim analyses, and return by the time of the next look. Thus discrete monitoring makes stopping less likely. To compensate and achieve the required error probabilities, the stopping boundaries must be brought closer together. The jagged inner boundaries achieve this: The longer the gap between looks, the more the boundaries are brought in. The magnitude of the correction is 0.583 times the square root of the increment in V. The resulting boundaries are known as Christmas tree boundaries because of their shape. It is sufficient for the plotted point to reach these inner boundaries for the trial to be stopped. They can be used with any design based on straight-line boundaries but are especially accurate for the triangular test.

Figure 3 The final sequential plot for the MRC trial of alpha-interferon.

At each of the first two interim analyses, the Q statistic was nominally significant at the 5% level, indicating an interaction between treatment and nephrectomy group. Relative to control, the experimental group appeared to be benefited within the no nephrectomy stratum and disadvantaged within the nephrectomy stratum. These first two interim analyses, like all subsequent ones, were presented to a Data Monitoring Committee, and the apparent heterogeneity caused some concern. However, taking informal account of multiple and repeated testing, it was decided to take no action at either of the first two interims. The impression of heterogeneity thereafter subsided, and later analyses showed no trace of it at all.

At the sixth interim analysis, the upper boundary was reached. The trial protocol gave the Data Monitoring Committee the power to recommend stopping or continuing the trial. The validity of proportional hazards and the consistency of the advantage of E over C over various stratifications of the patients were considered informally and found to be satisfactory. The Data Monitoring Committee did recommend stopping, and the decision was confirmed by the MRC Renal Cancer Working Party. Recruitment was closed on 30 November 1997.

The analysis conducted on 1 October 1997, which allowed for the sequential nature of the design, found that alpha-interferon is associated with significantly better survival than medroxyprogesterone acetate, with p = 0.017 (two-sided alternative). The median unbiased estimate of the log-hazard ratio is θM = 0.334, with 95% confidence interval (0.062, 0.600). Transformed to the hazard ratio hE(t)/hC(t), a median unbiased estimate of 0.716 is obtained, with a 95% confidence interval of (0.549, 0.940). The value of θM is very close to the target improvement of θR = 0.345 under which the trial was powered. Two-year survival probabilities for E and C were estimated to be 0.22 and 0.12, respectively. Both are worse than anticipated, and this greater death rate has contributed to the reduction in duration of the trial.
Median survival times are estimated to be 8.5 months on E and 6.0 months on C, the extra 2.5 months on E being a modest advantage given the cost and toxicity of alpha-interferon. At the time of the sixth interim analysis, 236 deaths had occurred. This is just two thirds of the 353 calculated as required for a fixed sample design at the beginning of the trial. By reacting to both the increased death rate relative to predictions and to the emerging advantage of E, the triangular test appreciably reduced the duration of the renal carcinoma trial.
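The stopping decision at the sixth interim can be illustrated numerically. The sketch below is a simplified version of the correction described above (each straight-line boundary is pulled inward by 0.583√(ΔV) at the current look); the boundary constants and interim values are those quoted in this section.

```python
import math

UPPER_INTERCEPT, UPPER_SLOPE = 14.28, 0.105    # Z = 14.28 + 0.105 V
LOWER_INTERCEPT, LOWER_SLOPE = -14.28, 0.315   # Z = -14.28 + 0.315 V

def interim_decision(z, v, v_previous):
    """Christmas tree sketch: bring each boundary inward by
    0.583 * sqrt(increment in V) at this look, then test Z."""
    shrink = 0.583 * math.sqrt(v - v_previous)
    if z >= UPPER_INTERCEPT + UPPER_SLOPE * v - shrink:
        return "stop: E significantly better"
    if z <= LOWER_INTERCEPT + LOWER_SLOPE * v + shrink:
        return "stop: lower boundary reached"
    return "continue"

# Fifth interim (Table 2): Z = 16.37, V = 45.99, previous V = 31.35.
# Sixth interim: Z = 19.83, V = 56.66, previous V = 45.99.
fifth = interim_decision(16.37, 45.99, 31.35)
sixth = interim_decision(19.83, 56.66, 45.99)
```

Consistent with the trial's history, the fifth look falls just inside the corrected upper boundary (16.37 against roughly 16.88) and the trial continues, whereas the sixth look crosses it (19.83 against roughly 18.32) and the trial stops.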
IV. HISTORY OF THE TRIANGULAR TEST

Sequential analysis originated in the war-time work of Wald (6) and Barnard (7) concerning the inspection of newly manufactured batches of military hardware. These authors both developed the sequential probability ratio test (SPRT). Applied directly to the clinical trial context of Section II, the stopping boundaries for Z are a + cV (upper) and −b + cV (lower). These parallel boundaries are
open-ended, and so there is no upper limit on the amount of information that is required. The SPRT has an optimality property, stated and proved by Wald and Wolfowitz (8). For the clinical trial context, imagine that interim analyses are very frequent and that the power is set at 1 − α/2. Under these circumstances b = a and c = θR/2. If V* denotes the value of V at termination, then the optimality of the SPRT implies that both E(V*; 0) and E(V*; θR), which will be equal to one another, reach the minimum value achievable by a sequential test satisfying the specified power requirement.

Following the publications of Wald and Barnard (6,7), many authors sought to modify the SPRT and in particular to overcome the open-ended nature of the boundaries. Anderson (9) derived properties of a variety of procedures based on straight-line stopping boundaries, which included the triangular test. The 2-SPRT design of Lorden (10) and the minimum probability ratio test of Hall (11) are both alternative characterizations of the triangular test.

Now we return to the case of very frequent interim analyses and a power set to be 1 − α/2, in which the SPRT minimizes E(V*; θ) at θ = 0 and θ = θR. Despite this optimal property, the maximum value of E(V*; θ), which occurs at θ = θR/2, can be quite large. When θ = θR/2, the sample path has the maximum propensity for wandering between the two parallel boundaries. Lai (12) sought designs that minimize maxθ E(V*; θ) = E(V*; θR/2) for the specified power requirement. He described these and found that as α → 0 they become the triangular test, with the gradient of the lower boundary three times that of the upper boundary, as described in Sections I and II above. The recommendation of the triangular test as a means of minimizing expected sample size is justified by this result. Huang et al. (13) found that when α > 0, E(V*; θR/2) can be reduced further by using a triangular test that is longer and thinner than that illustrated in Figures 2 and 3; however, the reduction is very small. Jennison (14) reported numerical work seeking optimal designs when interim analyses are not especially frequent. When expressed on the Z-V diagram, they become similar to triangular stopping regions.

Although sequential analysis was first introduced in the context of quality control, its attractions for clinical research were quickly recognized. During the early 1950s, Bross (15) devised sequential medical plans, which resemble the double triangular test in spirit, and Kilpatrick and Oldham (16) implemented a sequential t-test (a form of SPRT) in a trial of bronchial dilators. Excellent surveys of the emerging methodology are given in the two editions of the book by Armitage (17,18). The latter edition describes a wide variety of designs, most based on the approximate properties of straight-line boundaries. Included are double versions of the SPRT (popularly known as the trouser test), restricted procedures, and skew plans that resemble the triangular test (see also Spicer [19]). The sequential designs of the mid-1970s suffered from four major limitations:
1. Only normally distributed and binary patient responses were catered for, and in the case of binary observations the artificial stratagem of matched pairs was usually necessary to eliminate nuisance parameters.
2. Approximations used in the theory were accurate only if interim analyses occurred after every individual response or matched pair of responses.
3. No special methods existed for producing a valid analysis once a sequential trial had been completed.
4. Software for sequential methods was rudimentary, and usually a choice had to be made from a limited repertoire of tabulated designs.

My own work (20,21) examined the analogy between the test statistics Z and V and the sample sum and sample size of independent normal observations with mean θ and variance 1. This allows many of the early procedures developed for the latter case to be applied far more generally. The requirement that sequential methods involve very frequent interim analyses was a greater practical barrier to widespread implementation. Continuous monitoring of data emerging from a trial was logistically difficult. More reasonable was a series of a few interim analyses, each conducted after a new group of patients had responded. The group sequential designs of Pocock (22) and O'Brien and Fleming (23) were based on an approach that was totally different from much of what had been developed earlier. The concept was not of boundaries but of setting the null probability of stopping to reject the null hypothesis at or before each interim analysis, so that the final value was equal to the required α. The idea was given flexibility and generality through the α-spending function introduced by Lan and DeMets (24). Although formulated in terms of error probabilities, "group sequential" designs could be displayed on the Z-V plane, where they were seen as symmetrical, bearing a general resemblance to restricted procedures.
Two camps of sequential methodology having been established, they rapidly began to converge. Asymmetrical group sequential methods introduced by DeMets and Ware (25) began the development of an α-spending counterpart to the triangular test, whereas adjustments for discrete monitoring (26) allowed the triangular test to be used with group sequential sampling and eventually led to the Christmas tree correction described in Section III above. For continuous monitoring, all sequential designs have both a boundaries and an α-spending function representation, two such functions being needed to characterize asymmetrical designs. Allowance for discrete monitoring can be achieved by using the recursive numerical integration routines described by Armitage et al. (27) or by corrections of continuous boundaries such as the Christmas tree method. The latter has been developed only for straight-line boundaries. It
is extremely accurate when used with the triangular test (28), but less precise for restricted procedures or the SPRT. The issue of posttrial analysis has attracted a great deal of attention, which is outside the scope of this chapter. For a summary of methods that lead to valid analyses, see Chapter 5 of Whitehead (1). Software in the form of PEST 4 (29) and EaSt 2000 (30) is now available; for a comparative review of earlier versions of these two packages, see Emerson (31). A sequential module for the language S-Plus is also available. The renal carcinoma trial described in Section III above was designed and analyzed using PEST. The triangular test was ready for application in the early 1980s and was first used in a comparison of anesthetic techniques (32) and in a clinical trial in lung cancer (33–35).
V. RIVALS AND VARIATIONS TO THE TRIANGULAR TEST
The triangular test is a design that will efficiently distinguish between superiority of an experimental treatment E over a control C and lack of superiority. The power properties noted in the alpha-interferon example of Section III hold quite generally: A triangular test designed to have high power of detecting a treatment advantage will have little power to detect a disadvantage of comparable magnitude. Figure 4, a and b, illustrates two asymmetric designs with similar properties. Figure 4a shows a truncated SPRT, whereas Figure 4b shows a member of the class of designs described by Pampallona and Tsiatis (36). The precise optimality results of the (untruncated) SPRT and the triangular test cited in Section IV motivate the following less formal considerations. The triangular test is effective in reducing sample size whatever the true value of θ might be and especially for moderate values lying between 0 and the reference improvement θR, for which the largest sample sizes occur. The truncated SPRT is effective at reducing sample size if θ is negative or exceeds θR. This can be seen from the fact that for small V, the boundaries of the truncated SPRT are closer together than those of the triangular test. To achieve such optimality, truncation should occur at quite a large value of V, so that the design resembles its untruncated counterpart. Truncation at a value of V 50% greater than the fixed sample size for equivalent power should suffice. If there is good reason to believe that θ exceeds θR, perhaps from previous studies, then the truncated SPRT might be chosen. It should lead to a small sample size if the prediction of substantial superiority proves true, while providing a valid if more lengthy procedure if such optimism turns out not to be justified. Sometimes there is reason to fear that θ < 0 but nevertheless to wish to proceed with a trial because the potential for benefit remains worthy of investigation. In this case too, a truncated SPRT is indicated.
It is likely, however, that the triangular test would be preferred in most trials.
Figure 4 Alternative sequential designs. (a) Truncated SPRT; (b) a design of Pampallona and Tsiatis; (c) reverse triangular test; (d) double triangular test.
The Pampallona and Tsiatis design shown in Figure 4b is one of a family of asymmetric designs indexed by the curvature of the boundaries. This one is very similar to a triangular test of equivalent power, especially if there are to be no interim analyses when V is small. The designs are not derived from considerations of optimality. Other sequential alternatives to the triangular test for fulfillment of asymmetric power requirements include stochastic curtailment procedures based on conditional or predictive power. The use of conditional power was introduced by Lan et al. (37), whereas Spiegelhalter et al. (38) discussed predictive power. These and other approaches are described in the book by Jennison and Turnbull (39). Although outside the scope of this chapter, it is worth remarking that any such stopping rule can be mapped onto the Z-V plane as a stopping boundary for comparison with the procedures discussed here. Such a step is recommended so that an informed choice of design can be made. Figure 4, c and d, shows, respectively, a reverse triangular test and a double triangular test. The reverse triangular test has high power of detecting inferiority of the experimental treatment but low power of showing advantage. It might be
used in cases in which E has clear nonefficacy advantages such as cost, ease of use, or safety, so that only proven inferiority will dissuade clinicians and patients from using it. The double triangular test will potentially distinguish between three true situations: E better than C, E no different from C, and E worse than C. It is sometimes used when investigators wish to hedge their bets. The most desirable trial outcome is a claim that E is better than C, but a fall-back in which E is no different from C but has secondary advantages (not apparent in the sequential plot) might be worthwhile. The double triangular test is also appropriate when demonstration of equivalence is the primary concern. When departing from a conventional objective of demonstrating superior efficacy, the values specified for power and for type I error rate must be chosen with caution (see ref. 40). The truncated SPRT and the reverse and double triangular tests are implemented in the computer package PEST 4, whereas the Pampallona and Tsiatis designs (including ‘‘double’’ versions) are available in EaSt 2000. Sometimes symmetrical designs are required, which will have small sample size only if a major treatment difference is apparent, with the maximum sample size being desirable otherwise. This may be to allow sufficient power to meet secondary objectives such as subgroup analysis or investigation of secondary end points. This requirement also rules out the double triangular test. In these situations, restricted procedures (1,18) or symmetrical designs based on the α-spending function approach described in Section IV could be implemented.
VI. RECENT CLINICAL TRIALS BASED ON THE TRIANGULAR TEST AND FURTHER WORK The triangular test has now been used in a wide variety of clinical studies concerned with many therapeutic areas. Examples include trials of corticosteroids for AIDS-induced pneumonia (41), of enoxaparin for prevention of deep vein thrombosis resulting from hip replacement surgery (42), of isradipine for the acute treatment of stroke (43), and of implanted defibrillators in coronary heart disease (44). In pediatric medicine, the triangular design has been used to study the use of surfactant to alleviate respiratory distress in infants (45) and in a trial concerning gastrointestinal reflux (46). An evaluation of the drug Viagra in the treatment of erectile dysfunction after spinal injury also used the method (47), and it has been implemented in animal studies of medical techniques (48). An interesting combination of the triangular test with the play-the-winner rule was applied in a study of spinal anesthesia during cesarean section (49). Within oncology, besides the renal and lung cancer trials mentioned in Sections III and IV, respectively, Storb et al. (50) described a triangular test of immunotherapy as a preparation for bone marrow transplantation in leukemia. The double triangular test has also found application. Nixon et al. (51)
described such a design in a comparison of pressure sore rates after the use of two types of mattress during cancer surgery, Boden et al. (52) reported a trial based on the design in cardiology, and Yentis (53) described an application to a trial in elective cesarean section. Other trials using triangular and related designs are given on the web page: http://www.rdg.ac.uk/mps/mps_home/software/pest4/practice.htm

The properties of the triangular test are now well understood, and its optimality makes it hard to improve. It is not appropriate for every situation, and suitable alternatives exist. The Christmas tree correction could be improved on, but in the case of the triangular test, not by much. An exact version of the triangular test, applicable to a single stream of binary observations, is described by Stallard and Todd (54). There is a range of response types to which the triangular and related designs can be extended; in particular, work is proceeding on survival responses with nonproportional hazards and longitudinal ordinal responses. Methods for analyzing data after sequential trials of this type are also being developed. In the future, two main challenges remain. One is to use the principles underlying the triangular design to help in the construction of optimal sequential designs for multivariate responses and for multiple treatment comparisons. The other is, quite simply, to spread its usage to all trials that can benefit from its ethical and economic advantages.
REFERENCES

1. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Revised 2nd ed. Chichester: Wiley, 1997.
2. MRC Renal Cancer Collaborators. Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet 1999; 353:14–17.
3. Selli C, Hinshaw W, Woodward BH, Paulson DF. Stratification of risk factors in renal cell carcinoma. Cancer 1983; 52:899–903.
4. Fayers PM, Cook PA, Machin D, Donaldson N, Whitehead J, Ritchie A, Oliver RTD, Yuen P. On the development of the Medical Research Council trial of α-interferon in metastatic renal carcinoma. Stat Med 1994; 13:2249–2260.
5. Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of randomized clinical trials. Stat Med 1991; 10:1665–1677.
6. Wald A. Sequential Analysis. New York: Wiley, 1947.
7. Barnard GA. Sequential tests in industrial statistics. J R Stat Soc 1946; Suppl 8:1–26.
8. Wald A, Wolfowitz J. Optimum character of the sequential probability ratio test. Ann Math Stat 1948; 19:326–339.
9. Anderson TW. A modification of the sequential probability ratio test to reduce sample size. Ann Math Stat 1960; 31:165–197.
10. Lorden G. 2-SPRT's and the modified Kiefer-Weiss problem of minimizing an expected sample size. Ann Stat 1976; 4:281–291.
11. Hall WJ. Sequential minimum probability ratio tests. In: Chakravarti IM, ed. Asymptotic Theory of Statistical Tests and Estimation. New York: Academic Press, 1980.
12. Lai TL. Optimal stopping and sequential tests which minimize the maximum expected sample size. Ann Stat 1973; 1:659–673.
13. Huang P, Dragalin V, Hall WJ. Asymptotic design of symmetric triangular tests for the drift of Brownian motion. Sequential Analysis, to appear.
14. Jennison C. Efficient group sequential tests with unpredictable group sizes. Biometrika 1987; 74:155–165.
15. Bross I. Sequential medical plans. Biometrics 1952; 8:188–205.
16. Kilpatrick GS, Oldham PD. Calcium chloride and adrenaline as bronchial dilators compared by sequential analysis. Br Med J 1954; 2:1388–1391.
17. Armitage P. Sequential Medical Trials, 1st ed. Oxford: Blackwell, 1960.
18. Armitage P. Sequential Medical Trials, 2nd ed. Oxford: Blackwell, 1975.
19. Spicer CC. Some new closed sequential designs for clinical trials. Biometrics 1962; 18:203–211.
20. Whitehead J. Large sample sequential methods with application to the analysis of 2 × 2 contingency tables. Biometrika 1978; 65:351–356.
21. Jones DR, Whitehead J. Sequential forms of the log rank and modified Wilcoxon tests for censored data. Biometrika 1979; 66:105–113. [See correction, Biometrika 1981; 68:576.]
22. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64:191–199.
23. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556.
24. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70:659–663.
25. DeMets DL, Ware JH. Group sequential methods for clinical trials with a one-sided hypothesis. Biometrika 1980; 67:651–660.
26. Whitehead J, Stratton I. Group sequential clinical trials with triangular continuation regions. Biometrics 1983; 39:227–236.
27. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc A 1969; 132:235–244.
28. Stallard N, Facey KM. Comparison of the spending function method and the Christmas tree correction for group sequential trials. J Biopharm Stat 1996; 6:361–373.
29. MPS Research Unit. PEST 4: Operating Manual. Reading: The University of Reading, 2000.
30. Cytel Software Corporation. EaSt 2000: A software package for the design and interim monitoring of group sequential clinical trials. Cambridge, MA: Cytel, 2000.
31. Emerson SS. Statistical packages for group sequential methods. Am Stat 1996; 50:183–192.
32. Hackett GH, Harris MNE, Plantevin M, Pringle HM, Garrioch DB, Avery A. Anaesthesia for out-patient termination of pregnancy, a comparison of two anaesthetic techniques. Br J Anaesth 1982; 54:865–870.
33. Jones DR, Newman CE, Whitehead J. The design of a sequential clinical trial for comparison of two lung cancer treatments. Stat Med 1982; 1:73–82.
34. Whitehead J, Jones DR, Ellis SH. The analysis of a sequential clinical trial for the comparison of two lung cancer treatments. Stat Med 1983; 2:183–190.
35. Newman CE, Cox R, Ford CHJ, Johnson JR, Jones DR, Wheaton M, Whitehead J. Reduced survival with radiotherapy and razoxane compared with radiotherapy alone for inoperable lung cancer in a randomised double-blind trial. Br J Cancer 1985; 51:731–732.
36. Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. J Stat Plan Infer 1994; 42:19–35.
37. Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical trials. Sequent Anal 1982; 1:207–219.
38. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Control Clin Trials 1986; 7:8–17.
39. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. London: Chapman and Hall/CRC, 2000.
40. Whitehead J. Sequential designs for equivalence studies. Stat Med 1996; 15:2703–2715.
41. Montaner JSG, Lawson LM, Levitt N, Belzberg A, Schechter MT, Ruedy J. Corticosteroids prevent early deterioration in patients with moderately severe Pneumocystis carinii pneumonia and the acquired immunodeficiency syndrome (AIDS). Ann Intern Med 1990; 113:14–20.
42. Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharm Med 1992; 6:179–191.
43. Whitehead J. Application of sequential methods to a phase III clinical trial in stroke. Drug Inform J 1993; 27:733–740.
44. Moss AJ, Hall WJ, Cannom DS, Daubert JP, Higgins SL, Klein H, Levine JH, Saksena S, Waldo AL, Wilber D, Brown MW, Heo M. Improved survival with an implanted defibrillator in patients with coronary disease at high risk for ventricular arrhythmia.
N Engl J Med 1996; 335:1933–1940. 45. Gortner L, Pohlandt F, Bartmann P, Bernsau U, Porz F, Hellwege H-H, Seitz RC, Hieronimi G, Kuhls E, Jorch G, Hentschel R, Reiter H-L, Bauer J, Versmold H, Meiler B. High-dose versus low-dose bovine surfactant treatment in very premature infants. Acta Paediatr 1994; 83:135–141. 46. Bellisant E, Duhamel J-F, Guillot M, Pariente-Khayat A, Olive G, Pons G. The triangular test to assess the efficacy of metoclopramide in gastroesophageal reflux. Clin Pharma Ther 1997; 61:377–384. 47. Derry FA, Dinsmore WW, Fraser M, Gardner BP, Glass CA, Maytom MC, Smith MD. Efficacy and safety of oral sildenafil (Viagra) in men with erectile dysfunction caused by spinal cord injury. Neurology 1998; 51:1629–1633. 48. Niemann TJ, Cairns CB, Sharma J, Lewis RJ. Treatment of prolonged ventricular fibrillation: immediate countershock versus high dose epinephrine and CPR preceding countershock. Circulation 1992; 85:281–287. 49. Rout CC, Rocke DA, Levin J, Gouws E, Reddy D. A reevaluation of the role of crystalloid preload in the prevention of hypotention associated with spinal anesthesia for elective cesarean section. Anesthesiology 1993; 79:262–269.
228
Whitehead
50. Storb R, Deeg, J, Whitehead J, Appelbaum F, Beatty P, Bensinger W, Buckner CD, Clift R, Doney K, Farewell V, Hansen J, Hill R, Lum L, Martin P, McGuffin R, Sanders J, Stewart P, Sullivan K, Witherspoon R, Yee G, Thomas ED. Methotrexate and cyclosporine compared with cyclosporine alone for prophylaxis of acute graft versus host disease after bone marrow transplantation for leukemia. N Engl J Med 1986; 314:729–735. 51. Nixon J, McElvenny D, Mason S, Brown J, Bond S. A sequential randomised controlled trial comparing a dry visco-elastic polymer pad and standard operating table mattress in the prevention of post-operative pressure sores. Int J Nurs Stud 1998; 35:193–203. 52. Boden WE, van Gilst WH, Scheldewaert RG, Starkey IR, Carlier MF, Julian DG, Whitehead A, Bertrand ME, Col JJ, Pedersen OL, Lie KI, Santoni J-P, Fox KM. Diltiazem in acute myocardial infarction treated with thrombolytic agents: a randomised placebo-controlled trial. Lancet 2000; 355:1751–1756. 53. Yentis SM, Jenkins CS, Lucas DN, Barnes PK. The effect of prophylactic glycopyrrolate on maternal haemodynamics following spinal anaesthesia for elective caesarian section. Int J Obstet Anesth 2000; 9:156–159. 54. Stallard N, Todd S. Exact sequential tests for single samples of discrete responses using spending functions. Statistics in Medicine 2000; 19:3051–3064.
13
Design and Analysis Considerations for Complementary Outcomes

Bernard F. Cole
Dartmouth Medical School, Lebanon, New Hampshire

I. INTRODUCTION
In cancer clinical trials, data are routinely collected for various patient outcomes, including treatment-related adverse events and clinical response. Adverse event data are used to describe the risks associated with a new treatment, and the clinical response data are used to describe the benefits. Clinical response is often defined in terms of changes in tumor size, time until disease progression, or time until death. By compiling such data, researchers are able to make an objective evaluation of the safety and efficacy of a new therapy. Modern clinical trials often collect data in addition to these usual outcomes. The two most common of these ‘‘complementary outcomes’’ are quality of life (1–3) and factors related to economic cost (4). By including these outcomes in clinical trials, researchers are able to address questions regarding quality of life and monetary cost that may be raised by patients, physicians, payers, and policy makers considering the use of a new regimen. In this chapter, we provide an overview of design and analysis considerations relating to complementary outcomes in cancer clinical trials.
II. QUALITY-OF-LIFE ASSESSMENT

A. History
Early measures of quality of life in cancer focused on physical functioning. The Karnofsky performance status (KPS), introduced in 1948 (5,6), is generally considered to be the first such measure. KPS is measured on an 11-point scale from 0% to 100% (10% increments), where 0% denotes death, 100% denotes normal function, and other values denote "approximate percentage of normal physical performance." The KPS assessment is made by the clinician.
Subsequent efforts in quality-of-life assessment evaluated illnesses and therapeutic regimens from the patients' perspective. For example, in 1971 Izsak and Medalie (7) developed a multidimensional scale that measured physical, social, and psychological variables in cancer patients. In 1975 a trial for patients with acute myelogenous leukemia used a six-level assessment of quality of life ranging from "hospital stay throughout illness" to "no symptoms, normal life" (8). The assessments were based on patient reports of their symptoms and functioning.
Modern quality-of-life assessment in cancer clinical trials is generally cited to have begun in 1976 with Priestman and Baum's study of breast cancer treatment (9). Using a 10-question instrument, they assessed patients' general feeling of well-being, mood, level of activity, pain, nausea, appetite, ability to perform housework, social activities, general level of anxiety, and overall treatment experience. The results indicated that this instrument could be used to assess the subjective benefit of treatment in individual women, to detect changes over time, and to compare different treatments within a clinical trial.

B. Measuring Quality of Life
Many instruments are available for measuring quality of life in clinical trials. These can be divided into general and disease-specific instruments. Commonly used general instruments include the SF-36 (10), the Sickness Impact Profile (11), and the SCL-90-R (12). Each of these instruments includes general questions relating to a patient’s health and functioning, and they can be applied in a wide range of disease settings. A list of cancer-specific instruments is provided in Table 1. The goal of each instrument is to measure overall quality of life and various quality-of-life domains, such as physical functioning, social functioning, and mental health. Other domains include disease symptoms, pain, general health perceptions, vitality, and role functioning. Quality-of-life instruments usually include several individual questions, or items pertaining to a particular domain, and the domain score (also called scale score) is obtained by summarizing the responses from the associated items (e.g., average of the item responses). Each
Table 1  Cancer-Specific Quality-of-Life Measurement Instruments

Breast Cancer Chemotherapy Questionnaire (BCQ) (43). Items: 30. Domains assessed: attractiveness, fatigue, physical symptoms, inconvenience, emotional, hope, social support.

Cancer Rehabilitation Evaluation System (CARES) (44). Items: 93–132. Domains assessed: physical, psychosocial, medical interaction, marital, sexual, symptom- and treatment-specific items.

European Organization for Research and Treatment of Cancer scale (EORTC QLQ-C30) (45). Items: 42. Domains assessed: five functional scales (physical, role, cognitive, emotional, social), three symptom scales (fatigue, pain, nausea), disease-specific items, global quality of life.

Functional Assessment of Cancer Therapy (FACT) (46). Items: 36–40. Domains assessed: physical, social/family, relationship with doctor, emotional, functional well-being, disease-specific items.

Functional Living Index—Cancer (FLIC) (47). Items: 22. Domains assessed: psychological, social, disease symptoms, global well-being, treatment and disease issues, physical functioning.

International Breast Cancer Study Group Quality of Life Questionnaire (IBCSG-QL) (48). Items: 10. Domains assessed: physical well-being, mood, fatigue, appetite, coping, social support, symptoms, overall health.

Linear Analogue Self-Assessment (LASA) (9). Items: 25. Domains assessed: physical, psychological, social.

Quality of Life Index (QLI) (18). Items: 5. Domains assessed: physical activity, daily living, health perceptions, psychological, social support, outlook on life.
instrument has its own rules regarding the computation of the domain scores, and these rules are established after careful testing. Each item is generally measured on a Likert scale or a linear analogue self-assessment (LASA) scale. The Likert scale is an ordered categorical scale consisting of a limited choice of clearly defined responses. The most frequently used scales have either four or five categories. In contrast, the LASA scale is an unmarked line, usually 10 cm long, with text at either end describing the extremes of the scale. Each patient is asked to place a mark on the line in a position that best reflects his or her response relative to the two labeled extreme points.
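The item-to-domain summarization just described can be sketched in a few lines. The conventions below (a shared five-point Likert range, a linear rescaling to 0–100, and a "half-rule" for missing items) are illustrative assumptions only; as noted above, each instrument publishes its own scoring rules.

```python
def domain_score(item_responses, item_min=1, item_max=5):
    """Summarize a patient's Likert item responses into a 0-100 domain score.

    Generic sketch only: real instruments define their own rules.  Here the
    domain score is the mean of the answered items, rescaled linearly to
    0-100, and the score is treated as missing unless at least half of the
    items were answered (a common "half-rule").
    """
    answered = [r for r in item_responses if r is not None]
    if len(answered) < (len(item_responses) + 1) // 2:
        return None  # too many missing items; domain score not computable
    raw_mean = sum(answered) / len(answered)
    return 100.0 * (raw_mean - item_min) / (item_max - item_min)

# Four-item domain on a 5-point Likert scale, one item skipped:
print(domain_score([2, 3, None, 4]))  # mean 3.0 on the 1-5 range -> 50.0
```

The same pattern extends to LASA items by taking `item_min` and `item_max` as the endpoints of the measured line.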
C. Measuring Patients' Preferences and Utilities
In addition to measuring descriptive quality of life, it is possible to measure a patient's preference, or utility, for particular health states. This can be accomplished by assessing a patient's value for one health state compared with another based on quality-of-life considerations. For example, two patients might report similar symptoms with similar frequency and duration, but they may differ on how important these symptoms are in their daily lives. Descriptive quality-of-life instruments will correctly provide similar scores for these two patients, whereas a measurement of preference or utility will differentiate them.
Utility is measured on a scale from 0 to 1, where 0 denotes a health state "as bad as death" and 1 denotes a health state "as good as perfect health." Values between 0 and 1 denote degrees between these extremes. A simple interpretation of a utility for a specific health state, A, is that the utility represents the amount of time in a state of perfect health that a patient values as equal to one unit of time in state A. For example, suppose that state A has a utility of 0.7. Then 1 month in state A is equivalent in value to 0.7 months of perfect health. This interpretation leads to the idea that quality-of-life-adjusted time can be obtained by multiplying a health state duration by its utility coefficient. For example, if a patient experiences 6 months of toxicity and has a utility weight of 0.8 for time with toxicity, then the quality-adjusted time spent with toxicity is 4.8 months. This adjustment allows treatments that have different impacts on quality of life to be compared in a meaningful way.
Classically, utility assessment is carried out using interview techniques. The "standard gamble" technique gives patients a choice between a chronic health state with certainty or an uncertain health state that is either perfect health (with probability p) or death (with probability 1 − p).
The probability p is varied until the patient is indifferent between the certain and the uncertain choice and the final p is taken as the utility value. The ‘‘time trade-off’’ technique gives patients a choice between living for a certain amount of time in a state of less than perfect health or a shorter amount of time in a state of perfect health. The duration of the ‘‘perfect health’’ state is varied until the patient expresses indifference to the choice. The utility is then taken as the ratio of the final health state durations. For a detailed overview of utility assessment, see Bennett and Torrance (13). Interview techniques are cumbersome to use in practice. Fortunately, there are procedures for obtaining utility data from quality-of-life instruments using multiattribute utility theory (14). Generally, these procedures were developed by administering both the instrument and the interview to a study sample and building a statistical model for predicting the utility value from the instrument responses. Instruments that can be used for this purpose include the EuroQol (15),
Health Utilities Index (16), and the Q-tility Index (17), which uses Spitzer’s Quality of Life Index (18).
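However the utility weights are obtained, they enter an analysis through the simple multiplication described earlier: each health-state duration is weighted by its utility and the weighted durations are summed. A minimal sketch:

```python
def quality_adjusted_time(durations, utilities):
    """Sum of health-state durations, each weighted by its utility in [0, 1].

    Implements the weighting described in the text: a month spent in a
    state with utility u counts as u quality-adjusted months.
    """
    if len(durations) != len(utilities):
        raise ValueError("need one utility weight per health state")
    return sum(d * u for d, u in zip(durations, utilities))

# 6 months in a state with utility 0.5, then 12 months at full health:
print(quality_adjusted_time([6, 12], [0.5, 1.0]))  # 3.0 + 12.0 = 15.0 months
```

This same weighted sum reappears below as the Q-TWiST end point, where the durations are restricted mean health-state durations estimated from trial data.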
III. ANALYSIS OF QUALITY-OF-LIFE DATA

A. Overview

Quality-of-life data are generally collected longitudinally in cancer clinical trials. Often the data collection is most intense during the treatment phase of the study, when patients have frequent clinic visits (e.g., 1-month intervals). Posttreatment measurements are generally collected at longer intervals corresponding to follow-up visits or are obtained using mailed surveys (e.g., 6-month intervals). Therefore, a longitudinal analysis procedure is appropriate. Standard methods include repeated-measures analysis of variance or more general mixed effects regression models. Other techniques include growth curve models and the construction of summary measures.
The main difficulty in analyzing longitudinal quality-of-life data is the appropriate handling of missing observations. Observations may be missing for many reasons, some of which may be considered missing at random, whereas others are related to the quality of life the patient is experiencing. For example, a patient may be too ill to complete the questionnaire, or the patient may be doing so well that he or she does not visit the clinic at the time when a quality-of-life assessment is due.
As with most statistical analyses, a graphical display of the data is a useful starting point. A common display for longitudinal quality-of-life data consists of a plot of mean scores over time according to treatment group for each quality-of-life domain measured. It is useful to indicate on these graphs which assessment time points occurred during the treatment phase of the study. It is also useful to indicate how many subjects provided data at each time point. These summaries will help to guide the modeling of the data and the interpretation of the modeling results.

B. Repeated-Measures Analysis of Variance

Repeated-measures analysis of variance is a commonly used and convenient approach to modeling longitudinal quality-of-life data.
Generally, the model predicts a specific domain of quality of life using treatment group, time point, and a treatment group by time point interaction as independent factors. Other covariates, such as age, may be included as adjustment factors if confounding is a concern. In addition, specific contrasts of the regression parameters can be evaluated. For example, if a treatment by time interaction is found, the treatment effect
can be evaluated at particular time points by testing the appropriate linear combination of the regression parameters. Standard software for this analysis usually assumes compound symmetry for the covariance matrix of the longitudinal assessments. The use of mixed effects regression analysis allows one to fit other forms for the covariance matrix or to leave the covariance matrix unspecified, in which case it is estimated from the data. When quality-of-life assessments are obtained sporadically or at varying time points, one can model covariance as a function of the time difference between two assessments.
Generally, a separate analysis is performed for each quality-of-life measure obtained in a study. However, this approach can lead to inflated type I error due to the multiple testing. Although a multivariate analysis approach may be used, this uses only subjects with valid data on all subscales. Often patients are missing one or more subscales from an instrument, so that the multivariate approach uses only a fraction of the available subjects. Other corrections for multiple testing can be used when this problem is present (e.g., the Bonferroni procedure). The main difficulty with repeated-measures analysis of variance is accommodating missing observations.

C. Other Techniques and Missing Data
Other techniques for analyzing longitudinal quality-of-life data include growth curve models and the construction of summary measures. Growth curve modeling generally involves fitting polynomials to the longitudinal data for an individual, with the analysis focusing on the fitted polynomials (19). The method of summary measures similarly collapses the multiple observations for an individual into a single outcome. Examples include the mean of the observations or the area under the curve of the plotted quality-of-life scores over time. An advantage of these methods is that missing data can be accommodated in a variety of ways. Alternatively, more recently developed methods for missing data can be applied when using a multivariate or repeated-measures model (e.g., imputation techniques). We refer to Fairclough (20) for an excellent review of these methods. An additional valuable reference in this area is the proceedings volume of the Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues (21). Finally, Bonetti et al. (22) recently developed a method-of-moments estimation procedure for evaluating quality-of-life data in the presence of nonignorable missing observations.

D. Example
Hürny et al. (23) evaluated the quality-of-life impact of various adjuvant cytotoxic therapy schedules in patients with breast cancer who were separately treated in two clinical trials conducted by the International Breast Cancer Study Group
(trials VI and VII). One of these trials, trial VI, consisted of 1475 patients who were pre- or perimenopausal at the time of study recruitment and were randomized in a 2 × 2 factorial design to receive three or six initial cycles of chemotherapy with or without later reintroduction of three single cycles of chemotherapy administered at 3-month intervals. The quality-of-life instrument used in the study measured five indicators of quality of life: physical well-being, mood, appetite, perceived adjustment/coping, and emotional well-being. The first four indicators were measured using LASA scales, and the fifth was based on a 28-item adjective checklist. Measurements were obtained at baseline, 2 months later, then every 3 months until 24 months, and again 1 and 6 months after disease recurrence.
The analysis conducted by Hürny et al. (23) focused on the data recorded during the first 18 months after randomization. By this time all patients had completed their treatment. Square-root transformations were used to stabilize the variance of all scales. Analysis of variance was used to compare the treatment groups at each time point, and a repeated-measures model was used to make comparisons over time. Both models included the patient's language/culture as a covariate. Patients who had missing data at a particular time point were excluded only from the analysis of that time point. The number of patients who furnished usable data at each time point varied from 1022 patients at baseline to 797 patients at the 18-month time point.
The results indicated that mean quality-of-life scores increased over time for all treatment groups. In particular, statistically significant (p < 0.05) differences were observed in the mean coping scores for the four groups at time points 6, 9, 12, and 15 months. At these time points, the group that received three cycles of initial therapy with no reintroduction therapy had the highest coping scores (indicating a better degree of coping).
At the 18-month time point, all four treatment groups had similar mean coping scores.
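The summary-measure idea mentioned earlier (the area under the curve of a patient's plotted quality-of-life scores over time) is easy to make concrete. The sketch below applies the trapezoidal rule; its handling of missed assessments (dropping them, which amounts to linear interpolation across the gap) is one convention among the several discussed above, chosen purely for illustration.

```python
def qol_auc(times, scores):
    """Summary measure: area under one patient's quality-of-life curve.

    Trapezoidal rule over the observed assessments.  Missing scores (None)
    are dropped, which implicitly interpolates linearly across the gap --
    an illustrative convention, not a recommendation.
    """
    observed = [(t, s) for t, s in zip(times, scores) if s is not None]
    area = 0.0
    for (t0, s0), (t1, s1) in zip(observed, observed[1:]):
        area += 0.5 * (s0 + s1) * (t1 - t0)  # trapezoid between visits
    return area

# Assessments at 0, 3, 6, and 9 months; the 6-month visit was missed:
print(qol_auc([0, 3, 6, 9], [60, 70, None, 80]))  # 195.0 + 450.0 = 645.0
```

A per-patient summary like this can then be compared across treatment groups with standard two-sample methods, at the cost of discarding the time profile.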
IV. ANALYSIS OF QUALITY-ADJUSTED SURVIVAL

A. Introduction

Quality-of-life-adjusted survival time is a complementary outcome that is increasingly being used in cancer clinical research. It represents a patient's survival time weighted by the quality of life experienced, where the weightings are based on utility values. Because utility is measured on the unit interval (0,1), quality-adjusted survival time is in the same time units as overall survival. This allows comparisons of treatments that differ in their quality-of-life effects and their effects on survival using a metric that simultaneously accounts for both of these differences.
The Quality-adjusted Time Without Symptoms of disease or Toxicity of treatment (Q-TWiST) method (24,25) is one technique for evaluating quality-adjusted survival in clinical trials. Q-TWiST compares treatments by computing the time spent in a series of clinical health states that may impact a patient's quality of life. Each health state is then weighted by a utility value, and the Q-TWiST outcome is defined by the sum of the weighted health state durations. The three steps involved in a Q-TWiST analysis are described briefly below, followed by an illustrative example.

B. Step 1: Define Clinical Health States
The first step in the analysis is to define quality-of-life-oriented health states that are relevant for the disease setting and the treatments being studied. The health states should reflect changes in clinical status that may be associated with changes in quality of life (e.g., treatment-related adverse events, disease progression, late sequelae), and they should be progressive. That is, patients must move through the health states in order, although health states may be skipped. Formally, we may define a k-state model with health states S1, . . . , Sk, where the only possible transitions are from Si to Sj, 1 ≤ i ≤ j ≤ k. For example, in the adjuvant chemotherapy setting, health states may be defined as follows: S1 = time with treatment-related toxicity, S2 = time without toxicity and without disease progression, and S3 = time with disease recurrence until death.

C. Step 2: Partition Overall Survival
The second step is to estimate the mean health state durations using the clinical trial data. This is accomplished by partitioning the overall survival time into the health states for each treatment group separately. The Kaplan-Meier method can be used for this purpose; however, in practice, censoring precludes one from estimating the entire survival curve. In this case, the partitioning is done up to a restriction time L. A common choice for L is the median follow-up duration.
For each patient, the clinical trial data are used to define the exiting time for each health state. For the k-state model, let ti denote the exiting time (measured from study entry or randomization time) from health state Si, i = 1, . . . , k. If a state Sj is skipped, then tj = tj−1. If S1 is skipped, then t1 = 0. The exiting time from state Sk will be the time of death. For example, in the adjuvant chemotherapy setting, the health state exiting times may be defined as follows: t1 = time from randomization until the end of treatment-related toxicity, disease progression, or death, whichever occurs first; t2 = time from randomization until disease progression or death, whichever occurs first; and t3 = time from randomization until death. If any of the exiting times are censored, then the exiting times for all subsequent states will be similarly censored.
Let Ki(·) denote the Kaplan-Meier estimate corresponding to ti, i = 1, . . . , k. Then the mean health state duration, restricted to L, for health state S1 is
τ̂1 = ∫0^L K1(u) du

and the mean health state duration for health state Si (2 ≤ i ≤ k) is

τ̂i = ∫0^L [Ki(u) − Ki−1(u)] du
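Because each Ki is a step function, the restricted-mean integrals above reduce to sums of rectangle areas. A minimal sketch, assuming the Kaplan-Meier estimate is supplied as sorted drop times together with the survival probability just after each drop:

```python
def restricted_mean(event_times, surv_after, horizon):
    """Area under a Kaplan-Meier step function from 0 to `horizon`.

    `event_times` are the distinct drop times in increasing order, and
    `surv_after[i]` is the estimated survival probability just after
    event_times[i]; survival is 1.0 before the first drop.  Because the
    estimator is piecewise constant, the restricted-mean integral in the
    text is evaluated exactly as a sum of rectangles.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(event_times, surv_after):
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)    # rectangle up to this drop
        prev_t, prev_s = t, s
    area += prev_s * (horizon - prev_t)  # final rectangle up to the horizon
    return area

# Curve drops to 0.8 at t = 2 and to 0.5 at t = 5; restrict to L = 6:
print(round(restricted_mean([2, 5], [0.8, 0.5], 6.0), 6))  # 2.0 + 2.4 + 0.5 = 4.9
```

In this notation, τ̂1 is the area under K1, and for i ≥ 2 the mean duration τ̂i is the difference between the areas under Ki and Ki−1.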
This approach to the estimation of the health state durations provides consistent estimates, whereas averaging individual health state durations when censoring is present leads to biased results (26).

D. Step 3: Compare the Treatments

The third step is to compare the treatments using a weighted sum of the mean health state durations. For example, if u1, . . . , uk denote the utility coefficients for the respective health states S1, . . . , Sk, the Q-TWiST end point is given by

Q-TWiST = u1τ̂1 + u2τ̂2 + · · · + ukτ̂k
Note that if all of the utility coefficients equal unity, Q-TWiST is equivalent to the mean survival time restricted to L. Q-TWiST is calculated separately for each treatment group, and the treatment effects are obtained by subtracting the Q-TWiST estimates corresponding to two treatment groups (e.g., experimental drug group vs. control group). Standard errors for the treatment effects can be obtained using the bootstrap method (24) or recently derived closed-form estimators (27–29).
If data are available for estimating u1, . . . , uk, then these data can be incorporated directly into the analysis. When utility data are not available, the treatment comparison can be evaluated by computing the treatment effect for varying values of the utility weights in a sensitivity analysis. When two utility weights are unknown and two treatments are being compared, the treatment comparison can be plotted for all possible values of the unknown utility coefficients in a two-dimensional graph called a "threshold utility plot." Contour lines can be used to indicate the magnitude of the Q-TWiST treatment effect associated with different pairs of utility values. The contour line corresponding to a treatment effect of zero is called the "threshold line." The threshold line indicates all utility value pairs for which the treatment effects are equal in terms of Q-TWiST. Confidence bounds for the threshold line can also be plotted to define regions of utility coefficient values for which the treatment effect difference is statistically significant. For example, in the case where k = 3 and u2 = 1, the contour lines for the threshold utility plot are defined by
u3 = (c − u1∆1 − ∆2) / ∆3
where c represents the Q-TWiST treatment effect and ∆i is the treatment group difference for the duration of health state Si. Note that if c = 0, the above equation represents the threshold line. Contour lines can be obtained by setting appropriate values for c and plotting the resulting equations (e.g., c = −3, −2, −1, 0, 1, 2, 3 months).
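Putting the three steps together, the Q-TWiST end point and one contour evaluation can be sketched directly. For illustration, the restricted mean durations below are taken from the melanoma example of the next section (Table 2); with uTox = uRel = 0.5 the computed effect matches the 5.0-month difference reported in Table 3.

```python
def q_twist(mean_durations, utilities):
    """Q-TWiST: utility-weighted sum of restricted mean state durations."""
    return sum(u * d for u, d in zip(utilities, mean_durations))

# Restricted mean months (Tox, TWiST, Rel) from Table 2:
INTERFERON = (5.8, 33.1, 10.4)
OBSERVATION = (0.0, 30.0, 12.4)

def effect(u_tox, u_rel, u_twist=1.0):
    """Q-TWiST treatment effect (interferon minus observation) for one
    choice of utility weights.  Evaluating this over a grid of (u_tox,
    u_rel) pairs traces the contours of the threshold utility plot; the
    pairs where it equals zero form the threshold line."""
    weights = (u_tox, u_twist, u_rel)
    return q_twist(INTERFERON, weights) - q_twist(OBSERVATION, weights)

print(round(effect(0.5, 0.5), 2))  # 5.0, the first row of Table 3
```

Solving effect = c for u_rel in terms of u_tox reproduces the contour equation in the text.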
Example
The Eastern Cooperative Oncology Group (ECOG) clinical trial EST1684 compared high-dose interferon alfa-2b therapy versus clinical observation for the adjuvant treatment of high-risk resected malignant melanoma in 280 patients (30,31). The health states defined for the Q-TWiST analysis were as follows: Tox ⫽ all time with severe or life-threatening side effects of high-dose interferon; TWiST ⫽ all time without severe or life-threatening treatment toxicity and without symptoms of disease relapse; and Rel ⫽ all time following disease relapse. These health states reflect the major clinical changes in quality of life that are important for evaluating the impact of high-dose interferon. Figure 1 shows the partitioning of overall survival into the health states according to treatment group based on the product limit method and restricted to the median follow-up interval of 84 months. For each graph, the area beneath the overall survival curve (OS) is partitioned into the health states Tox, TWiST, and Rel. This was accomplished by plotting on the same graph as OS a survival curve for the duration of severe or life-threatening side effects of interferon (Tox) and a survival curve for the time until disease relapse or death (RFS). Table 2 shows the mean health state durations, the mean overall survival time (OS), and the mean relapse-free survival time (RFS) within the first 84 months from randomization in the study. The results indicate that patients in the interferon group experienced more time in TWiST and less time in Rel as compared with the observation group; however, the interferon group also experienced more time with severe or life-threatening toxicity. Table 3 shows the computation of Q-TWiST for two possible selections of the utility weights uTox and uRel. The utility weight for TWiST in this analysis was assumed to be unity because TWiST represents a state of best possible quality of life. 
Note that mean OS is equivalent to mean Q-TWiST when uTox = uRel = 1 and that mean RFS is equivalent to mean Q-TWiST when uTox = 1 and uRel = 0.
Figure 2 illustrates the threshold plot for the treatment comparison. Note that in this case, the interferon group experienced more quality-adjusted time than the control group regardless of the utility values used. Therefore, the threshold line does not appear on the graph. However, the contour lines indicate that
Figure 1 Partitioned survival plots for the ECOG clinical trial comparing (A) clinical observation (i.e., no therapy) and (B) interferon for patients with malignant melanoma. Each plot illustrates the overall survival curve (OS), the relapse-free survival curve (RFS), and a curve representing the duration of Toxicity (Tox). The area between the OS and RFS curves represents the duration of the relapse health state (Rel), and the area between the Tox and RFS curves represents time without symptoms of relapse or toxicity (TWiST). (From Ref. 31.)
Table 2  Mean Time in Months for the Components of Q-TWiST Restricted to 84 Months of Median Follow-Up in the ECOG Trial EST 1684

Outcome*   Observation   Interferon Alfa-2b   Difference†   95% CI†         p (two-sided)†
Tox             0.0             5.8               5.8        5.0 to 6.7       <0.001
TWiST          30.0            33.1               3.1       −4.8 to 11.0      0.4
Rel            12.4            10.4              −2.0       −6.2 to 2.3       0.4
OS             42.4            49.3               7.0       −0.6 to 14.5      0.07
RFS            30.0            38.9               8.9        0.8 to 17.0      0.03

* Tox, time with severe or life-threatening side effects of treatment; TWiST, time without severe or life-threatening side effects of treatment and without symptoms of disease relapse; Rel, time following disease relapse until death; OS, overall survival time; RFS, relapse-free survival time.
† Treatment difference corresponds to interferon minus observation and is given with a 95% confidence interval (CI) and a two-sided p value based on a Z-test.
Source: Ref. 31.
the Q-TWiST benefit for interferon ranges from 2 to 8 months depending on the selection of utility weight values. In addition, the upper 95% confidence band for the threshold line appears as a dashed line. Values of the utility weights above the dashed line correspond to a significant (p < 0.05) benefit for interferon in terms of Q-TWiST, and values below the dashed line indicate utility coefficient
Table 3  Mean Q-TWiST in Months Within 84 Months of Median Follow-Up in the ECOG Trial for Arbitrary Sets of Utility Weight Values

uTox*   uRel*   Observation   Interferon Alfa-2b   Difference†   95% CI†          p (two-sided)†
0.5     0.5        36.2             41.2               5.0       −2.4 to 12.5       0.2
0.9     0.4        34.9             42.5               7.6        0.0 to 15.1       0.05

* uTox, the utility weight associated with the severe or life-threatening side effects of treatment (Tox); uRel, the utility weight associated with disease relapse. Each utility weight is measured on a scale from 0 = "as bad as death" to 1 = "as good as perfect health."
† Treatment difference corresponds to interferon minus observation and is given with a 95% confidence interval (CI) and a two-sided p value based on a Z-test.
Source: Ref. 31.
Figure 2 Threshold utility analysis for the ECOG clinical trial. The graph illustrates the treatment comparison in terms of Q-TWiST for all possible values of the utility weights for toxicity and relapse. The parallel dotted lines represent contours for the treatment effect in terms of Q-TWiST (interferon minus observation) as the utility weights vary between 0 and 1. The positive numbers on the contour lines indicate that mean Q-TWiST for the interferon group is greater than for the observation group for all possible pairs of utility weights between 0 and 1. For utility value pairs above the heavy dashed line in the upper left corner of the plot, the Q-TWiST treatment effect is statistically significant (i.e., p < 0.05). (From Ref. 31.)
value pairs for which the Q-TWiST comparison favored the interferon group but did not reach statistical significance.

F. Further Developments
Quality-of-life-adjusted survival in clinical trials is an area of active methodological research. Parametric (32) and semiparametric (33) regression models have been developed for quality-adjusted survival. In addition, methods have been developed for forecasting treatment effects (34) and performing meta-analyses (35–37). A number of recently published papers hold promise for an increasing array of statistical tools. In particular, Glasziou et al. (38) describe methods for combining longitudinal quality-of-life data with survival data using an integration-based approach. Zhao and Tsiatis (27) present a consistent estimator for the
distribution of quality-adjusted survival time and, in a second paper (28), describe techniques for estimating mean quality-adjusted lifetime based on their consistent estimator. Ongoing research focuses on developing statistical tools for the analysis of utility data in conjunction with quality-adjusted survival analysis.
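The Q-TWiST comparisons discussed above reduce to a utility-weighted sum of mean health-state durations, Q-TWiST = uTox × TOX + TWiST + uRel × REL, with TWiST (time without symptoms of disease and toxicity) always weighted 1. The sketch below illustrates this calculation and a threshold utility grid of the kind shown in Figure 2; the function names and state durations are illustrative, not data from the trial.

```python
def q_twist(tox, twist, rel, u_tox, u_rel):
    """Quality-adjusted survival as a utility-weighted sum of mean
    health-state durations: TOX (time with toxicity), TWiST (time without
    symptoms or toxicity, always weighted 1), REL (time after relapse)."""
    return u_tox * tox + twist + u_rel * rel

def threshold_grid(states_a, states_b, step=0.25):
    """Treatment difference in Q-TWiST (arm A minus arm B) over a grid of
    (u_tox, u_rel) utility pairs, as in a threshold utility analysis."""
    diffs = {}
    n = int(round(1 / step))
    for i in range(n + 1):
        for j in range(n + 1):
            u_tox, u_rel = i * step, j * step
            diffs[(u_tox, u_rel)] = (q_twist(*states_a, u_tox, u_rel)
                                     - q_twist(*states_b, u_tox, u_rel))
    return diffs

# Hypothetical mean state durations (TOX, TWiST, REL) in months for two arms:
grid = threshold_grid((5.0, 30.0, 10.0), (1.0, 24.0, 14.0))
# Here the difference is positive for every utility pair, so arm A is
# favored regardless of how a patient values time with toxicity or relapse.
```

A full analysis would use censored-data estimates of the mean state durations and attach standard errors to the differences; this sketch shows only the utility-weighting arithmetic.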
V. COST-EFFECTIVENESS AND COST–UTILITY ANALYSIS
A. Overview
The economic evaluation of different treatment options is critically important for including medical costs in policy decisions and clinical practice decisions. Cost-effectiveness analysis and cost–utility analysis are two common methods for comparing the economic cost of treatments in a standardized way. Both methods result in ratios that relate the incremental cost of a therapy to the incremental clinical benefit of that therapy. The denominator in cost-effectiveness analysis is any measure of benefit, whereas in cost–utility analysis it is expressed in quality-adjusted time. For example, to evaluate the cost-effectiveness of drug A relative to drug B, both the additional cost of using drug A and the incremental clinical benefit of drug A relative to drug B must be quantified. In particular, if cA and cB denote the total costs of using drug A and drug B, respectively, and if µA and µB denote the respective mean survival times for the two drugs, the cost-effectiveness ratio is given by (cA − cB)/(µA − µB). This ratio represents the additional cost of using drug A per unit of lifetime saved relative to drug B. The cost–utility ratio is obtained by replacing the denominator in the cost-effectiveness ratio with the quality-adjusted treatment effect, where the quality adjustment is made using utility weights. For example, the treatment effect can be expressed in terms of Q-TWiST in a cost–utility ratio. The main challenge for clinical trials is to measure the utilization of resources related to treatment. The gathering of cost information at the time of the trial is not necessary, because utilization data can be assigned appropriate costs at a later time. Moreover, in practice it is very difficult to measure the real costs of medical treatment; costs can change drastically over the course of a study, and costs may vary widely from one geographic region to another. As a result, it is much more practical to collect data on resource utilization and later assign costs.
At a minimum, data should be collected for all physician visits, drug treatments, diagnostic tests, and hospitalizations. The clinical centers involved in the trial can provide some of these data for utilization at those sites; however, it is important that data also be collected directly from the patients, because many will receive care at other sites. The use of a patient diary or
interval questionnaire, along with periodic telephone calls from a study coordinator, will facilitate accurate collection of the data. The statistical issues related to economic analysis are too complex to be covered fully in this chapter. The main difficulty is that cost-effectiveness ratios are derived from varied data sources, and some of the parameters (e.g., cost) may be point estimates not based on sampled data. The result is that the ratios have an unknown distribution, making statistical inferences impossible. A common solution to this problem is to use sensitivity analysis to examine how the cost-effectiveness ratio varies as the parameters of the analysis vary (e.g., cost parameters can be varied within certain ranges). Nevertheless, in some cases it is possible to make statistical inferences on cost-effectiveness and cost–utility ratios (39).
B. Example
Hillner et al. (40) performed an economic evaluation of interferon treatment for malignant melanoma based on the ECOG clinical trial EST1684. They estimated that the total cost of medical care within the first 7 years after diagnosis was $91,656 for interferon-treated patients and $76,580 for patients not treated with interferon. Therefore, the incremental cost was estimated at $15,076. These figures were discounted at an annual rate of 3%; discounting allows the future (decreased) value of money to be expressed in current dollars. Table 2 indicates that the increase in mean survival within the first 84 months associated with interferon was 7.0 months. The cost-effectiveness ratio is therefore given by $15,076/7.0 = $2152 per life month saved ($25,848 per life-year saved). Note that the results presented by Hillner et al. differ slightly from these calculations because of rounding and because Hillner et al. discounted the survival time (as well as costs) by 3% per year.
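The arithmetic of this example can be sketched in a few lines. The discounting function below implements the standard present-value formula; the per-arm mean survival values are placeholders chosen only so that their difference equals the reported 7.0 months (the published per-arm means are not given here).

```python
def present_value(costs_by_year, rate=0.03):
    """Discount a stream of yearly costs to current dollars: the cost
    incurred in year t is divided by (1 + rate)**t (year 0 undiscounted)."""
    return sum(c / (1 + rate) ** t for t, c in enumerate(costs_by_year))

def ce_ratio(cost_a, cost_b, benefit_a, benefit_b):
    """Incremental cost-effectiveness ratio: (cA - cB) / (muA - muB)."""
    return (cost_a - cost_b) / (benefit_a - benefit_b)

# Discounted 7-year cost totals from the example; the survival means (in
# months) are placeholders with the reported 7.0-month difference.
ratio = ce_ratio(91656.0, 76580.0, 44.0, 37.0)
# ratio = 15076 / 7.0, i.e. roughly $2150 per life month saved.
```

Replacing the survival denominator with a Q-TWiST difference turns the same calculation into a cost–utility ratio.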
VI. STUDY DESIGN IN CLINICAL TRIALS
As a gold-standard study design, we propose a cancer clinical trial including the following outcomes: (1) the usual clinical end points, such as progression-free survival and overall survival; (2) the usual assessment of toxicity/adverse event frequency and grade; (3) measurements of the timing and duration of all toxicities and adverse events; (4) longitudinal assessment of quality of life using a general instrument and a disease-specific instrument; (5) a procedure for estimating patient utility or preference; and (6) a procedure for estimating health care utilization. By including all these components in a clinical trial, it becomes possible to address the clinical benefits of a new therapy, its impact on quality of life, and whether it is cost-effective. Of course, few studies will include all these components, due to constrained resources. In addition, clinical trials that began
in the 1980s or early 1990s will generally not include components for measuring quality of life or utility, because methods for assessment were not as well established as they are today. To fill this potential gap, researchers use other methods to address pressing clinical issues. One approach is to launch a smaller study that collects quality-of-life and utility data from a group of patients. Such a study can be longitudinal or cross-sectional. The advantage of the cross-sectional design is that the study can be completed more quickly; the disadvantage is that longitudinal effects on quality of life cannot be estimated. Inferior ancillary study designs include those that use proxy data for subjective quality-of-life domains. Another approach is to retrospectively evaluate the duration of major health states that are thought to affect quality of life (e.g., toxicity, disease progression). By combining clinical outcome data with patient-level cycle-by-cycle toxicity data (both of which are typically collected in cancer clinical trials), it is often possible to obtain estimates of the durations of the health states. Utility weights can then be assigned to the health states, and this health-state utility model can be used to compare treatments in terms of quality-adjusted time. The utility weights can be estimated from a secondary cross-sectional study, or they may be left unspecified. In the latter case, the results of the analysis should be displayed for a wide variety of choices of the utility weight values and not for just one or two arbitrary selections. For many clinical trials currently being designed in which quality of life is an important end point, it is critical that quality-of-life components be prospectively incorporated. At a minimum, a disease-specific quality-of-life instrument should be administered longitudinally.
The timing of assessments should be designed to measure quality of life for the various clinical health states that a patient might experience both during and after therapy (e.g., treatment-related toxicity, disease progression). For randomized studies, a baseline assessment should take place before treatment randomization. In addition, patients should be asked to self-report troublesome adverse events and symptoms and their durations using a diary. These data could be used to validate the physician-reported adverse event data typically collected. The patient diary is particularly appealing from a quality-of-life perspective because a patient is likely to self-report adverse events that cause distress and therefore represent decrements in quality of life. As a result, patient-diary data are useful for estimating the duration of time spent with adverse events, an outcome necessary for a health-state utility model.
VII. CONCLUSION
In this chapter, we provide an overview of the basic components of quality-of-life research in cancer clinical trials. Unfortunately, in this short space we cannot fully cover all aspects of this topic; however, there are a number of excellent
references for further reading. In particular, the large volume edited by Spilker (41) is a thorough reference covering quality-of-life measurement, analysis, cross-cultural and cross-national issues, health policy issues, and pharmacoeconomics. This book is particularly well suited to the quality-of-life researcher involved with study design and analysis. Another more compact reference is the chapter by Gelber and Gelber (42), which reviews methods used in clinical research and provides more detail regarding statistical analysis methods. We also provided an example illustrating the use of quality-of-life-adjusted survival time (Q-TWiST) in cancer clinical research. The Q-TWiST analysis of the ECOG trial EST1684 improved the clinical usefulness of the information obtained from the clinical trial. Moreover, the evaluation illustrates the need to consider quality-adjusted survival comparisons in clinical research and to develop more practical methods for assessing patient preferences for incorporation in the decision-making process. The use of assessment tools and procedures similar to those described in this chapter is becoming increasingly important in cancer clinical research. At a minimum, future clinical trials should carefully collect data regarding toxicity grade and duration in addition to the usual clinical outcomes. The longitudinal use of a quality-of-life instrument is also strongly recommended, as is the tracking of individual health care costs over the course of a study. With these components in place, a meaningful evaluation can be made of treatments in terms of clinical outcome, quality of life, and cost.
ACKNOWLEDGMENTS
I thank Richard Gelber and Shari Gelber for helpful comments on this chapter. Supported in part by the American Cancer Society (RPG-90-013-08-PBP) and the National Cancer Institute (CA23108).
REFERENCES
1. Schumacher M, Olschewski M, Schulgen G. Assessment of quality of life in clinical trials. Stat Med 1991; 10:1915–1930.
2. Cox DR, Fitzpatrick R, Fletcher AE, Gore SM, Spiegelhalter DJ, Jones DJ. Quality of life assessment: can we keep it simple? J R Stat Soc A 1992; 155:353–393.
3. Gelber RD, Goldhirsch A, Hürny C, Bernhard J, Simes RJ. Quality of life in clinical trials of adjuvant therapies. J Nat Cancer Inst Monogr 1992; 11:127–135.
4. Neymark N, Kiebert W, Torfs K, et al. Methodological and statistical issues of quality of life and economic evaluation in cancer clinical trials: report of a workshop. Eur J Cancer 1998; 34:1317–1333.
5. Karnofsky DA, Abelmann WH, Craver LF, Burchenal JH. The use of nitrogen mustards in the palliative treatment of carcinoma. Cancer 1948; 1:634.
6. Yates JW, Chalmer B, McKegney FP. Evaluation of patients with advanced cancer using the Karnofsky performance status. Cancer 1980; 45:2220–2224.
7. Izsak FC, Medalie JH. Comprehensive follow-up of carcinoma patients. J Chron Dis 1971; 24:179–191.
8. Burge PS, Prankerd TAJ, Richards JDM, et al. Quality of survival in acute myeloid leukemia. Lancet 1975; 2:621–624.
9. Priestman TJ, Baum M. Evaluation of quality of life in patients receiving treatment for advanced breast cancer. Lancet 1976; 1:899–900.
10. Ware JE Jr. The SF-36 health survey. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 337–345.
11. Damiano AM. The sickness impact profile. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 347–354.
12. Derogatis LR, Derogatis MF. SCL-90-R and the BSI. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 323–335.
13. Bennett KJ, Torrance GW. Measuring health state preferences and utilities: rating scale, time trade-off and standard gamble techniques. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 253–265.
14. Farquhar PH. A survey of multiattribute utility theory and applications. TIMS Studies Mgmt Sci 1977; 6:59–89.
15. Kind P. The EuroQoL instrument: an index of health-related quality of life. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 191–201.
16. Feeny DH, Torrance GW, Furlong WJ. Health Utilities Index. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 239–252.
17. Weeks J, O'Leary J, Fairclough D, et al. The "Q-tility index": a new tool for assessing health-related quality of life and utilities in clinical trials and clinical practice. Proc ASCO 1994; 13:436.
18. Spitzer WO, Dobson AJ, Hall J, et al. Measuring the quality of life of cancer patients: a concise QL-index for use by physicians. J Chron Dis 1981; 34:585–597.
19. Zee B. Growth curve model analysis for quality of life data. Stat Med 1998; 17:757–766.
20. Fairclough DL. Methods of analysis for longitudinal studies of health-related quality of life. In: Staquet M, ed. Quality of Life Assessment in Clinical Trials. Oxford: Oxford University Press, 1998, pp. 227–247.
21. Bernhard J, Gelber RD, eds. Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues. Stat Med 1998; 17:511–796.
22. Bonetti M, Cole BF, Gelber RD. A method-of-moments estimation procedure for categorical quality-of-life data with non-ignorable missingness. J Am Stat Assoc 1999; 94:1025–1034.
23. Hürny C, Bernhard J, Coates AS, et al. Impact of adjuvant therapy on quality of life in women with node-positive breast cancer. Lancet 1996; 347:1279–1284.
24. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med 1990; 9:1259–1276.
25. Gelber RD, Cole BF, Gelber S, Goldhirsch A. The Q-TWiST method. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 437–444.
26. Gelber RD, Gelman RS, Goldhirsch A. A quality-of-life oriented endpoint for comparing therapies. Biometrics 1989; 45:781–795.
27. Zhao H, Tsiatis AA. A consistent estimator for the distribution of quality adjusted survival time. Biometrika 1997; 84:339–348.
28. Zhao H, Tsiatis AA. Estimating mean quality adjusted lifetime with censored data. Sankhya 2000; 62, Series B, Part 1:175–188.
29. Murray S, Cole BF. Variance and sample size calculations in quality-of-life adjusted survival analysis (Q-TWiST). Biometrics 2000; 56:173–182.
30. Kirkwood JM, Hunt Strawderman M, Ernstoff MS, et al. Interferon alpha-2b adjuvant therapy of high-risk resected cutaneous melanoma: the Eastern Cooperative Oncology Group Trial EST 1684. J Clin Oncol 1996; 14:7–17.
31. Cole BF, Gelber RD, Kirkwood JM, et al. Quality-of-life adjusted survival analysis of interferon alfa-2b adjuvant treatment of high-risk resected cutaneous melanoma: an Eastern Cooperative Oncology Group study. J Clin Oncol 1996; 14:2666–2673.
32. Cole BF, Gelber RD, Anderson KM. Parametric approaches to quality adjusted survival analysis. Biometrics 1994; 50:621–631.
33. Cole BF, Gelber RD, Goldhirsch A. Cox regression models for quality adjusted survival analysis. Stat Med 1993; 12:975–987.
34. Gelber RD, Goldhirsch A, Cole BF. Parametric extrapolation of survival estimates with applications to quality of life evaluation of treatments. Controlled Clin Trials 1993; 14:485–499.
35. Cole BF, Gelber RD, Goldhirsch A. A quality-adjusted survival meta-analysis of adjuvant chemotherapy for premenopausal breast cancer. Stat Med 1995; 14:1771–1784.
36. Gelber RD, Cole BF, Goldhirsch A, et al. Adjuvant chemotherapy for premenopausal breast cancer: a meta-analysis using quality-adjusted survival. Cancer J Sci Am 1995; 1:114–121.
37. Gelber RD, Cole BF, Goldhirsch A, et al. Adjuvant chemotherapy plus tamoxifen compared with tamoxifen alone for postmenopausal breast cancer: a meta-analysis using quality-adjusted survival. Lancet 1996; 347:1066–1071.
38. Glasziou P, Cole BF, Gelber RD, Hilden J, Simes RJ. Quality-adjusted survival analysis with repeated quality-of-life measures. Stat Med 1998; 17:1215–1229.
39. O'Brien BJ, Drummond MF. Statistical versus quantitative significance in the socioeconomic evaluation of medicines. PharmacoEconomics 1994; 5:389–398.
40. Hillner BE, Kirkwood JM, Atkins MB, Johnson ER, Smith TJ. Economic analysis of adjuvant interferon alfa-2b in high-risk melanoma based on projections from Eastern Cooperative Oncology Group 1684. J Clin Oncol 1997; 15:2351–2358.
41. Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996.
42. Gelber R, Gelber S. Quality-of-life assessment in clinical trials. In: Thall PF, ed. Recent Advances in Clinical Trial Design and Analysis. Norwell, Massachusetts: Kluwer Academic Publishers, pp. 225–246.
43. Levine MN, Guyatt GH, Gent M. Quality of life in stage II breast cancer: an instrument for clinical trials. J Clin Oncol 1988; 6:1789–1810.
44. Ganz PA, Schag CA, Lee JJ, et al. The CARES: a generic measure of health-related quality of life for patients with cancer. Qual Life Res 1992; 1:19–29.
45. Aaronson NK, Bullinger M, Ahmedzai S. A modular approach to quality-of-life assessment in cancer clinical trials. Recent Results Cancer Res 1988; 111:231–249.
46. Cella DF, Bonomi AE. The functional assessment of cancer therapy (FACT) and functional assessment of HIV infection (FAHI) quality of life measurement system. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 203–214.
47. Clinch JJ. The functional living index—cancer: ten years later. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven, 1996, pp. 215–225.
48. Hürny C, Bernhard J, Gelber RD, et al. Quality of life measures for patients receiving adjuvant therapy for breast cancer: an international trial. Eur J Cancer 1992; 28:118–124.
14
Health-Related Quality-of-Life Outcomes
Benny C. Zee
National Cancer Institute of Canada, Kingston, Ontario, Canada
David Osoba
Quality of Life Consulting, West Vancouver, British Columbia, Canada
I. INTRODUCTION
Health-related quality of life (HRQL) is now included as a major end point, in addition to traditional end points such as tumor response and survival, in many oncological clinical trials. When HRQL was first proposed as an outcome to be measured in clinical trials, there was controversy over how to define the term HRQL and the breadth of the constructs to be included. The construct "quality of life" can be very broad and can include such dimensions as the air we breathe, socioeconomic status, and job satisfaction. However, in health care the term HRQL is restricted to how illness or its treatment affects patients' ability to function and their symptom burden (1). Thus, HRQL is not simply performance status, toxicity ratings, tumor measurements, or laboratory values. Researchers have agreed that, at a minimum, it should include the physical, social, and emotional dimensions of life (2–4). Thus, HRQL in oncology is a multidimensional construct consisting of subjective indicators of health. These indicators should include physical concerns (symptoms and pain), functional ability (activity), family well-being, emotional well-being, spirituality, treatment satisfaction, future orientation, sexuality, social functioning, and occupational functioning. Depending on the purpose of the study, some quality-of-life assessments are targeted to obtain information for decision making in health care policy, whereas others are designed to assess the impact of both symptoms of disease and toxicity of treatment in a phase III randomized trial setting. Information obtained from quality-of-life assessments in phase II testing of new chemotherapeutic agents can also guide quality-of-life evaluations planned in future large randomized studies (5). However, not all dimensions are relevant to a particular study. The general approach in medical studies is to use a generic, or a general (condition-specific), questionnaire that assesses physical, emotional, and social functioning and then, depending on the population being studied and the specific conditions of the study, to add disease-specific or situation-specific modules or checklists (6,7). There are several reasons for applying HRQL assessments in oncology (8): (1) studies in which symptom control is the primary outcome, (2) cancers with a poor prognosis, (3) treatment arms with similar survivals, (4) supportive care interventions, (5) identification of the full range of side effects and impact of treatment, and (6) use of quality of life as a predictor of response and survival. From the regulatory point of view, the U.S. Food and Drug Administration recommends that beneficial effects on HRQL and/or survival be the basis of approval of new anticancer drugs. Therefore, when a treatment does not have an impact on survival, demonstration of a favorable effect on HRQL is more important than most other traditional measures of efficacy (9). To perform a clinical trial using HRQL as an end point, it is important to develop a protocol with clearly defined objectives and definitions of end points (7).
For example, many antineoplastic therapies give rise to a number of distressing side effects with a presumed deterioration in HRQL. However, these side effects are usually reversible when the treatments have been completed, and the patients may have increased survival and improved HRQL (10). When the objective of a study is to evaluate longer term effects, the duration and frequency of the HRQL measurements should be clearly stated in the protocol, so that the design of the study and the follow-up schedule are developed to ensure good compliance (HRQL questionnaire completion rates). Eligibility requirements for the HRQL component should be given. The choice of instruments is important; they should be discussed in the protocol, and the psychometric properties of the generic instruments should be referenced. When a supplemental disease-specific checklist is needed, the motivation for adding these items should be addressed. Frequency of measurements and logistics of data collection should be considered to minimize potential problems with missing data. In the following sections, we discuss some of these issues in clinical trials incorporating HRQL as an end point.
II. DESIGN CONSIDERATIONS
During the design stage of a phase III randomized trial, one of the most important questions is whether the HRQL end point is unambiguously stated and addresses the study question fully. For example, a study of dose-intensive chemotherapy versus standard alternating-dose chemotherapy for patients with extensive-stage small cell lung cancer may be expected to produce more treatment-related symptoms during the course of chemotherapy. When the primary objective is to evaluate whether patients survive longer with a dose-intensive regimen, then survival time is an obvious end point. However, extensive-stage small cell lung cancer patients have a rather short median survival of about 1 year, and the emergence and magnitude of a clinically significant benefit may be evident around 4–6 months. It is also likely that dose-intensive treatment produces more side effects and may have an impact on patients' HRQL. The improvement in median survival time gained by using highly toxic dose-intensive chemotherapy needs to be justified by the HRQL outcomes to make an informed treatment decision (11). Another example where HRQL is an important end point is the situation where the primary objective of the study, such as survival, shows equivalence in treatment effect but one treatment arm shows less toxicity or improved overall HRQL. Such an outcome may indicate that one treatment is preferable to another, even when it does not extend survival.
A. Randomization and Blinding
The basic design issues when HRQL is included as an end point are similar to those of most conventional randomized controlled trials. However, there are some additional practical issues that need to be considered in phase III trials with HRQL components. For example, a study may permit patient entry to the treatment protocol without participation in the HRQL component.
It is important to consider the proportion of nonparticipants in the HRQL components, and appropriate stratification may be required so that the purpose of the randomization process to reduce bias in treatment selection is preserved. In a double-blind randomized trial, blinding of the allocated treatment for the person administering the HRQL questionnaire(s) is as important as the blinding of the allocated treatment for the primary care personnel. For many nonblinded cancer trials, a standardized procedure must be used to reduce bias from the administration procedure. This is critical when a study is comparing two different treatment modalities or treatment schedules; the HRQL assessments should be patterned on the treatment schedules with an identical administration procedure (12).
B. Eligibility
The effect of a disproportionate number of nonparticipants in HRQL components between two treatment arms may introduce bias. The randomization and stratification procedures should account for this problem. However, it is difficult to know how much the generalizability of the results to the study population, as defined in the eligibility criteria of the protocol, will be affected. Does the patient population include all those who consent to take part in the treatment aspect, or only those who consent to participate in the HRQL component? Exclusion of those who do not take part in HRQL assessments may limit the generalizability of the treatment outcome (10). For example, a study that includes only patients who consent to take part in HRQL assessments may involve only well-adjusted patients. Information about HRQL for this subset of patients may affect the generalizability of the study results. On the other hand, allowing physicians to treat the quality-of-life assessments as optional and then stratifying by consent to take part in HRQL assessments may reduce the problem, but the HRQL results of the study must then be interpreted with caution, since they represent a slightly different patient population than that of the survival end point. A strategy for collecting HRQL data that is appropriate to all patients should be considered: culturally appropriate instruments, translation into other languages for various ethnic groups, assistance for patients with visual or auditory impairment, instruments of appropriate length to minimize noncompliance due to deteriorating physical or emotional illness, and the use of proxies who know the patients well are all important considerations.
C. Psychometric Properties
The choice of instruments with well-established psychometric properties is important for proper interpretation of the HRQL results. There are many good review articles on the choice of quality-of-life instruments. Moinpour (13) discussed the operational definition of HRQL in clinical trials with respect to health care and the treatment of disease, i.e., how physical, mental, and social well-being are affected by medical intervention. Once an instrument has been chosen for a study, previous work on the psychometric properties of the instrument should be referenced in the protocol. The chosen instrument should have demonstrated good internal consistency, high test–retest reliability, and interrater reliability (if rater-completed HRQL assessment is used). Internal consistency is usually measured by Cronbach's alpha coefficient (14), which is a measure of the extent to which different items represent the same domain content. Test–retest reliability reflects the reproducibility of scores at two different time points between which HRQL is not expected to change. Interrater reliability estimates the degree of agreement among
different raters but is not applicable in most patient self-administered HRQL instruments. The concept of validity is more complex, since there is no gold standard for HRQL assessment, and it is primarily the repeated use of an instrument in many trials over time that establishes validity. However, an instrument that demonstrates appropriate content validity (including adequate relevant items in a specific domain) and convergent and divergent validity with other instruments is sufficient to justify its use in most clinical trials. Sometimes a demonstration of criterion validity is possible if the criterion exists at the same time as the measure; e.g., if we know that there is a significant difference between two groups of patients with respect to a criterion (e.g., performance status), criterion validity may be established if the quality-of-life domains correctly show a difference between the groups. However, criterion validity is not easy to establish for HRQL instruments. Finally, construct validity assesses whether the HRQL instrument relates to other observed variables or other constructs in a way that is consistent with a theoretical framework. Factor analysis, multitrait scaling analysis, and structural equation modeling are common methods for assessing construct validity (15).
D. Global Quality of Life
An important aspect of choosing HRQL instruments is to select one that contains relevant items for a specific purpose. For example, if we are to study overall health, global quality of life, and general well-being, a questionnaire with a separate global quality-of-life domain is an important consideration. Questionnaires that aggregate individual items into a total score do not necessarily represent global quality of life or general well-being. This is important because the weighting of various domains in an instrument varies with individual valuation.
As Mor and Guadagnoli (16) pointed out, interpreting an aggregate score computed over a number of existing domains as equaling global quality of life may provide misleading results when the instrument is dominated by certain domains. For example, the Functional Living Index—Cancer (FLIC) and the Cancer Rehabilitation Evaluation System (CARES) were used to assess the quality of life of patients treated with either modified radical mastectomy or segmental mastectomy (17). A global score for FLIC was obtained from 22 visual analogue scales, including concerns related to pain, stress, and the ability to work and do household chores. The CARES instrument (18) has 93 to 132 items with a five-point Likert scale, and a summary score is obtained from five higher-order factors: physical, psychosocial, medical interaction, marital, and sexual domains. The Profile of Mood States, with 65 items in a five-point Likert scale response format, was also used; the average score represents total mood disturbance. Both the CARES summary scores and the FLIC global score showed no significant difference between the two groups, but patients with segmental mastectomy had significantly more mood disturbance at 1 month than did those with total mastectomy. More importantly, at 1 year of follow-up, patients with segmental mastectomy had significantly fewer problems with clothing and body image, as indicated by these domains in CARES. These are clear indications of differences in HRQL. However, the conclusion based on aggregate scores indicates no significant difference. One explanation for these results may be that the chosen questionnaires did not include factors that patients believed to be important. Kemeny et al. (19) pointed out that identifying the proper domains to study, such as body image, appearance, and femininity, is important. Another likely explanation is that the scores for all the domains in a questionnaire may not change in the same direction, so improvement in some domains may be canceled out by worsening in others, giving no change in the aggregate score. It is therefore very difficult to develop an instrument that provides an aggregate score applicable across different studies, and statements about overall or global quality of life based on aggregate scores may be dangerous. Although some investigators have suggested weighting certain domain scores (e.g., toward physical functioning or emotional functioning), the assigned weights are derived from observers' opinions and not necessarily those of patients (2). In contrast to the method of aggregate scoring, one or two questions asking directly for an assessment of overall health/global quality of life provide an overall assessment incorporating an individual's own values. Once this information is obtained, the association of global quality of life with the physical, emotional, and social domains and other symptoms can be determined. This provides further information on the impact of specific dimensions on patients' global quality of life.
Symptom Checklists and HRQL
One of the design questions in HRQL studies is whether to add symptom checklists to validated general HRQL questionnaires (20). Checklists (and modules) are developed to be disease and situation specific, that is, to capture additional information about the effects of a particular cancer or its treatment in a given clinical trial. They can provide information not captured by general (core) questionnaires. There is no theoretical reason why symptom checklists cannot be completed at the same time as general HRQL questionnaires. One difficulty is that the timing of symptom checklists, intended only to capture the side effects of treatment, may differ from that of the HRQL assessments. The effects of symptoms that are expected to peak within a day or two (e.g., nausea and vomiting after chemotherapy) may not be captured by a questionnaire with only a 7-day time frame, whereas the same questionnaire may be adequate for symptoms that occur over longer periods after treatment (e.g., fatigue). Symptom checklists, in the form of a self-administered daily diary, are often used to capture
symptoms induced by treatment. However, the diary has to be short and simple so that the data collection burden on both patients and the data center is minimized while the chance of capturing crucial symptomatic data is maximized. Appropriately scheduled quality-of-life assessments have been shown to capture critical information when compared with a symptom checklist. For example, Pater et al. (21) showed that, for either timing of assessment (day 4 or day 8 after chemotherapy) and either recall period (3 days or 7 days), the EORTC QLQ-C30 corresponded closely with the occurrence of nausea and vomiting as captured by a daily diary using a visual analogue scale. Thus, patients were capable of assimilating the symptomatic experience over the corresponding recall period. It is also possible to add HRQL assessments (including checklists) at other times to capture the effects of maximum toxicity (12).
III. DATA COLLECTION

It is widely recognized that there is disagreement between physicians' ratings and patients' self-assessed HRQL scores (22). It has been generally accepted that HRQL should be assessed by patient self-administered questionnaires because of the subjective nature of the content. The data management approach used for obtaining HRQL data has to be considered carefully to minimize the chance of missing data. The following are a few general principles of data management practice.

A. HRQL Questionnaires

The format and layout of the questionnaire should be spacious and clear, using a larger than usual font for instructions and questions, to avoid having double answers for one item and missing answers in the adjacent item. A clear information sheet about the HRQL assessment and instructions on how to fill out the questionnaires are useful. Printing should always be single-sided, similar to a case report form, even when the questionnaires are in the form of a booklet (patients sometimes miss an entire page of questions if these are placed on the reverse side). It is also important that the instrument be brief, to reduce patient burden as much as possible, and that the questions be easy to understand, without technical terminology or jargon. When a symptom checklist is added to the core questionnaire, the use and wording of a conditional lead-in question must be carefully considered. For example, in a symptom control study of antiemetics, the investigators wished to know how nausea and vomiting affected patients' functioning and HRQL. A lead-in instruction was used in a nausea and vomiting checklist to identify patients who had experienced nausea and vomiting in the past 7 days, and they were then asked to assess the impact of that nausea and
vomiting on their functioning. It was discovered that some patients who had reported no experience of nausea and vomiting in their daily diary nevertheless answered the checklist questions, whereas some patients who should have answered did not. However, the core questionnaire in the same study had very good completion rates. It was also noted that the nausea and vomiting domain based on two items from the core HRQL questionnaire had a high correlation with the nausea and vomiting scores from the daily diary (21).

B. Compliance
Some earlier studies reported difficulties in collecting HRQL data (23). However, compliance (i.e., HRQL questionnaire completion rates) in later studies has improved. For example, the National Surgical Adjuvant Breast and Bowel Project Breast Cancer Prevention Trial had a very high compliance rate for the first 12 months of assessments on the placebo arm (23). General awareness of the value of HRQL data has been raised, and more attention has been given to achieving high compliance (24–29). In a workshop on ‘‘Missing Data in Quality of Life Research in Cancer Clinical Trials’’ (30) held in 1997, most research groups reported baseline compliance rates above 90%. The compliance rate while patients were receiving treatment was in the 80% range and, after patients completed treatment, in the 70% range. There are a number of common characteristics of studies with good compliance. Often there has been a training session on HRQL data collection for the clinical research personnel both at data collection centers and at central data processing and analysis centers. Ongoing interaction between the clinical research personnel in the clinics and the study coordinator of the central office is useful for understanding the problems a particular center might be facing, so that possible solutions may be attempted. A clear set of instructions or a full manual should be developed for the clinical research personnel in the centers to follow. If the instructions are specific to a particular trial, they should be included within the protocol. The schedule of HRQL assessments should coincide with the schedule of regular follow-up visits as much as possible to facilitate data collection. It is generally better to have patients complete the HRQL questionnaires at a standardized time during the visit, while they are still in the clinic, rather than have them take questionnaires home and mail them back at specific time points.
Standardized telephone interviews may also be used, particularly if HRQL assessment is required at times not corresponding to clinic visits. In the data center, it may be difficult to monitor HRQL compliance when the schedules and frequency of collection vary across many trials, and special computer programs are needed to monitor compliance in specific studies. For a general computer program for monitoring compliance, a simple question on the case report form asking whether an HRQL assessment was to be performed at the current visit is extremely helpful. The study coordinator can check this item against
the appropriate protocol schedule, and a query can be sent to the center if the expected HRQL questionnaires were not completed. For each HRQL assessment, a front page (cover sheet) is required from the center, containing information on the date and location of completion and whether assistance was required at the time of completion. If a quality-of-life assessment was scheduled but the patient did not complete a questionnaire, the reasons for noncompliance are documented (31). Computerization of these items makes the compliance rate calculation simpler than assigning windows around scheduled visits, where delays in treatment or visits may complicate the calculation.
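The monitoring scheme described above (an expected-assessment flag on the case report form, checked against completed questionnaires) can be sketched as follows; the record layout and field names are hypothetical:

```python
def compliance_rates(records):
    """records: (patient_id, visit, expected, completed) tuples, where
    'expected' mirrors the case-report-form item asking whether an HRQL
    assessment was due at this visit. Returns the completion rate per
    visit, counting only assessments that were actually due."""
    due, done = {}, {}
    for _, visit, expected, completed in records:
        if expected:
            due[visit] = due.get(visit, 0) + 1
            if completed:
                done[visit] = done.get(visit, 0) + 1
    return {v: done.get(v, 0) / n for v, n in due.items()}

records = [
    ("p1", "baseline", True, True), ("p2", "baseline", True, True),
    ("p1", "cycle2", True, True),   ("p2", "cycle2", True, False),
    ("p1", "follow-up", False, False),  # not due: excluded from denominator
]
print(compliance_rates(records))  # → {'baseline': 1.0, 'cycle2': 0.5}
```

Keying the denominator to the expected-assessment flag, rather than to visit windows, is exactly what makes the calculation robust to treatment delays.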
IV. SAMPLE SIZE

One of the critical aspects of the design of a phase III clinical trial with HRQL end points is the sample size calculation. The study must be adequately powered to support a firm conclusion about the HRQL hypothesis. Consider, for example, a comparative study between two treatment arms in which the primary objective is to compare the study treatment with the control treatment with respect to effects on HRQL. In this setting, a hypothesis-testing approach is used for the quality-of-life data. Since quality of life is a multidimensional construct, the hypothesis of interest should be clearly defined. The study may have a specific question in mind; for example, a study may be designed to evaluate the impact on cognitive functioning in patients with metastatic breast cancer receiving high-dose chemotherapy versus a treatment that may improve cognitive functioning. The hypothesis for such a study focuses specifically on cognitive functioning. Other studies may focus on different dimensions; for example, in symptom control studies, the primary end point may be a specific symptom, such as nausea and vomiting induced by moderately or highly emetogenic chemotherapy, or fatigue. In general, if the study question is not about a specific functioning domain or symptom, the global quality-of-life score can be used as the end point for the sample size estimation. To determine the sample size required for a trial, several quantities must be considered: (1) the significance level at which we wish to perform our hypothesis test (usually 5%) and whether it is to be one-tailed or two-tailed; (2) the smallest clinically meaningful difference; (3) the power of the statistical test, which is the probability of rejecting the null hypothesis if the real effect is at least the smallest clinically meaningful difference specified (usually 80% or 90%); and (4) the variability of the outcome measure, estimated from previous studies.
For a study with HRQL as an end point, the most difficult quantity to justify in the sample size calculation is the smallest clinically meaningful difference. This is because the meaning of a change in HRQL depends on the perspective of the potential user of the information (31). The societal perspective considers the degree of importance at a population level, where small differences may be important because of the large number of individuals who may be affected. At the institutional level, one may consider a degree of change to be large enough if it leads to the adoption of certain health care policies. In a randomized clinical trial, the magnitude of change may be considered clinically worthwhile when it is large enough to cause most clinicians to consider using a specified study intervention in a given situation (e.g., discontinuation or alteration of treatment). However, when HRQL is the end point, none of the above definitions seems sensitive enough to provide a practical criterion for the smallest clinically meaningful difference. In fact, since HRQL is a subjective end point, the change perceived by patients as meaningful should be considered important. Osoba et al. (31) used a Subjective Significance Questionnaire, asking about physical, emotional, and social functioning and global quality of life, to assess the degree of change in EORTC QLQ-C30 scores that was perceptible to patients with breast cancer and small cell lung cancer. The results showed that patients with breast cancer and with small cell lung cancer perceived changes of 6.9 and 10.7 (on a 0- to 100-point scale) in global quality of life, respectively. From this information, a sample size formula can be derived for a randomized trial with two treatment arms, using global quality of life as the primary end point. The sample size formula reviewed by Lachin (32) can be used:

n = 2(Z1−α/2 + Z1−β)²σ²/d²
The standard deviation for global quality of life can be estimated from previous studies. For example, if the standard deviation is 20, a sample size of 63 patients per arm is required to detect a difference of 10 points using a two-sided 5% level test with 80% power; to detect a difference of 7 points, 128 patients per arm would be needed. Another consideration in sample size estimation is the multidimensional aspect of HRQL. If global quality of life is not the only domain of interest, inclusion of other functioning domains or symptoms may inflate the type I error of the hypothesis tests. It is advisable to control for the increase in type I error using a Bonferroni-type adjustment for multiple end points when estimating the required sample size. Here we are adopting the view that global quality of life should be treated as a separate domain and that this information cannot be assimilated from the other existing domains and symptoms. Otherwise, a global test statistic for multiple end points, such as those proposed by O'Brien (33) or Tandon (34), may be more appropriate. A discussion of matching the clinical questions for multiple end points to appropriate statistical procedures can be found in O'Brien and Geller (35).
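As a check on these figures, Lachin's formula can be evaluated directly; `n_per_arm` is a hypothetical helper name, with z-quantiles taken from the standard normal distribution:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, sigma, alpha=0.05, power=0.80):
    """Two-arm sample size n = 2(Z_{1-alpha/2} + Z_{1-beta})^2 sigma^2 / d^2,
    rounded up to a whole number of patients per arm."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * sigma ** 2 / d ** 2)

print(n_per_arm(10, 20))  # → 63
print(n_per_arm(7, 20))   # → 129 (the 128 quoted above reflects rounding rather than ceiling)
```

For k co-primary HRQL end points, the Bonferroni-type adjustment mentioned above amounts to passing alpha = 0.05/k.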
In many clinical trials, the HRQL assessments are scheduled to be done repeatedly over time. If the analysis is to be based on summary statistics of individuals' repeated measurements, the method of Dawson (36) can be used. Examples of summary statistics include the average of the postrandomization HRQL assessments, the last observation, the slope of the HRQL scores over time, and the area under the curve. For a vector Yij of (K + 1) observed repeated HRQL measurements for treatment i and subject j, the observed measures are represented by a random-effects model:

Yij = µi + Xbij + eij

where µi = (µi0, µi1, . . . , µiK) contains the group means for the (K + 1) repeated measures, bij is a q × 1 vector of random subject effects distributed as N(0, D(q×q)), and eij is random error distributed as N(0, σe²I(K+1)). Therefore, Yijk is distributed as N(µik, xk′Dxk + σe²), where xk′ is the (k + 1)th row of X. The summary statistic can be written as a linear combination of the observations, Sij = c′Yij. For example, the last observation is obtained by defining c′ = (0, . . . , 0, 1) and the slope by c′ = (−K/2, −K/2 + 1, . . . , K/2 − 1, K/2). The average of the summary statistics for treatment group i, S̄i = Σj Sij/ni, is distributed as N(c′µi, [c′XDX′c + c′cσe²]/ni). With sample sizes n1 and n2 = mn1 in the two treatment groups, the required size of the first group is

n1 = (c′XDX′c + c′cσe²)(1 + 1/m)(Z1−α/2 + Z1−β)² / [c′(µ1 − µ2)]²
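A minimal sketch of this formula, assuming a random-intercept model (q = 1) and a last-observation summary; all numerical inputs are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def n1_summary_stat(c, X, D, sigma_e2, delta, m=1.0, alpha=0.05, power=0.80):
    """n1 = (c'XDX'c + c'c*sigma_e^2)(1 + 1/m)(Z_{1-a/2} + Z_{1-b})^2 / delta^2,
    with delta = c'(mu1 - mu2). Plain-list matrices; illustrative only."""
    q = len(D)
    cX = [sum(ck * row[j] for ck, row in zip(c, X)) for j in range(q)]  # c'X
    var_s = sum(cX[i] * D[i][j] * cX[j] for i in range(q) for j in range(q))
    var_s += sigma_e2 * sum(ck * ck for ck in c)  # + sigma_e^2 * c'c
    z = NormalDist().inv_cdf
    return ceil(var_s * (1 + 1 / m) * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2)

# hypothetical example: 4 assessments, between-subject variance 25,
# residual variance 100, and a 5-point difference in the last assessment
n1 = n1_summary_stat(c=[0, 0, 0, 1], X=[[1]] * 4, D=[[25]], sigma_e2=100, delta=5)
print(n1)  # → 79
```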
When there are missing data and the data can be separated into strata according to missing-data patterns, the above sample size formula can be modified (36). For an analysis based on multivariate analysis of variance for repeated measures, the sample size can be determined using the method proposed by Rochon (37). Let yij = [yij1, . . . , yijT] denote the set of repeated measures of quality of life for the jth individual in the ith treatment group and assume that each yij ~ MVN(µi, Σ), a multivariate normal distribution in which µi′ = [µi1, . . . , µiT] represents the mean quality-of-life scores for treatment group i = 1, 2 at times t = 1 to T, and the variance–covariance matrix Σ can be written as a function of a vector of anticipated standard deviations σ and the correlation matrix P, i.e., Σ = Dσ P Dσ, where Dσ is a diagonal matrix whose elements consist of σ′ = [σ1, . . . , σT] and P is a function of ρ. Consider the hypothesis H0: Hδ = 0 vs. Ha: Hδ ≠ 0, where δ = µ1 − µ2 and H is an (h × T) matrix, of full row rank, imposing h linearly independent restrictions on δ. To estimate the sample size, we need to assume either a compound symmetry (PCS) or autoregressive (PAR) correlation structure.
The quantities required to be specified include the vector of anticipated standard deviations σ, a value for ρ, the anticipated difference δ, and the matrix H. In a repeated-measures approach, we need to formulate the hypothesis of interest. In particular, a test of whether the treatment difference is consistent from evaluation to evaluation is often of interest, i.e., a test of the treatment × time interaction, where H = [−1, IT−1], −1 is a column vector of −1s, and IT−1 is the identity matrix of dimension T − 1. The sample size can be determined using Hotelling's T² distribution, with h and n1 + n2 − 2 degrees of freedom, and the noncentrality parameter. Since there is no closed-form expression, an iterative procedure is required. Instead of the multivariate analysis of variance (MANOVA) approach, the method of Liu and Liang (38) may be used. They proposed a modification of the sample size and power formula of Self and Mauritsen (39) to handle correlated observations, based on the generalized estimating equation method and a quasi-score test statistic. This method requires specification of the regression model for the marginal mean and of the parameters under both the null and alternative hypotheses.
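The compound-symmetry and first-order autoregressive choices for Σ = Dσ P Dσ described above can be sketched as a small helper (names are my own):

```python
def sigma_matrix(sd, rho, structure="cs"):
    """Build Sigma = D_sigma P D_sigma for T repeated measures, with P either
    compound-symmetric (constant rho between all pairs of time points) or
    first-order autoregressive (rho^|t - s|, so correlation decays with lag)."""
    T = len(sd)
    def corr(t, s):
        if t == s:
            return 1.0
        return rho if structure == "cs" else rho ** abs(t - s)
    return [[sd[t] * sd[s] * corr(t, s) for s in range(T)] for t in range(T)]

print(sigma_matrix([2.0, 2.0, 2.0], 0.5, "ar"))
# first row: [4.0, 2.0, 1.0] — correlation decays with lag under AR(1)
```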
V. ANALYSIS
The analysis of HRQL data presents a number of challenges to statisticians. Some of these problems have been discussed in previous sections; they include (1) dealing with multiple end points due to the multidimensional nature of HRQL data, (2) the administration and collection of questionnaires, (3) the careful implementation of data management procedures and monitoring, (4) longitudinal data collection that may occur at irregular intervals because of treatment delays and missed follow-up visits, (5) dealing with missing data, and (6) the difficulty of interpretation when missing data are nonrandom.
A. General Methods of Analysis for HRQL Data
The analysis of HRQL data at a single cross-sectional time point is the simplest way of looking at the data. Simple, cross-sectional, descriptive statistics at each time interval can be used to describe the overall trends in the two treatment groups. Comparisons using this method do not take into account within-patient variation and may inflate the type I error. Tests between the mean values of the two treatment groups at different time intervals may also hide significant individual changes, and they do not take into account different proportions of missing data in the two arms. An alternative method is to reduce the longitudinal data to a summary statistic before performing a between-arms comparison, as suggested by Dawson and Lagakos (40). This method is simple but may overlook important changes in HRQL over time, and it suffers from problems similar to those of the cross-sectional analysis.
Ganz et al. (17) suggested assessing the general change of HRQL over time and called this approach ‘‘pattern analysis.’’ For each individual, the differences between consecutive quality-of-life scores for a specific domain were calculated, and the sign of each difference was recorded (positive, zero, or negative). Based on the signs of the differences, the pattern of change was classified into one of three categories: consistent increase over time, no consistent pattern of increase or decrease, or consistent decrease over time. The patterns were compared between the two treatment groups using ordered polytomous logistic regression, in which the patterns were treated as ordered categorical variables; the treatment group comparison adjusted for confounding variables was then carried out. However, the definition of consistently increasing or decreasing patterns is rather arbitrary, and this approach provides only a rough estimate of the general changes in quality of life. Other methods include univariate repeated measures and MANOVA, as described in Zee and Pater (41), for continuous repeated measures. For categorical data, Koch et al. (42) suggested a weighted least-squares approach, as described in Grizzle et al. (43), to determine a test statistic for testing the hypothesis. A review of other methods by Davis (44) includes the generalized estimating equations of Liang and Zeger (45) and the two-stage method of Stram et al. (46). Since the number of repeated HRQL measures in clinical trials may be quite large, especially in studies with long-term follow-up, a repeated-measures analysis may require a large number of parameters in the model. All these methods require a transformation of the time axis into discrete time intervals to set up the repeated-measures analysis. A different approach was proposed by Zwinderman (47) using a logistic latent trait model.
However, when the number of time points is large, the number of parameters will increase and the estimation procedure for a conditional logistic model will again become complicated.

B. Methods of Analysis with Missing Data

For HRQL data with missing observations, the summary statistics for repeated measurements approach can be used. Dawson and Lagakos (40) proposed a stratified test for repeated-measures data with missing values for comparing two treatment groups in a randomized trial. The type I error of the test is properly maintained when the distribution of missing-data patterns is the same in the two treatment groups and the distribution of the outcome conditional on the pattern of missing data is also the same. Suppose that the quality-of-life measurements Yijk were measured at xk (k = 0, 1, . . . , K) time points for subject j ( j = 1, 2, . . . , ni) within group i (i = 1, 2). The null hypothesis of interest is that the outcome vectors are distributed equally in the two groups:

H0: F1(y) = F2(y) vs. H1: F1(y) ≠ F2(y)
Suppose that the summary statistic S is some scalar function of Y. Under H0, the n1 + n2 vectors Y11, . . . , Y1n1, Y21, . . . , Y2n2 are independent and identically distributed, as are the corresponding values of S, say S11, . . . , S1n1, S21, . . . , S2n2. Consequently, any distribution-free test comparing the summary statistics in the two groups will have the correct size under H0. Shih and Quan (48) further assessed situations in which some of the sufficient conditions proposed by Dawson and Lagakos are not met, for example, when data are not missing completely at random or when the test statistics across strata indicate a treatment by missing-data-pattern interaction. Curran et al. (49) gave a general review of tests of missing-data assumptions for repeated HRQL data. They focused particularly on the method proposed by Ridout (50), which uses logistic regression to model the missing-data pattern and examine whether the assumption of missing completely at random (MCAR) is valid. Park and Davis (51) provided a test of the missing-data mechanism for repeated categorical data. Lipsitz et al. (52) extended the generalized estimating equations method proposed by Liang and Zeger (45) to incorporate repeated multinomial responses; however, this class of models requires the MCAR assumption. Little (53) gave a comprehensive review of the modeling approaches for various drop-out mechanisms in repeated-measures analysis and proposed the pattern-mixture model for testing various assumptions. Schluchter (54) proposed a joint mixed-effect and survival model to estimate the association between the repeated measures and the drop-out mechanism. Fairclough et al. (55) applied these methods to quality-of-life examples. Another model-based approach, proposed by Zee (56), is based on growth curve models stratified by health states, which are likely to have different missing-data mechanisms.
For example, the health states can be separated into the period when patients are receiving protocol treatment and the period after patients are off protocol treatment. This model requires only the missing at random assumption to be satisfied within each of the health states. Methods such as the pattern-mixture model (53) can be adopted in the growth-curve method to assess the missing-data mechanism, both for the whole data set and within health states, to evaluate the appropriateness of the assumption.

C. Other Methods
The incorporation of HRQL end points in a study may be motivated by a highly toxic treatment, where the clinical benefit with respect to the relative gain in survival is discounted by the duration of toxicity experienced from treatment and the duration of progression due to disease. There is a trade-off between the duration of quality lifetime gained by the study treatment and the duration of toxicity due to treatment and relapse due to disease. Goldhirsch et al. (57) proposed the Quality-adjusted Time Without Symptoms and Toxicity (Q-TWiST) model to assess the benefit of adjuvant treatment in breast cancer. This method can be used to incorporate both survival and HRQL data in a single comparison between treatment groups. However, a utility measure is needed to summarize the trade-off for individual patients, which is traditionally determined using standard gamble or time trade-off techniques (58). These methods are rather difficult to implement because they are conceptually complex tools. Other alternative methods include the Health Utility Index (59) and the cancer-specific Q-tility Index (60). For the purpose of comparison in a randomized controlled trial, it is not necessary to obtain patient-derived utility scores. Instead, a sensitivity analysis, called a threshold utility analysis, can be performed to assess the relative benefits between treatment groups for all combinations of utility scores. Approximate 95% confidence intervals for the differences in Q-TWiST between the two arms for each set of utility values can be determined by the bootstrap method (61).

D. Conclusion

In summary, two different approaches are available to measure HRQL. The first is to measure multidimensional domains of HRQL and follow patients longitudinally. This approach provides a clear picture of the experience of patients with respect to the HRQL domains being measured. The average HRQL scores can be assessed using summary statistics or longitudinal data analysis methods for comparing the effects of treatments. Model-based methods such as mixed-effects models or growth-curve models are appropriate. Missing-data patterns should be examined to identify potential violations of assumptions. Another approach is to define a totally new end point that incorporates overall survival and the various health states of patients during and after treatment, such as the Q-TWiST end point.
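With purely hypothetical mean durations and a grid of utility values, a Q-TWiST comparison with a threshold utility analysis might look like this sketch:

```python
def q_twist(tox, twist, rel, u_tox, u_rel):
    """Quality-adjusted time without symptoms and toxicity: time in TWiST
    counts fully, while time with toxicity and time after relapse are
    down-weighted by utilities in [0, 1]. Durations in months (illustrative)."""
    return u_tox * tox + twist + u_rel * rel

# hypothetical mean durations (months) for two arms
arm_a = dict(tox=6.0, twist=30.0, rel=12.0)
arm_b = dict(tox=2.0, twist=26.0, rel=14.0)

# threshold utility analysis: the between-arm difference over a utility grid
for u_tox in (0.0, 0.5, 1.0):
    for u_rel in (0.0, 0.5, 1.0):
        diff = (q_twist(**arm_a, u_tox=u_tox, u_rel=u_rel)
                - q_twist(**arm_b, u_tox=u_tox, u_rel=u_rel))
        print(f"u_tox={u_tox:.1f} u_rel={u_rel:.1f} diff={diff:+.1f}")
```

In this toy configuration the difference remains positive over the whole utility grid, so the qualitative conclusion would not depend on the utility values chosen; a real analysis would add bootstrap confidence bands for each grid point (61).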
A major limitation of this method is that it does not address changes in psychosocial and emotional functioning. One may try to expand the utility coefficient and the definition of toxic side effects to cover psychosocial and emotional functioning. However, the advantage and usefulness of the Q-TWiST model are compromised as the summary measure becomes more complicated to interpret.

REFERENCES

1. Osoba D. Measuring the effect of cancer on quality of life. In: Osoba D, ed. Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp. 26–40.
2. Ware JE Jr. Measuring functioning, well-being, and other generic health concepts. In: Osoba D, ed. Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp. 7–23.
3. Kaasa S. Measurement of quality of life in clinical trials. Oncology 1992; 49:288–294.
4. Bruner D. In search of the ‘‘quality’’ in quality-of-life research. Int J Radiat Oncol Biol Phys 1995; 31:191–192.
5. Seidman A, Portenoy R, Yao T, Lapore J, Mont E, Kortmansky J, Onetto N, Ren L, Grechko J, Beltangady M, Usakewicz J, Souhrada M, Houston C, McCabe M, Salvaggio M, Thaler H, Norton L. Quality of life in phase II trials: a study of methodology and predictive value in patients with advanced breast cancer treated with paclitaxel and granulocyte colony-stimulating factor. J Natl Cancer Inst 1995; 87:316–322.
6. Aaronson NK, Bullinger M, Ahmedzai S. A modular approach to quality-of-life assessment in clinical trials. In: Scheurlen H, Kay R, Baum M, eds. Cancer Clinical Trials: A Critical Appraisal. Recent Results in Cancer Research. Berlin: Springer, 1988; 111:231–249.
7. Osoba D. Guidelines for measuring health-related quality of life in clinical trials. In: Staquet MJ, Hays RD, Fayers PM, eds. Quality of Life Assessment in Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1998, pp. 19–35.
8. Osoba D, Aaronson NK, Till JE. A practical guide for selecting quality-of-life measures in clinical trials and practice. In: Osoba D, ed. Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp. 89–104.
9. Beitz J, Gnecco C, Justice R. Quality of life endpoints in cancer clinical trials: the Food and Drug Administration perspective. J Natl Cancer Inst Monogr 1996; 20:7–9.
10. Gotay C, Korn E, McCabe M, Moore T, Cheson B. Quality of life assessment in cancer treatment protocols: research issues in protocol development. J Natl Cancer Inst 1992; 84:575–579.
11. Murray N, Livingston R, Shepherd F, et al. A randomised study of CODE plus thoracic irradiation versus alternating CAV/EP for extensive stage small cell lung cancer (ESCLC) (abstr). Proc ASCO 1997; 16:456a.
12. Osoba D. Rationale for the timing of health-related quality-of-life assessments in oncological palliative therapy. Cancer Treat Rev 1996; 22 (suppl A):69–73.
13. Moinpour C. Cost of quality of life research in the Southwest Oncology Group. J Natl Cancer Inst Monogr 1996; 20:11–16.
14. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16:297–334.
15. Bollen K. Structural Equations with Latent Variables. New York: John Wiley & Sons, 1989.
16. Mor V, Guadagnoli E. Quality of life measurement: a psychometric tower of Babel. J Clin Epidemiol 1988; 41:1055–1058.
17. Ganz P, Schag A, Lee J, Polinsky M, Tan S. Breast conservation versus mastectomy: is there a difference in psychological adjustment or quality of life in the year after surgery? Cancer 1992; 69:1729–1738.
18. Schag C, Heinrich R, Aadland R, Ganz P. Assessing problems of cancer patients: psychometric properties of the Cancer Inventory of Problem Situations. Health Psychol 1990; 9:83–102.
19. Kemeny M, Wellisch D, Schain W. Psychosocial outcome in a randomized surgical trial for treatment of primary breast cancer. Cancer 1988; 62:1231–1237.
20. Osoba D. Self-rating symptom checklists: a simple method for recording and evaluating symptom control in oncology. Cancer Treat Rev 1993; 19 (suppl A):43–51.
21. Pater J, Osoba D, Zee B, Lofter W, Gore M, Dempsey E, Palmer M, Chin C. Effects of altering the time of administration and the time frame of quality of life assessments in clinical trials: an example of using the EORTC QLQ-C30 in a large antiemetic trial. Qual Life Res 1998; 7:273–278.
22. Slevin ML, Plant H, Lynch D, et al. Who should measure quality of life, the doctor or the patient? Br J Cancer 1988; 57:109–112.
23. Ganz P, Haskell C, Figlin R, Soto N, Siau J. Estimating the quality of life in a clinical trial of patients with metastatic lung cancer using the Karnofsky performance status and the Functional Living Index—Cancer. Cancer 1988; 61:849–856.
24. Osoba D. The Quality of Life Committee of the Clinical Trials Group of the National Cancer Institute of Canada: organization and functions. Qual Life Res 1992; 1:211–218.
25. Osoba D, Dancey J, Zee B, Myles J, Pater J. Health-related quality of life studies of the National Cancer Institute of Canada Clinical Trials Group. J Natl Cancer Inst Monogr 1996; 20:107–111.
26. Sadura A, Pater J, Osoba D, et al. Quality-of-life assessment: patient compliance with questionnaire completion. J Natl Cancer Inst 1992; 84:1023–1026.
27. Coates A, Gebski V. Quality of life studies of the Australian New Zealand Breast Cancer Trials Group: approaches to missing data. Stat Med 1998; 17:533–540.
28. Hahn E, Webster K, Cella D, Fairclough D. Missing data in quality of life research in Eastern Cooperative Oncology Group (ECOG) clinical trials: problems and solutions. Stat Med 1998; 17:547–559.
29. Osoba D, Zee B. Completion rates in health-related quality of life assessment: approach of the National Cancer Institute of Canada Clinical Trials Group. Stat Med 1998; 17:603–612.
30. Bernhard J, Gelber RD, eds. Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues. Stat Med 1998; 17:511–651.
31. Osoba D, Rodrigues G, Myles J, Zee B, Pater J. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol 1998; 16:139–144.
32. Lachin J. Introduction to sample size determination and power analysis for clinical trials. Control Clin Trials 1981; 2:93–113.
33. O'Brien P. Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40:1079–1087.
34. Tandon P. Application of global statistics in analysing quality of life data. Stat Med 1990; 9:819–827.
35. O'Brien P, Geller N. Interpreting tests for efficacy in clinical trials with multiple endpoints. Control Clin Trials 1997; 18:222–227.
266
Zee and Osoba
36. Dawson JD. Sample size calculation based on slopes and other summary statistics. Biometrics 1998; 54:323–330. 37. Rochon J. Sample size calculations for two-group repeated-measures experiments. Biometrics 1991; 47:1383–1398. 38. Liu G, Liang K. Sample size calculation for studies with correlated observations. Biometrics 1997; 53:937–947. 39. Self S, Mauritsen R. Powers/sample size calculations for generalized linear models. Biometrics 1988; 44:79–46. 40. Dawson JD, Lagakos SW. Size and power of two-sample tests of repeated measures data. Biometrics 1993; 49:1022–1032. 41. Zee B, Pater J. Statistical analysis of trials assessing quality of life. In: Osoba D, eds. The Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp. 113–123. 42. Koch G, Landis R, Freeman J, Freeman D, Lehnen R. A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 1977; 33:133–158. 43. Grizzle J, Starmer F, Koch G. Analysis of categorical data by linear models. Biometrics 1969; 25:489–504. 44. Davis C. Semi-parametric and non-parametric methods for the analysis of repeated measurements with applications to clinical trials. Stat Med 1991; 10:1959–1980. 45. Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73:13–22. 46. Stram D, Wei L, Ware J. Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. J Am Stat Assoc 1988; 83:631–637. 47. Zwinderman A. The measurement of change of quality of life in clinical trials. Stat Med 1990; 9:931–942. 48. Shih J, Quan H. Stratified testing for treatment effects with missing data. Biometrics 1998; 54:782–787. 49. Curran D, Molenberghs G, Fayers P, Machin D. Incomplete quality of life data in randomized trials: missing forms. Stat Med 1998; 17:697–709. 50. Ridout M. Testing for random dropouts in repeated measurement data. Biometrics 1991; 47:1617–1618. 51. Park T, Davis CS. 
A test of missing data mechanism for repeated categorical data. Biometrics 1993; 49:631–638. 52. Lipsitz S, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Stat Med 1994; 13:1149–1163. 53. Little R. Modelling the drop-out mechanism in repeated-measures studies. J Am Stat Assoc 1995; 90:1112–1121. 54. Schluchter MD. Methods for the analysis of informatively censored longitudinal data. Stat Med 1992; 11:1861–1870. 55. Fairclough D, Peterson H, Cella D, Bonomi P. Comparison of several model-based methods for analysing incomplete quality of life data in clinical trials. Stat Med 1998; 17:781–796. 56. Zee B. Growth curve model analysis for quality of life data. Stat Med 1998; 17: 757–766.
Health-related Quality of Life Outcomes
267
57. Goldhirsh A, Gelber R, Simes J, Glasziou P, Coates A. Cost and benefit of adjuvant therapy in breast cancer: a quality adjusted survival analysis. J Clin Oncol 1989; 7: 36–44. 58. Weeks J. Taking quality of life into account in health economic analysis. J Natl Cancer Inst Monogr 1996; 20:23–27. 59. Torrance G, Furlong W, Feeny D, Boyle M. Multi-attribute preference functions: health utilities index. Pharmacoeconomics 1995; 7:503–520. 60. Weeks J, O’Leary J, Fairclough D, Paltiel D, Weinstein M. The Q-tility index: a new tool for assessing the health-related quality of life and utilities in clinical trials and clinical practices. Proc ASCO 1994; 13:436. 61. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med 1990; 9:1259–1276.
15
Statistical Analysis of Quality of Life

Andrea B. Troxel
Joseph L. Mailman School of Public Health, Columbia University, New York, New York

Carol McMillen Moinpour
Fred Hutchinson Cancer Research Center, Seattle, Washington
I. INTRODUCTION
In randomized treatment trials for cancer or other chronic diseases, the primary reason for assessing quality of life (QOL) is to broaden the scope of treatment evaluation. We sometimes characterize QOL and cost outcomes as alternative or complementary because they add to the information provided by traditional clinical trial end points such as survival, disease-free survival, tumor response, and toxicity. The challenge lies in combining this information in the treatment evaluation context.

There is fairly strong consensus that, at least in the phase III setting, QOL should be measured comprehensively (1–3). Although a total or summary score is desirable for the QOL measure, it is equally important to have separate measures of basic domains of functioning (e.g., physical, emotional, social, role functioning) and symptom status. Symptoms specific to the cancer site and/or the treatments under evaluation are also usually included to monitor for toxicities and to gauge the palliative effect of the treatment on disease-related symptoms. In some trials, investigators may study additional areas such as financial concerns, spirituality, family well-being, and satisfaction with care. Specific components of QOL not only inform the interpretation of treatment effects but can also identify areas in which cancer survivors need assistance in their return to daily functioning. Data on specific areas of functioning can also help suggest ways to improve cancer treatments; Sugarbaker et al. (4) conducted a study in which the radiotherapy regimen was modified as a result of QOL data.

QOL data should be generated by patients in a systematic, standardized fashion. Interviews can be used to obtain these data, but self-administered questionnaires are usually more practical in the multi-institution setting of clinical trials. Selected questionnaires must be reliable and valid (5) and sensitive to change over time (6,7); good measurement properties, along with appropriate item content, ensure a more accurate picture of the patient's QOL. Table 1 describes four QOL questionnaires that meet these measurement criteria and are frequently used in cancer clinical trials. The FACT and EORTC QOL questionnaires have a core section and symptom modules specific to the disease or type of treatment. Others, like the SF-36 and CARES-SF, can be used with any cancer site but may require supplementation with a separate symptom measure to address concerns about prominent symptoms and side effects.

Table 1  Examples of Comprehensive QOL Questionnaires

SF-36 (Refs. 8–14)
  QOL dimensions (# items): Physical functioning (10), Role-physical (4), Bodily pain (2), General health (5), Vitality (4), Social functioning (2), Role-emotional (3), Mental health (5), Health transition (1)
  Scores: Physical component summary score (35); Mental component summary score (35); no total score

EORTC QLQ-C30, Version 2 (Refs. 15–21)
  Core (30): Physical functioning (5), Role functioning (2), Cognitive functioning (2), Emotional functioning (4), Social functioning (2); symptom scales: Fatigue (3), Pain (2), Nausea/vomiting (2); single-item symptoms (5); Financial impact (1); Global HRQOL (2)
  Modules, cancer-specific: lung (13), breast (23), head and neck (35), esophageal (24), colorectal (38); others in development: bladder, body image, leukemia, myeloma, ophthalmic, ovarian, pancreatic, prostate; treatment-specific: high-dose chemotherapy, palliative care
  Scores: no total score; module scores

Functional Assessment of Cancer Therapy (FACT),† Versions 3 and 4 (Refs. 22–26)
  Core:* Physical (7), Functional (7), Social (7), Emotional (6)
  Additional concerns (modules), cancer-specific: breast (9), bladder (12), brain (19), colorectal (9), CNS (12), cervix (15), esophageal (17), head and neck (11), hepatobiliary (18), lung (9), ovarian (12), prostate (12); treatment-specific: BMT (23), biologic response modifiers (13), neurotoxicity from systemic chemotherapy (11), taxane (16); symptom-specific: anorexia/cachexia (12), diarrhea (11), fatigue (13), anemia/fatigue (20), endocrine (18), fecal incontinence (12), urinary incontinence (11); other modules/scales: spirituality (12), FAHI (47), FAMS (59), FANLT (26)
  Scores: FACT-G (core items); TOI; module; total score

Cancer Rehabilitation Evaluation System—Short Form (CARES-SF) (Refs. 27–30)
  QOL dimensions (# items): Physical (10), Psychosocial (17), Medical interaction (4), Marital (6), Sexual (3), miscellaneous items contributing to overall score (19)
  Scores: total score; 5 subscale scores

EORTC, European Organization for Research and Treatment of Cancer.
† The Functional Assessment of Chronic Illness Therapy (FACIT) measurement system represents the current version (#4) of the FACT questionnaires.
* Relationship-with-doctor items no longer included in FACT scores; TOI, Trial Outcome Index, composed of physical, functional, and cancer-specific module; CNS, central nervous system; BMT, bone marrow transplantation; FAHI, Functional Assessment of HIV Infection; FAMS, Functional Assessment of Multiple Sclerosis; FANLT, Functional Assessment of Non-Life-Threatening Conditions.

When QOL research is conducted in many and often widely differing institutions, quality control is critical to ensure clean, complete data. The first step is to make completion of a baseline QOL assessment a trial eligibility criterion. Enforcement of the same requirements for both clinical and QOL follow-up data communicates the importance of the QOL data for the trial. Ongoing training of clinical research associates is mandatory because QOL data are still not considered routine and there is a fair degree of turnover in data management staff. Centralized monitoring of both submission rates and the quality of data submitted must also be considered. This effort requires substantial staff time and therefore cannot be done without adequate resources.
Even with the best quality control procedures, submission rates for follow-up QOL questionnaires can be less than desirable, particularly in the advanced-stage disease setting. It is precisely in the treatment of advanced disease, however, that QOL data provide important outcomes, that is, they can document the extent of palliation achieved by an experimental treatment. An increasing focus on QOL studies in the context of clinical trials has resulted in accumulation of QOL data in a wide variety of conditions and patient populations. Although this is a rich source of information, data analysis is often complicated by problems of missing information. Patients sometimes fail to complete QOL assessments because of negative events they experience, such as treatment toxicities, disease progression, or death. Because not all patients are subject to these missing observations at the same rate, especially when treatment failure or survival rates differ between arms, the set of complete observations is not always representative of the total group; analyses using only complete observations are therefore potentially biased.
Several methods have been developed to address this problem. They range in emphasis from the data collection stage, where attention focuses on obtaining the missing values, to the analysis stage, where the goal is adjustment to properly account for the missing values. We first describe methods that are appropriate for complete or nearly complete data and then move on to techniques for incomplete data sets.
II. ANALYSIS OPTIONS FOR "COMPLETE" DATA SETS

A. Longitudinal Methods

In general, the methods described below are applicable both to repeated measures on an individual over time and to measurements of different scales or scores on a given individual at the same point in time. Many studies of course use both designs, asking patients to fill out questionnaires comprising several subscales at repeated intervals over the course of the study.

1. Repeated-Measures ANOVA or MANOVA

Analysis of variance (ANOVA) and covariance (ANCOVA) or their multivariate versions (MANOVA) represent a very popular class of models for continuous QOL data. The models rely on an assumption of normally distributed data; the total variation in the data is then attributed to between-group and within-group portions. The groups may be defined by treatment arm, prognosis, grade, and so on; hypotheses concerning differences among groups or between combinations of groups may be tested.

2. Generalized Linear Models

A second general class of models is the likelihood-based generalized linear model (GLM) (31). This framework is attractive since it accommodates a whole class of data rather than being restricted to continuous Gaussian measurements; it allows a unified treatment of measurements of different types, with specification of an appropriate link function that determines the form of the mean and variance. Estimation proceeds by solving the likelihood score equations, usually using iteratively reweighted least-squares or Newton-Raphson algorithms. GLMs can be fit with generalized linear interactive modeling (GLIM) (32) or with S-plus, using the glm() function (33). If missing data are random (see below), unbiased estimates will be obtained. Generalized linear mixed models are a useful extension, allowing for the inclusion of random effects in the GLM framework. SAS macros are available to fit these models.
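Since estimation here amounts to solving the likelihood score equations by Newton-Raphson or iteratively reweighted least squares, a minimal sketch may help. The Python fragment below is our own toy illustration, not GLIM or S-plus code; the data and the function name are invented. It fits a logistic-link GLM for a binary outcome, such as an indicator of acceptable QOL, on a single baseline covariate:

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit P(Y = 1) = expit(b0 + b1*x) by Newton-Raphson on the
    likelihood score equations (equivalent to IRLS for this link)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0   # score vector and information matrix
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)             # binomial variance function
            u0 += yi - p
            u1 += xi * (yi - p)
            i00 += w
            i01 += xi * w
            i11 += xi * xi * w
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det  # Newton step: beta <- beta + I^{-1} U
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1

# invented example: chance of "good QOL" rising with a baseline score
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(x, y)
```

At convergence the score equations U = 0 are satisfied to numerical precision; in practice one would of course use a packaged routine, as the text notes.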
3. Change-Score Analysis

Analysis of individual or group changes in QOL scores over time is often of great importance in longitudinal studies. Growth curve models can be used to accomplish this, either at a population level or using a two-stage model to allow individual rates of change that are then dependent on characteristics such as treatment group or demographic variables. Change-score analysis has the advantage of inherently adjusting for the baseline score but must also be undertaken with caution, as it is by nature sensitive to problems of regression to the mean (34).

4. Time-to-Event Analysis

If attainment of a particular QOL score or milestone is the basis of the experiment, time-to-event or survival analysis methods can be applied. Once the event has been clearly defined, the analysis tools can be directly applied. These include Kaplan-Meier estimates of "survival" functions (35), Cox proportional hazards regression models (36) to relate covariates to the probability of the event, and logrank and other tests for differences in the event history among comparison groups. The QOL database, however, supports few such milestones at this time.
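As a concrete sketch of these tools, the fragment below computes the Kaplan-Meier product-limit estimate in Python; the data are invented, and the "event" might be attainment of a predefined QOL milestone. At each distinct event time the survival estimate is multiplied by (1 - d/n), where d is the number of events and n the number still at risk:

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).
    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns (time, survival) steps at each distinct event time."""
    data = sorted(zip(times, events))
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)   # events at time t
        n = sum(1 for tt, _ in data if tt >= t)   # at risk just before t
        if d > 0:
            surv *= 1.0 - d / n
            steps.append((t, surv))
        while i < len(data) and data[i][0] == t:  # skip ties at t
            i += 1
    return steps

times  = [2, 3, 3, 5, 6, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]    # 1 = milestone reached, 0 = censored
steps = kaplan_meier(times, events)
```

For these invented data the estimate steps down to 0.875, 0.75, 0.6, and 0.2 at times 2, 3, 5, and 8; censored observations shrink the risk set without forcing a step.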
III. TYPES OF MISSING DATA PROBLEMS

As mentioned briefly above, QOL data are often subject to missingness. Depending on the nature of the mechanism producing the missing data, analyses must be adjusted differently. Below, we list several types of missing data and provide general descriptions of the mechanisms along with their more formal technical names and terms.

A. Missing Completely at Random (MCAR)
This mechanism is sometimes termed "sporadic." Missing data probabilities are independent of both observable and unobservable quantities; observed data are a random subsample of the complete data. This type of mechanism rarely obtains in real data.

B. Missing at Random (MAR)
Missing data probabilities are dependent on observable quantities (such as covariates like age, sex, stage of disease), and the analysis can generally be adjusted by weighting schemes or stratification. This type of mechanism can hold if subjects with poor baseline QOL scores are more prone to missing values later in the trial or if an external measure of health, such as the Karnofsky performance
status completely explains the propensity to be missing. Because the missingness mechanism depends on observed data, analyses can be conducted that adjust properly for the missing observations.

C. Nonrandom, Missing Not at Random (MNAR), or Nonignorable

Missing data probabilities are dependent on unobservable quantities, such as missing outcome values or unobserved latent variables describing outcomes such as general health and well-being. This type of mechanism is fairly common in QOL research. One example is treatment-based differences in QOL compliance, due to worse survival on one arm of the trial. Or, subjects having great difficulty coping with disease and treatment may be more likely to refuse to complete a QOL assessment.

D. Evaluating the Missing Data Problem

As noted, clinical trials studying QOL will likely suffer somewhat from missing QOL data, even with the best efforts at data collection. Other covariates, however, such as survival, disease status, and toxicities, are collected routinely on patients. Many groups include a cover sheet with the QOL questionnaire to record the reason for incomplete assessments. This information can be used to model the missing data and QOL processes and determine the extent and nature of the missing data process. To determine which methods of statistical analysis will be appropriate, the analyst must first determine the patterns and amount of missing data and identify the mechanisms that generate missing data. Rubin (37) addressed the assumptions necessary to justify ignoring the missing data mechanism and established that the extent of ignorability depends on the inferential framework and the research question of interest. More precisely, likelihood-based and Bayesian inference are valid under both MCAR and MAR, but nonlikelihood-based inference is valid under MCAR only. The research question is relevant when considering conditional analyses given complete data; results assuming MCAR and MAR may differ.
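A small simulation (entirely invented numbers) illustrates the practical difference between these mechanisms: under MCAR the complete-case mean is unbiased, whereas a drop-out probability that depends on the unobserved score itself (MNAR) inflates the observed-case mean.

```python
import random

random.seed(1)

def observed_mean(scores, drop_prob):
    """Mean of the scores remaining after drop-out; drop_prob maps a
    score to its probability of being missing."""
    kept = [s for s in scores if random.random() >= drop_prob(s)]
    return sum(kept) / len(kept)

# simulated "true" QOL scores, centered at 50
scores = [random.gauss(50, 10) for _ in range(20000)]
true_mean = sum(scores) / len(scores)

mcar = observed_mean(scores, lambda s: 0.3)                     # flat 30% missing
mnar = observed_mean(scores, lambda s: 0.6 if s < 50 else 0.1)  # worse scores vanish
```

Here the MCAR estimate tracks the true mean, while the MNAR estimate is biased upward because the patients doing worst contribute fewest observations.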
Identification of missing data mechanisms in QOL research proceeds through two complementary avenues: collecting as much additional patient information as possible, and applying simple graphical techniques and hypothesis tests to distinguish missing data processes. Graphical presentations can be crucial as a first step in elucidating the relationship of missing data to the outcome of interest and providing an overall summary of results that is easily understood by nonstatisticians. A clear picture of the extent of missing QOL assessments is necessary both for selection of the appropriate methods of analysis and for honest reporting of the trial with respect
to reliability and generalizability. In clinical trials, this means summarizing the proportions of patients in whom assessment is possible (e.g., surviving patients still on study) and then the pattern of assessments among these patients. Machin and Weeden (38) combine these two concepts in Figure 1, using the familiar Kaplan-Meier plot to indicate survival rates and a simple table describing QOL assessment compliance. For this study of palliative treatment for patients with small cell lung cancer and poor prognosis, the Kaplan-Meier plot illustrates why the expected number of assessments is reduced by 60% at the time of the final assessment. The table further indicates the increase in missing data even among surviving subjects, from 25% at baseline to 71% among the evaluable patients at 6 months. If the reasons for missing assessments differ over time or across treatment groups, it may be necessary to present additional details about the missing data.

Figure 1 Kaplan-Meier estimates of the survival curves of patients with small cell lung cancer by treatment group (after MRC Lung Cancer Working Party, 1996). The times at which QOL assessments were scheduled are indicated beneath the time axis. The panel indicates the QOL assessments made for the seven scheduled during the first 6 months as a percentage of those anticipated from the currently living patients. (From Ref. 38, copyright John Wiley & Sons Limited, reproduced with permission.)

A second step is describing the missing data mechanism, especially in relation to patients' QOL. A useful technique is to present the available data separately for patients with different amounts of and reasons for drop-out. This is illustrated in Figure 2, due to Troxel (39), where estimates of average symptom distress in patients with advanced colorectal cancer are presented by reason for drop-out and duration of follow-up. Higher symptom distress is reported by patients who drop out due to death or illness, and the worsening of symptom status over time is more severe for these patients as well. Patients with a decreasing QOL score may also be more likely to drop out, as demonstrated by Curran et al. (40), where a change score between two previous assessments was predictive of drop-out.

Figure 2 Average scores by type and length of follow-up: Symptom Distress Scale. ——, lost-death; - - -, lost-illness; – – –, lost-other; — —, complete follow-up. (From Ref. 39, copyright John Wiley & Sons Limited, reproduced with permission.)

Finally, graphical presentations can convey results so that readers may individually balance the importance of early versus late differences among treatment arms or across the different domains of QOL. A particularly simple but informative display is given in Figure 3, due to Coates and Gebski (41). This illustrates the differences in the physician ratings of QOL between patients who completed and did not complete the QOL self-assessment. The average differences, with 95% confidence intervals, clearly communicate the consistent trends across all the time points and the statistical significance at specific points in time.

Figure 3 ANZ 8614: difference in the Quality of Life Index (QLI) score (assessed by the physician) between patients who did or did not comply with self-assessment of QOL at the relevant time point, indicated in weeks after randomization. Plots show mean difference and 95% confidence intervals. The possible range of difference in scores is 10. Negative scores indicate that patients who did not comply with self-assessment of QOL had worse QOL as assessed by the physician using the QLI. *Number not complying, number complying with self-assessment. (From Ref. 41, copyright John Wiley & Sons Limited, reproduced with permission.)

Since baseline QOL scores are often predictive of survival (42), the usual Kaplan-Meier plots, stratified by baseline QOL, can be very informative.

1. Comparing MCAR and MAR

Assuming a monotone pattern of missing data, Diggle (43) and Ridout (44) proposed methods to compare MCAR and MAR drop-out. The former proposal involves testing whether scores from patients who drop out immediately after a given time point are a random sample of scores from all available patients at that assessment. The latter proposal centers on logistic regression analysis to test whether observed covariates affect the probability of drop-out.

2. Testing for MNAR

As mentioned earlier, if likelihood or Bayesian inference is used, then distinguishing between MCAR and MAR is often not the primary concern. Recall that if either MCAR or MAR holds, the missing data mechanism depends on observed quantities only and inferences on Y can be based solely on the observed data. The main issue for likelihood or Bayesian inference is distinguishing between MAR and MNAR. Unfortunately, testing the assumption of MAR against a hypothesis of MNAR is not trivial; such a procedure rests on strong assumptions that are themselves untestable (40). When fitting a nonignorable model, certain assumptions are made in the specification of the model about the relationship between the missing data process and the unobserved data. These assumptions are fundamentally untestable. Molenberghs et al. (45) provide examples where different models produce very similar fits to the observed data but yield completely different predictions for the unobserved data. Little (46), discussing pattern-mixture models, suggests that underidentifiability is a serious problem with MNAR missing data models and that problems may arise when estimating the parameters of the missing data mechanism simultaneously with the parameters of the underlying data model. Similar problems may exist in the selection model framework (47). Because of the difficulties in identifying the missing data mechanism, analysis of repeated measures with missing data is not trivial. This is especially true for QOL assessments, where data may be missing for several reasons. If a sufficient amount of data is collected relating to why QOL questionnaires have not been completed, however, analysts will have a more solid basis for missing data models.
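A minimal sketch of the first kind of check, in the spirit of Diggle's proposal above and with invented scores, compares the last observed values of patients about to drop out with those of patients who continue; under MCAR the two groups should be indistinguishable, so a large two-sample statistic is evidence against MCAR:

```python
from statistics import mean, stdev

# invented previous-visit QOL scores: patients who then dropped out
# versus patients who completed the next assessment
dropped   = [38, 41, 35, 44, 39, 37, 42, 36, 40, 34]
completed = [52, 47, 55, 49, 51, 46, 53, 50, 48, 54]

def welch_t(a, b):
    """Welch two-sample t statistic; under MCAR, imminent drop-outs are a
    random sample of the current scores and this should be near zero."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

t = welch_t(dropped, completed)   # strongly negative here: drop-outs scored worse
```

Ridout's logistic-regression version replaces this comparison with a regression of the drop-out indicator on observed covariates; either way, only MCAR versus MAR is being examined, not MNAR.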
E. Special Problems
Several special scenarios arise with respect to QOL data collection, and a few are described below. Although these issues are of some concern in their own right, they become increasingly problematic when there is differential missing data between two treatment arms. Differential rates of illness, death, or relapse will generally result in differing amounts of missing data for the study arms, making a valid comparison of study treatments extremely difficult with respect to the QOL end point.

1. Missing Data due to Illness

It is perhaps inevitable in a clinical trial studying QOL that some subjects will be at certain times too ill to complete their QOL assessments. Since health status and QOL are almost certainly not independent, this results in nonignorable missingness, as described above. To facilitate modeling of the missing data process in this situation, as much information as possible should be collected regarding the patient's health status, toxicity episodes, and clinical status; proxy measures of the patient's QOL can also be useful, although the poor correlation between patient and proxy measures is well documented (48).

2. Missing Data due to Death

Obviously subjects who die before completion of the study will have shortened QOL vectors. Because vital status and QOL are almost certainly not independent, this too results in nonignorable "missing data." This situation is different from that of missing data due to illness described above, however, since in this case the patient's potentially available data have been simply truncated rather than not observed. It makes little sense to discuss what a patient's QOL score would have been after death, had it been observed. In this situation, conditional analyses of QOL, given survival up to a certain point, may be the most appropriate. Models that jointly assess both the QOL and survival end points can also be useful.

3. Enforced Missingness due to Study Constraints

In some trials, patients who fail or relapse are taken off-study; in general they are still followed for vital status. Although every effort should be made to complete the QOL assessment schedule, subsequent QOL assessments are not always obtained. As with patients who have missing QOL data due to illness, this can result in nonignorable missing data. It is potentially less severe, since all patients who fail are taken off-study rather than some being self-selected for inability to complete a QOL assessment. Nonetheless, differential failure rates on treatment arms can have a devastating effect on the analysis of the QOL data.
IV. METHODS

Several extensions to standard mixed models have been proposed in the context of longitudinal measurements in clinical trials. Zee (49) proposed growth curve models where the parameters relating to the polynomial in time are allowed to differ according to the various health states experienced by the patient (e.g., on treatment, off treatment, postrelapse, etc.). The model may also contain other covariates, such as the baseline response value or other relevant clinical information, and these may be either constant or varying across health states. This method requires that the missing data be MAR and may be fit with standard packages by simply creating an appropriate variable to indicate health state; in essence it is a type of pattern-mixture model (see below). Care must be taken to ensure that enough patients remain in the various health states to properly estimate the extra parameters.

Schluchter (50) proposed a joint mixed effects model for the longitudinal assessments and the time to drop-out. Suppose the time to drop-out, or censoring, is denoted by Ti. The joint model allows the Ti (or a function of the Ti, in this case the log) to be correlated with the random effects bi. The model is as follows:

$$\begin{pmatrix} b_i \\ \log(T_i) \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ \mu_t \end{pmatrix}, \begin{pmatrix} B & \sigma_{bt} \\ \sigma_{bt}' & \sigma_t^2 \end{pmatrix} \right)$$
For example, patients with steeper rates of decline in measurements over time (as measured by the random effects bi) may be more likely to fail early. This model allows MNAR data in the sense that the time of drop-out is allowed to depend, through the covariance parameter σbt, on the rate of change in the underlying measurements. If there are intermittent patterns of missing data, these must be assumed to be MCAR. Software to fit this model is not readily available.

A. Generalized Estimating Equations (GEEs)

GEEs (51) provide a framework to treat disparate kinds of data in a unified way. In addition, they require specification of only the first two moments of the repeated measures, rather than the likelihood. Instead, estimates are obtained by solving an estimating equation of the following form:

$$U = \sum_{i=1}^{n} D_i' V_i^{-1} (Y_i - \mu_i) = 0$$

Here µi = E(Yi | Xi, β) and Di = ∂µi/∂β are the usual mean and derivative functions, and Vi is a working correlation matrix. For Gaussian measurements, the estimating equations resulting from the GEE are equivalent to the usual score equations
obtained from a multivariate normal maximum likelihood model; the same estimates will be obtained from either method. Software is available in the form of an SAS macro (52). GEEs produce unbiased estimates for data that are MCAR. Extensions to the GEE do exist for data that are MAR: weighted GEEs will produce unbiased estimates provided the weights are estimated consistently (53,54). When the missingness probabilities depend only on observed covariates, such as the stage of disease, or responses, such as the baseline QOL score, a logistic or probit model can be used to estimate missingness probabilities for every subject; the weights used in the analysis are then the inverses of these estimated probabilities. Robins et al. (54) discuss these equations and their properties in detail; presented simply, the estimating equation takes the form

$$U = \sum_{i=1}^{n} D_i' V_i^{-1} \operatorname{diag}\!\left( \frac{R_i}{\hat{\pi}_i} \right) (Y_i - \mu_i) = 0$$

where πij = P(Rij = 1 | Y0i, Wi, α), π̂ij is an estimate of πij, and diag(Q) indicates a matrix of zeroes with the vector Q on the diagonal. Although software exists to fit GEEs, additional programming is required to fit a weighted version.
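In the simplest setting, estimating a single follow-up mean, the weighted estimating equation Σ (Ri/πi)(Yi − µ) = 0 solves to an inverse-probability-weighted average. The simulation below is our own sketch with invented data; for clarity the true probabilities πi are plugged in, whereas in practice they would be replaced by logistic-regression estimates π̂i. It shows the weights correcting the bias of the complete-case mean under MAR drop-out:

```python
import random

random.seed(2)

# Simulated trial: baseline score X is always seen; follow-up Y = X + noise
# is observed (R = 1) with a probability that falls with worse baseline.
n = 20000
data = []
for _ in range(n):
    x = random.gauss(50, 10)
    y = x + random.gauss(0, 5)
    pi = 0.9 if x >= 50 else 0.3        # MAR: depends on observed X only
    r = 1 if random.random() < pi else 0
    data.append((x, y, r, pi))

true_mean = sum(y for _, y, _, _ in data) / n

# Unweighted complete-case mean: biased, because low-X (hence low-Y) drop out
cc = [y for _, y, r, _ in data if r]
cc_mean = sum(cc) / len(cc)

# Solving sum_i (R_i / pi_i)(y_i - mu) = 0 for mu gives an
# inverse-probability-weighted mean
num = sum(r / pi * y for _, y, r, pi in data)
den = sum(r / pi for _, y, r, pi in data)
ipw_mean = num / den
```

Each observed patient stands in for 1/πi patients like herself, so the underrepresented poor-baseline stratum is weighted back up and the bias of the complete-case estimate disappears.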
Joint Modeling of Measurement and Missingness Processes
One can model the joint distribution of the underlying complete data Yi and the missingness indicators Ri. If conditioning arguments are used, two types of models can result; the selection model is concerned with f(Yi)f(Ri|Yi), whereas the pattern mixture model is concerned with f(Ri)f(Yi|Ri). The two approaches are discussed and compared in detail by Little (46). Pattern mixture models proceed by estimating the parameters of interest within strata defined by patterns of and/ or reasons for missingness and then by combining the estimates. Selection models proceed by modeling the complete data and then modeling the behavior of the missingness probabilities conditional on the outcome data. Models for continuous data have been proposed by Diggle and Kenward (47) and Troxel et al. (55). Although the computations can be burdensome, the approach will produce unbiased estimates even in the face of MNAR data, provided that both parts of the model are correctly specified. The selection models assume that the complete underlying responses are multivariate normal; any parametric model, such as the logistic, can be used for the missing data probabilities. The type of missingness mechanism is controlled by the covariates and/or responses that are included in the model for the missingness probabilities. In this example, the probabilities may depend on the current, possibly unobserved measurement Yij, implying that the missing data may be MNAR; it is possible to allow dependence on previous values as well. The observed data likelihood is
Statistical Analysis of Quality of Life
283
obtained by integrating the complete data likelihood over the missing values. Estimates are usually obtained through direct maximization of the likelihood surface; numerical integration is generally required. Once estimates are obtained, inference is straightforward using standard likelihood techniques. This method allows analysis of all the data, even when the missingness probabilities depend on potentially unobserved values of the response. The estimates are also likely to depend on modeling assumptions, most of which are untestable in the presence of MNAR missing data. Despite these drawbacks, these models can be very useful for investigation and testing of the missingness mechanism. In addition, the bias that results from assuming the wrong type of missingness mechanism may well be more severe than the bias that results from misspecification of a full maximum likelihood model. Software to fit the Diggle and Kenward (47) model is available (56). For discrete data, methods allowing for nonignorable missing data have been proposed by Fay (57), Baker and Laird (58), and Conaway (59). Here, loglinear models are used for the joint probability of outcome and response variables conditional on covariates. The models can be fit using the EM algorithm (60), treating the parameters of the missingness model as a nuisance.

C. Multiple Imputation

Imputation, or "filling-in," of data sets is a way of converting an incomplete data set to a complete one. This method is attractive because once the imputation is conducted, the methods for complete data described in Section II can be applied. Simple imputation consists of substituting a value for the missing observations, such as the mean of the existing values, and then adjusting the analysis to account for the fact that the substituted value was not obtained with the usual random variation.
Multiple imputation (61) is similar in spirit to simple imputation but with added safeguards against underestimation of variance due to substitution. Several data sets are imputed, and the analysis in question is conducted on each of them, resulting in a set of estimates obtained from each imputed data set. These several results are then combined to obtain final estimates based on the multiple set. Multiple imputation can be conducted in the presence of all kinds of missingness mechanisms. The usual drawback with respect to nonignorable missingness applies, however. A model is required to obtain the imputed values, and in the presence of nonignorable missingness, the resultant estimates are sensitive to the chosen model; even worse, the assumptions governing that model are generally untestable due to the very nature of the missing data. Finally, some question whether it is appropriate to impute values for subjects whose data are missing because of early death. Such an approach allows an estimate of the study population’s experience if the entire group stayed on study for the entire period of QOL
284
Troxel and Moinpour
assessment. That is, the QOL results are generalizable to new patients who are candidates for the treatments and whose length of survival is not yet known.
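The combining step in multiple imputation can be written down compactly. Below is a minimal sketch of Rubin's rules (ref 61) for pooling estimates across m imputed data sets; the function name and numerical inputs are illustrative, not from the text.

```python
def combine_imputations(estimates, variances):
    """Pool per-imputation point estimates and variances (Rubin's rules).

    Returns the pooled estimate and its total variance, which includes
    the between-imputation component that simple imputation ignores.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                # pooled point estimate
    u_bar = sum(variances) / m                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total_var = u_bar + (1 + 1 / m) * b
    return q_bar, total_var

# Three imputed data sets give slightly different treatment effects:
est, var = combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, var)  # pooled estimate 2.0; total variance 0.5 + (4/3)*1.0
```

The inflation term (1 + 1/m)·B is the safeguard against variance underestimation mentioned above: if the imputations disagree, the reported uncertainty grows accordingly.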
V. AVOIDING PITFALLS: SOME COMMONLY USED SOLUTIONS

A. Substitution Methods
In general, methods that rely on substitution of some value, determined in a variety of ways, are subject to bias and heavily dependent on the assumptions made in obtaining the substituted value. For these reasons, they should not be used to produce a primary analysis on which treatment or other decisions are based. One of the most serious problems with substitution methods, especially when the worst-score method is used, is that they can seriously damage the psychometric properties of a measure. These properties, such as reliability and validity, depend on variation in the scores. A second problem is that in substituting values and then conducting analyses based on those data, the variance of estimates will be underestimated, since the missing values, had they been observed, would carry with them random variation, which the substituted values do not. Substitution methods can be useful, however, in conducting sensitivity analyses to determine the extent to which the analysis is swayed by differing data sets.

1. Worst Score

This method is often used in sensitivity analyses, since the implicit assumption is that the subjects who did not submit an assessment are all as badly off as they can possibly be with respect to QOL. This is usually the most extreme assumption possible, so an analysis robust to worst-score substitution has a strong defense. The comment raised above regarding the psychometric measurement properties warrants caution, however.

2. Last Value Carried Forward

This substitution method tries to use each patient's own scores to provide information about the imputed value. It assumes, however, that subjects who drop out do not have a changing QOL score, when in practice it is often the subjects in rapid decline who tend to drop out prematurely. For this reason, last value carried forward should be used with extreme care, if at all.
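For concreteness, last value carried forward amounts to the following (a sketch; `None` marks a missed assessment, and the scores are invented):

```python
def lvcf(scores):
    """Carry each patient's last observed QOL score forward over
    missing assessments (None).  Leading missing values stay None."""
    filled, last = [], None
    for s in scores:
        if s is not None:
            last = s
        filled.append(last)
    return filled

# A patient in rapid decline stops submitting assessments after the
# second visit; LVCF freezes the score at 60, masking any further
# deterioration -- exactly the bias described above.
print(lvcf([70, 60, None, None]))  # [70, 60, 60, 60]
```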
3. Average Score

Use of the average score, either within a patient or within a group of patients (such as those on the same treatment arm), is more closely related to classic
imputation methods. Again, it assumes that the imputed values are no different from the observed values, but it does not necessarily force each subject's score to remain constant.

B. Adjusted Survival Analyses

Some authors (62,63) have proposed analyses in which survival is treated as the primary outcome, but it is adjusted for the QOL experience of the patients. This is an extremely appealing idea, for it clarifies the inherent trade-off between length and quality of life that applies to most patients. It can be difficult to implement satisfactorily in practice, however, because of the difficulty of obtaining the appropriate values with which to weight survival in different periods. The two methods described below have gained some popularity.

1. Quality-adjusted Life Years

This method consists of estimating a fairly simple weighted average, in which designated periods of life are weighted according to some utility describing QOL. Because utilities are obtained using lengthy interviews or questionnaires focusing on time trade-offs or standard gambles, investigators commonly substitute utilities obtained from some standard population rather than any information obtained directly from the patient. This renders the analysis largely uninterpretable, in our view.

2. Q-TWiST

Q-TWiST (64), or quality-adjusted time without symptoms and toxicity, is a more detailed method of adjustment, though still one that relies on utilities. The patient's course through time is divided into intervals in which the patient experiences toxicity due to treatment, toxicity due to disease (i.e., brought on by relapse), and no toxicity.
These intervals may be somewhat arbitrary, determined not by the patient’s actual experience with toxicity but by a predefined expectation of the average interval in which patients with a given disease receiving a given treatment experience problems due to treatment, the average interval they spend disease-free after the end of therapy, and the average time until relapse. To compound this arbitrariness, utilities for each period are chosen by the analyst. For example, time spent suffering from treatment-induced toxicities may be rated at 50% of perfect health. This results in an analysis that reflects only a small amount of patient-derived data and a large number of parameters chosen by the investigator. Data from patient rating scales and Q-TWiST analyses can differ (65).
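The Q-TWiST calculation itself is a short weighted sum. The sketch below uses illustrative interval lengths, and the 0.5 utility weights are exactly the kind of analyst-chosen parameter criticized above:

```python
def q_twist(tox, twist, rel, u_tox=0.5, u_rel=0.5):
    """Quality-adjusted time without symptoms and toxicity (ref 64).

    tox   : months with treatment toxicity (down-weighted by u_tox)
    twist : months without symptoms or toxicity (utility 1)
    rel   : months after relapse (down-weighted by u_rel)
    """
    return u_tox * tox + twist + u_rel * rel

# 3 months of toxicity, 12 good months, 6 months after relapse:
print(q_twist(3, 12, 6))  # 16.5 quality-adjusted months
```

Varying u_tox and u_rel over their plausible ranges is the usual way to show how sensitive the comparison is to these investigator-chosen utilities.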
REFERENCES

1. Moinpour CM, Feigl P, Metch B, Hayden KH, Meyskens FL Jr, Crowley J. Quality of life end points in cancer clinical trials: review and recommendations. J Natl Cancer Inst 1989; 81:485–495.
2. Nayfield S, Ganz PA, Moinpour CM, Cella D, Hailey B. Report from a National Cancer Institute (USA) workshop on quality of life assessment in cancer clinical trials. Qual Life Res 1992; 1:203–210.
3. National Cancer Institute (US). Quality of life in clinical trials. Proceedings of a workshop held at the National Institutes of Health; March 1–2, 1995. Bethesda, MD: National Institutes of Health, 1996.
4. Sugarbaker PH, Barofsky I, Rosenberg SA, Gianola FJ. Quality of life assessment of patients in extremity sarcoma clinical trials. Surgery 1982; 91:17–23.
5. Nunnally J. Psychometric Theory. New York: McGraw-Hill, 1978.
6. Kirshner B, Guyatt GH. A methodological framework for assessing health indices. J Chronic Dis 1985; 38:27–36.
7. Guyatt GH, Deyo RA, Charlson M, Levine MN, Mitchell A. Responsiveness and validity in health status measurement: a clarification. J Clin Epidemiol 1989; 42:403–408.
8. Stewart AL, Ware JE Jr. Measuring Functioning and Well-Being: The Medical Outcomes Approach. Durham and London: Duke University Press, 1992.
9. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992; 30:473–483.
10. McHorney C, Ware J, Raczek A. The MOS 36-item short-form health survey (SF-36). II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993; 31:247–263.
11. McHorney C, Ware J, Lu R, Sherbourne C. The MOS 36-item short-form health survey (SF-36). III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994; 32:40–46.
12. Ware JE Jr, Snow KK, Kosinski MA, Gandek B. SF-36 Health Survey: Manual and Interpretation Guide. Boston: Nimrod Press, 1993.
13. Ware JE Jr, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary Scales: A User's Manual. Boston: The Health Institute, New England Medical Center, 1994.
14. Ware JE Jr. The SF-36 Health Survey. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials. 2nd ed. Philadelphia: Lippincott-Raven, 1996.
15. Aaronson NK, Bullinger M, Ahmedzai S. A modular approach to quality of life assessment in cancer clinical trials. Recent Results Cancer Res 1988; 111:231–249.
16. Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality of life instrument for use in international clinical trials in oncology. J Natl Cancer Inst 1993; 85:365–373.
17. Bergman B, Aaronson NK, et al. The EORTC QLQ-LC13: a modular supplement to the EORTC QLQ-C30 for use in lung cancer trials. Eur J Cancer 1994; 30A:635–642.
18. Bjordal K, Ahlner EM, et al. Development of an EORTC questionnaire module to be used in quality of life assessments in head and neck cancer patients. Acta Oncol 1994; 33:879–885.
19. Sprangers M, Cull A. The European Organization for Research and Treatment of Cancer approach to quality of life: guidelines for developing questionnaire modules. Qual Life Res 1993; 2:287–295.
20. EORTC Quality of Life Study Group. EORTC QLQ-C30 Scoring Manual. Brussels: EORTC Quality of Life Study Group, 1997.
21. EORTC Quality of Life Study Group. EORTC QLQ-C30 Reference Values. Brussels: EORTC Quality of Life Study Group, 1998.
22. Cella D, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi AE, et al. The functional assessment of cancer therapy scale: development and validation of the general measure. J Clin Oncol 1993; 11:570–579.
23. Weitzner M, Meyers C, Gelke C, Byrne K, Cella DF, Levin V. The Functional Assessment of Cancer Therapy (FACT) scale. Development of a brain subscale and revalidation of the general version (FACT-G) in patients with primary brain tumors. Cancer 1995; 75:1151–1161.
24. Brady MJ, Cella DF, Mo F, Bonomi AE, Tulsky DS, Lloyd SR, Deasy S, Cobleigh M, Shiomoto G. Reliability and validity of the Functional Assessment of Cancer Therapy–Breast Quality-of-Life Instrument. J Clin Oncol 1997; 15:974–986.
25. D'Antonio LL, Zimmerman GJ, Cella DF, Long S. Quality of life and functional status measures in patients with head and neck cancer. Arch Otolaryngol Head Neck Surg 1996; 122:482–487.
26. Cella DF. Manual of the Functional Assessment of Chronic Illness Therapy (FACIT) Scales—Version 4. Evanston, IL: Center on Outcomes, Research and Education (CORE), Evanston Northwestern Healthcare and Northwestern University, 1997.
27. Schag CAC, Ganz PA, Heinrich RL. Cancer Rehabilitation Evaluation System—Short Form (CARES-SF): a cancer specific rehabilitation and quality of life instrument. Cancer 1991; 68:1406–1413.
28. Heinrich RL, Schag CAC, Ganz PA. Living with cancer: the cancer inventory of problem situations. J Clin Psychol 1984; 40:972–980.
29. Schag CAC, Heinrich RL, Ganz PA. The cancer inventory of problem situations: an instrument for assessing cancer patients' rehabilitation needs. J Psychosoc Oncol 1983; 1:11–24.
30. Schag CAC, Heinrich RL. Development of a comprehensive quality of life measurement tool: CARES. Oncology 1990; 4:135–138.
31. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman and Hall, 1989.
32. Baker RJ, Nelder JA. The GLIM System, Release 3: Generalized Linear Interactive Modelling. Oxford: Numerical Algorithms Group, 1978.
33. Hastie TJ, Pregibon D. Generalized linear models. In: Chambers JM, Hastie TJ, eds. Statistical Models in S. London: Chapman and Hall, 1993.
34. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons, 1986.
35. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53:457–481.
36. Cox DR. Regression models and life-tables [with discussion]. J R Stat Soc B 1972; 34:187–220.
37. Rubin DB. Inference and missing data. Biometrika 1976; 63:581–592.
38. Machin D, Weeden S. Suggestions for the presentation of quality of life data from clinical trials. Stat Med 1998; 17:711–724.
39. Troxel AB. A comparative analysis of quality of life data from a Southwest Oncology Group randomized trial of advanced colorectal cancer. Stat Med 1998; 17:767–779.
40. Curran D, Bacchi M, Schmitz SFH, Molenberghs G, Sylvester RJ. Identifying the types of missingness in quality of life data from clinical trials. Stat Med 1998; 17:697–710.
41. Coates A, Gebski VJ. Quality of life studies of the Australian New Zealand Breast Cancer Trials Group: approaches to missing data. Stat Med 1998; 17:533–540.
42. Moinpour CM, Savage MJ, Troxel AB, Lovato LC, Eisenberger M, Veith RW, Higgins B, Skeel R, Yee M, Blumenstein BA, Crawford ED, Meyskens FL Jr. Quality of life in advanced prostate cancer: results of a randomized therapeutic trial. J Natl Cancer Inst 1998; 90:1537–1544.
43. Diggle PJ. Testing for random dropouts in repeated measurements data. Biometrics 1989; 45:1255–1258.
44. Ridout M. Testing for random dropouts in repeated measurement data. Biometrics 1991; 47:1617–1621.
45. Molenberghs G, Goetghebeur EJT, Lipsitz SR. Non-random missingness in categorical data: strengths and limitations. Am Statist 1999; 53:110–118.
46. Little RJA. Modeling the drop-out mechanism in repeated-measures studies. J Am Stat Assoc 1995; 90:1112–1121.
47. Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis [with discussion]. Appl Stat 1994; 43:49–93.
48. Sprangers MAG, Aaronson NK. The role of health care providers and significant others in evaluating the quality of life of patients with chronic disease: a review. J Clin Epidemiol 1992; 45:743–760.
49. Zee BC. Growth curve model analysis for quality of life data. Stat Med 1998; 17:757–766.
50. Schluchter MD. Methods for the analysis of informatively censored longitudinal data. Stat Med 1992; 11:1861–1870.
51. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73:13–22.
52. Groemping U. GEE: a SAS macro for longitudinal data analysis. Technical Report, Fachbereich Statistik, Universitaet Dortmund, Germany, 1994.
53. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Stat Assoc 1995; 90:122–129.
54. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 1995; 90:106–121.
55. Troxel AB, Harrington DP, Lipsitz SR. Analysis of longitudinal data with non-ignorable non-monotone missing values. Appl Stat 1998; 47:425–438.
56. Smith DM. The Oswald Manual. Technical Report, Statistics Group, University of Lancaster, Lancaster, England, 1996.
57. Fay RE. Causal models for patterns of nonresponse. J Am Stat Assoc 1986; 81:354–365.
58. Baker SG, Laird NM. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J Am Stat Assoc 1988; 83:62–69.
59. Conaway MR. The analysis of repeated categorical measurements subject to nonignorable nonresponse. J Am Stat Assoc 1992; 87:817–824.
60. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm [with discussion]. J R Stat Soc B 1977; 39:1–38.
61. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons, 1987.
62. Glasziou PP, Simes RJ, Gelber RD. Quality adjusted survival analysis. Stat Med 1990; 9:1259–1276.
63. Goldhirsch A, Gelber RD, Simes J, Glasziou P, Coates A. Costs and benefits of adjuvant therapy in breast cancer: a quality-adjusted survival analysis. J Clin Oncol 1989; 7:36–44.
64. Gelber RD, Cole BF, Goldhirsch A. Comparing treatments using quality-adjusted survival: the Q-TWiST method. Am Stat 1995; 49:161–169.
65. Fairclough DL, Fetting JH, Cella D, Wonson W, Moinpour CM. Quality of life and quality adjusted survival for breast cancer patients receiving adjuvant therapy. Qual Life Res 2000; 8:723–731.
16
Economic Analysis of Cancer Clinical Trials

Gary H. Lyman
Albany Medical College, and State University of New York at Albany School of Public Health, Albany, New York
I. INTRODUCTION
A. Costs of Cancer Care

Health care expenditures in the United States have risen dramatically, now exceeding one trillion dollars annually and constituting 14% of the gross domestic product (Fig. 1) (1,2). Approximately 10% of health care expenditures are allocated for cancer care, totaling more than $100 billion annually. More than 90% of medical costs for cancer are associated with five diagnoses: breast cancer (24%), colorectal cancer (24%), lung cancer (18%), prostate cancer (18%), and bladder cancer (8%) (3,4). Hospital care represents the largest single cost component, accounting for approximately 50% of total cancer care costs. Other major components of health care costs include physician/professional costs (25%) and pharmaceutical and home care costs (approximately 10% each). Cancer care costs vary over time and are generally greater during the period immediately after diagnosis and during the last few months before death (3).

B. Health Care Outcome Measures

There is increasing interest in the assessment of health care outcomes beyond traditional clinical measures of efficacy. Alternative measures of interest include
292
Lyman
Figure 1 Annual health care expenditures. Annual U.S. health care expenditures for selected years from 1960 to 1998 reported by the Health Care Financing Administration. Total annual expenditures are reported in units of $100 billion, whereas per capita expenditures are presented in $ thousands. Total U.S. health expenditures projected for the year 2007 are $2.1 trillion.
health-related quality of life and economic outcomes. The primary economic measure in most economic studies is the mean cost or cost difference between treatment groups. To facilitate the comparison of different treatment strategies, combined measures have been developed that bring together clinical, quality of life, and economic outcomes into summary measures such as the quality-adjusted life year (QALY) and cost-effectiveness and cost-utility ratios.
C. Economic Analyses
A number of different types of economic evaluations have been developed, including cost-minimization, cost-effectiveness, and cost-utility analyses. The analysis of economic outcomes is complicated by the multiple outcomes, skewed distributions, and frequent missing data. Nevertheless, combined clinical and economic outcome measures permit more rational comparisons of different clinical strategies for purposes of medical decision making, patient counseling, clinical practice guideline development, and health care policy formulation.
Economic Analysis of Clinical Trials
293
D. Economic Analysis in Controlled Clinical Trials

Performing economic analyses in association with controlled clinical trials (CCTs) has attracted increasing enthusiasm in recent years. Figure 2 compares published cost-effectiveness measures for several types of cancer treatment derived from CCTs. Such analyses, however, are associated with several important methodological challenges. Economic measures are often of secondary interest in such trials, which frequently lack a priori economic hypotheses and suffer from missing data and inadequate sample size for valid statistical inference. The variability in the cost measures and the lack of agreement on clinically meaningful cost differences further limit the conclusions derived from such studies. The addition of economic outcomes to traditional measures of clinical efficacy increases the complexity and cost of CCTs. Economic analyses, therefore, should be limited to large phase III trials where important trade-offs between efficacy and cost are anticipated. This chapter focuses attention on the design, conduct, analysis, and
Figure 2 Cost effectiveness of cancer treatment. Estimated cost effectiveness for various cancer treatment modalities adapted from Smith et al. (41). Cost effectiveness is expressed in terms of incremental cost ($U.S. thousands) per life year saved. AML, acute myelogenous leukemia; NSCLC, non-small cell lung cancer; HD, Hodgkin's disease; ABMT, autologous bone marrow transplantation; CMF, cyclophosphamide, methotrexate, 5-fluorouracil; CAE, cyclophosphamide, adriamycin (doxorubicin), etoposide; IFN, interferon; adj, adjuvant; met, metastatic; adv, advanced.
reporting of economic analyses in the setting of cancer clinical trials. The strengths and the limitations of such analyses are discussed, and guidelines are offered for the proper conduct, evaluation, and interpretation of such economic analyses.
II. HEALTH CARE OUTCOMES

A. Clinical Efficacy
Response and survival often represent the primary clinical end points for the assessment of efficacy upon which sample size and power calculations are based. Alternatively, clinical outcome can be measured in terms of life expectancy or the average number of years of life remaining at a given age. The life expectancy of a population can be thought of as representing the area under the corresponding survival curve (5). The gain in life expectancy or life years saved with treatment represents the marginal efficacy and can be thought of as the area between the survival curves with and without intervention. This represents a more powerful method for assessing treatment effect than comparing median survivals or the proportion event free at a given time (Fig. 3). Changes in life expectancy are often used in economic analyses to express the efficacy of treatment. Such measures are limited by the difficulty in judging a clinically important gain in life years and extrapolating censored survival data beyond the trial period.
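Viewing life expectancy as the area under the survival curve, the gain in life years is the area between the treatment and control curves. A sketch with made-up survival curves evaluated on a common grid of times:

```python
def area_under_curve(times, surv):
    """Restricted life expectancy: area under a survival curve
    evaluated at a common grid of times, by the trapezoidal rule."""
    return sum((times[i] - times[i - 1]) * (surv[i] + surv[i - 1]) / 2
               for i in range(1, len(times)))

times   = [0, 1, 2, 3, 4, 5]          # years since randomization
control = [1.0, 0.7, 0.5, 0.35, 0.25, 0.2]
treated = [1.0, 0.8, 0.65, 0.55, 0.45, 0.4]

gain = area_under_curve(times, treated) - area_under_curve(times, control)
print(round(gain, 3))  # 0.75 life years gained over the 5-year trial period
```

The restriction to the trial period is deliberate: as the text notes, extrapolating the curves beyond the observed follow-up requires additional assumptions.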
B. Health-related Quality of Life (HRQOL)
In recent years, there has been increasing interest in the assessment of the impact of cancer and cancer treatment on quality of life. Health profiles derived from psychosocial theory attempt to assess HRQOL through one of a variety of scales addressing the relevant dimensions associated with quality of life, such as functional ability, emotional well-being, sexuality/intimacy, family well-being, treatment satisfaction, and social functioning (6). Alternatively, utility measures derived from economic and decision theory attempt to assess HRQOL by eliciting patient preferences for specific outcome states (7). Patient preferences can be assessed through a time trade-off method or a standard reference gamble, generating a single value of health status along a linear continuum from death (0) to full health (1). The major advantage of measures of patient preference or utility is that they can then be used to adjust measures of longevity such as life expectancy (e.g., quality-adjusted life years or QALYs). The QALY represents the time in full health considered by the patient equivalent to actual time in the diseased state. Serial measurement of patient preferences over time can be used to estimate the cumulative impact of treatment on HRQOL. The sum over all health care states of the product of the time spent in each state and the utility associated with the state will yield the quality-adjusted time without symptoms
Figure 3 Gain in life expectancy. Hypothetical survival curves of control and treatment subjects displaying the probability of survival over time since randomization adapted from Naimark and from Wright and Weinstein (5). The gain in median survival and probability of 5-year survival are shown. The area between the curves represents the life years gained with the intervention.
of disease or toxicity of treatment, or Q-TWiST, described by Gelber et al. (8). The value of such measures is limited by the time and cost involved in their assessment through direct patient encounters and the lack of elucidation of the multidimensional aspects of HRQOL. The assessment of HRQOL in conjunction with conventional clinical efficacy measures in CCTs has gained increasing interest over the past several years (9–11). Several authors have addressed the methodological challenges of HRQOL outcomes associated with the design and analysis of clinical trials (12,13). Guidelines have been proposed for incorporating HRQOL measures into CCTs (14).

C. Economic Outcomes

Economic outcome measures differ in several respects from traditional clinical outcome measures. The most important economic outcome of interest for clinical decision making and health policy formation is cumulative total cost, which considers both the activity level over time and unit costs. The activity level represents the amount of various resources used and the time expended in providing medical
Table 1  Types of Economic Analysis

Methodology          Cost unit   Effect unit
Cost of illness      Monetary    —
Cost minimization    Monetary    Equal
Cost effectiveness   Monetary    LYS*
Cost utility         Monetary    QALYS†
Cost benefit         Monetary    Monetary

* Life years saved.
† Quality-adjusted life years saved.
care. Unit costs represent the cost associated with each unit of activity. The total cost of illness represents the weighted sum of the unit costs, where the weights are represented by the units of activity for each cost item, such that

\text{Total cost} = \sum_{i=1}^{n} (\text{unit activity}_i \times \text{unit cost}_i)
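In code, the total cost is simply a weighted sum over cost items; the resource counts and unit costs below are hypothetical, chosen only to illustrate the arithmetic.

```python
def total_cost(items):
    """Total cost of illness: sum over cost items of
    (units of activity) x (unit cost)."""
    return sum(units * unit_cost for units, unit_cost in items)

# Hypothetical per-patient resource profile:
items = [
    (6, 1200.0),   # 6 hospital days at $1200/day
    (4, 150.0),    # 4 physician visits at $150/visit
    (2, 900.0),    # 2 chemotherapy cycles at $900/cycle
]
print(total_cost(items))  # 7200 + 600 + 1800 = 9600.0
```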
The major focus of such economic analyses relates to those resources and costs that might differ between treatment groups. The types of economic analyses associated with cancer CCTs are summarized in Table 1. Direct medical costs represent the costs of providing medical services for the prevention, diagnosis, treatment, follow-up, rehabilitation, and palliation of disease. These costs include those associated with hospitalization, professional services, pharmaceuticals, radiologic and laboratory testing, and home health care services. Direct nonmedical costs represent additional expenditures incurred while receiving medical care, such as transportation costs to and from the institution and child care expenses. Indirect costs include those associated with the morbidity of disease and treatment, such as days lost from work and lost economic output due to premature death. Intangible costs are those associated with pain and suffering and the loss of companionship. Although it is very difficult to express such concerns in monetary terms, these represent real social and emotional costs to the patient and family. Often economic outcome measures are combined with clinical and/or quality of life measures to provide a summary outcome measure reflecting the simultaneous difference in cost and the change in survival or quality-adjusted survival.

III. ECONOMIC ANALYSIS
A. When Should Economic Analyses Be Performed?
Economic analyses are most useful when the clinical and economic outcomes of interest are discordant, that is, when an intervention is associated with equal or improved outcome but a greater cost or when the cost of an intervention is the
same or less but with less effectiveness. Clearly, interventions that are associated with large or uncertain resource consequences and small or unclear efficacy are most likely to be candidates for an economic analysis. The proper timing of an economic evaluation in the development of a new intervention is important. Introduction too early in the process, before efficacy and standard procedures have been established, may lead to the waste of limited resources, whereas incorporation too late in the process may limit the ability of the evaluation to alter the dissemination of the technology. Economic analyses, therefore, should generally be limited to definitive or confirmatory studies of promising approaches likely to have considerable economic consequences or for which a trade-off between efficacy and cost is anticipated.

B. Types of Economic Analysis

1. Noncomparative Evaluations

Noncomparative (descriptive) economic studies generally are performed for either health administrative or public health purposes and do not involve explicit comparisons of treatment options, although implicit comparisons are often made. A common approach is that of burden-of-illness or cost-of-illness studies, where the cost of disease in a population is summarized by tabulating the incidence or prevalence of disease, the associated morbidity or mortality, and the total costs of illness.

2. Comparative Evaluations

Comparative economic studies evaluate possible interventions in cohorts of individuals, comparing the benefits and the costs. As shown in Table 1, several types of economic evaluations are available (15–21). When clinical effectiveness is not an issue or is considered equal between therapeutic alternatives, the evaluation may be most reasonably based on differences in resource utilization or cost through a cost-minimization analysis, where the strategy associated with the lowest total cost is identified.
Figure 4 Combined outcome measures. Relationship between clinical and economic outcome measures. Clinical measures such as survival or life expectancy and quality of life may be combined with economic outcome measures such as cost to simultaneously evaluate cost and efficacy in terms of cost-effectiveness or cost-utility ratios.

Clinical benefits are sometimes converted into the same economic measure in a cost-benefit analysis to combine them into a single measure. Such an approach, however, is limited by the requirement that a monetary value be placed on clinical and quality of life outcome measures. When important differences in both clinical efficacy and cost are anticipated, it is often preferable to combine economic measures with those of clinical efficacy (Fig. 4). In this situation, the measures of interest are generally the additional cost of one strategy over another (marginal cost) and the additional clinical benefit (marginal efficacy) or quality-adjusted clinical benefit (marginal utility). Cost-effectiveness analysis compares interventions based on the ratio of the marginal cost and the marginal effectiveness (marginal cost-effectiveness), expressed as the added cost per life year saved (Fig. 5). Cost-utility analysis compares treatments based on the ratio of the marginal cost and the marginal utility (marginal cost-utility), expressed as the added cost per QALY saved. Ultimately, these latter two approaches attempt to identify the most efficient approach, that is, the least costly strategy associated with the greatest effectiveness or utility.
Limitations of Economic Analyses
The evaluation and interpretation of an economic analysis will often differ substantially depending on the perspective from which it is undertaken: for example, that of the patient or family, a health care provider or institution, a third-party payer, or society as a whole. Indirect and intangible costs, although very important to the patient and family, may not even be considered in economic analyses from most other perspectives. From the narrowest perspective, the lowest cost will be associated with the absence of care or no intervention and the shortest survival. Likewise, lifetime costs will often be less in those with the shortest life expectancy, such as the elderly. From a more global perspective, public health efforts aimed at screening, early detection, and disease prevention assume greater importance, since these will ultimately improve clinical outcome. In addition, marginal summary measures do not reflect the absolute benefit or cost of an intervention. A strategy associated with a lower absolute effectiveness may
Economic Analysis of Clinical Trials
Figure 5 Cost-effectiveness plane. Plane displaying the relationship between incremental cost (ordinate) and incremental effectiveness (abscissa). Any point on the plane represents cost effectiveness expressed as the ratio of incremental cost to incremental effectiveness. The straight line from the lower left of the plane to the upper right represents the maximum acceptable cost-effectiveness ratio determined by society. Interventions associated with greater effectiveness and lower cost (lower right) are always considered acceptable, whereas those associated with greater cost and less effectiveness (upper left) are always unacceptable. The acceptability of cost-effectiveness ratios in the other boxes depends on whether it lies below or above the maximum cost-effectiveness line. Any estimated cost effectiveness below that line represents an acceptable ratio, whereas those above the line are considered unacceptable.
actually appear superior in terms of cost-effectiveness or cost-utility. It is important, therefore, to measure both the absolute and the marginal benefit and cost in such analyses.
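To make the distinction between absolute and marginal (incremental) measures concrete, here is a minimal sketch; all cost and effectiveness figures are invented for illustration:

```python
# Hypothetical figures: standard care vs. a new intervention.
# Costs in dollars; effectiveness in life-years per patient.
standard = {"cost": 12_000.0, "effect": 2.0}
new_rx = {"cost": 30_000.0, "effect": 2.5}

# Marginal (incremental) cost-effectiveness ratio: added cost per
# added life-year of the new intervention over the standard.
delta_cost = new_rx["cost"] - standard["cost"]        # 18,000
delta_effect = new_rx["effect"] - standard["effect"]  # 0.5
icer = delta_cost / delta_effect

# Absolute ("average") ratios, which can rank strategies differently
# and therefore should be reported alongside the marginal ratio.
avg_standard = standard["cost"] / standard["effect"]
avg_new = new_rx["cost"] / new_rx["effect"]

print(f"ICER: ${icer:,.0f} per life-year gained")
print(f"Average ratios: standard ${avg_standard:,.0f}, new ${avg_new:,.0f}")
```

Here the new intervention costs $36,000 per additional life-year at the margin even though its average cost per life-year ($12,000) looks modest, illustrating why both views are needed.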
IV. ECONOMIC ANALYSIS AND CCTs

A. Why Perform Economic Analyses in Association with Clinical Trials?

1. Strengths

The quality of an economic analysis depends upon the precision and validity of the underlying data, best provided by CCTs. Just as CCTs are thought to represent
the most definitive way to evaluate interventions for efficacy, economic analyses based on such trials may represent the best means to evaluate the cost and cost-efficiency of treatment. Such economic analyses will be based on the most reliable estimates of treatment efficacy, and they will facilitate the comprehensive comparison of therapeutic options. Study design, data collection, and planned analyses are generally detailed in a written protocol. The care taken in the design, conduct, and analysis of such trials may provide the best available information on resource utilization and treatment efficacy. The importance of randomized controlled trials is evident in the efforts of observational studies and nonrandomized trials to emulate their careful design and analysis procedures in order to reach the same conclusions. Economic analyses associated with CCTs should be sought before wide dissemination of new technologies, especially when the resource consequences or costs are large.

2. Weaknesses

Economic outcomes measured in association with clinical trials are often considered of secondary importance, with no a priori hypothesis, small sample size, and frequent missing data. Even when properly designed and conducted, economic analyses within CCTs may have low external validity related to the lack of representativeness and the limited generalizability that result, in part, from strict eligibility criteria. The study population also must adhere to clinical monitoring that may not be representative of clinical practice and will be associated with resource utilization and costs differing considerably from routine care. The costs involved with exploratory or early clinical trials may not be representative of what they would be with more experience. Finally, economic analysis will add to the cost and complexity of CCTs and should generally be limited to use with large, prospective, phase III, randomized clinical trials.
Careful consideration should be given to the importance of the economic information and the appropriateness of the clinical trial design before incorporating economic assessment into a CCT, and care must be taken to select only the most relevant and objective measures of resource utilization for inclusion in the trial. The same methodological rigor should be applied to the economic analysis as is commonly used in the assessment of therapeutic efficacy. The appropriate use of economic analyses in association with CCTs, therefore, requires careful attention to the proper design, conduct, analysis, and reporting of such analyses (22–27).

B. Design Considerations
1. Types of Studies

As shown in Table 2, three general types of economic analysis related to CCTs are described that vary in the nature and source of the economic data.

Table 2 Economic Analyses Associated with Cancer Clinical Trials

Type   Efficacy      Activity       Unit cost      Precision   Generalizability
I      Prospective   Retrospective  Retrospective  +           +++
II     Prospective   Prospective    Retrospective  ++          ++
III    Prospective   Prospective    Prospective    +++         +

Prospective, data from all or a sample of trial institutions; Retrospective, retrospective data from study institutions or other sources; Efficacy, primary outcomes; Activity, resources used.

In type I
economic analyses, all cost information is obtained either from an independent source in an unsampled fashion or from a subsample of study subjects. Such studies can often be performed rapidly at relatively low cost, but there is little information on measure variability, and subsequent analysis relies on sensitivity analysis to assess the robustness of the assumptions. In type II economic studies, resource utilization is sampled concurrently with measures of clinical efficacy. Such an approach provides information on variability for estimation and hypothesis testing but may limit generalizability to other economic environments and time periods. Missing data cannot be assumed to be missing at random, which allows for the introduction of a measurement bias. In type III economic studies, complete cost information is obtained on the trial subjects, including resource utilization and unit costs. The amount of information collected often requires limiting sampling to a subgroup of the study population, usually at a few institutions. Such an analysis has limited generalizability and requires considerable effort and justification addressing concerns about sampling and measurement bias.

2. Study Hypotheses

The major study questions related to economic measures should be clearly stated in terms of testable hypotheses. All primary economic questions and secondary hypotheses relating to outcome differences among specified subgroups should be stated in advance of the trial. The clinical and economic relevancy of the study hypotheses should be clearly stated. The economic importance of specific interventions is likely to be greatest when considering diseases of clinical and public health significance and interventions associated with considerable cost trade-offs.

3. Study Design

The design of a clinical investigation, including any economic analysis, should attempt to minimize the potential for systematic error or bias, including that associated with subject selection, measurement, and confounding (28). Confounding represents the modification of the true treatment effect by a factor associated with
both the outcome of interest and treatment group assignment. Confounding can obscure a true outcome difference when it exists or create an apparent difference that does not exist. The potential for confounding is most effectively addressed in the design of a trial by incorporating appropriate controls, basing treatment assignment on randomization, and blinding subjects and investigators to the assigned treatment (double blinding). Randomization ensures that both known and unknown confounding factors will, on average, be distributed equally in the treatment groups. The balance of important prognostic factors within treatment groups can be enhanced by randomizing separately within subgroups (stratification) but should be confirmed in the analysis.

4. Study Population

All subjects in the study should be described and accounted for. The nature, location, and setting of the study should be fully detailed. Eligibility criteria, including any inclusion and exclusion criteria, should be presented. A balance should be sought between narrow eligibility to enhance study power and limited restrictions to increase generalizability.

5. Sample Size

The goal of a clinical trial is to confirm the treatment effect accurately or to refute it unambiguously. The sample size necessary to adequately address primary study hypotheses should be stated in advance based on the likely treatment effect or the number of events anticipated, measurement variation, the maximum tolerable alpha error (false positive), and the maximum beta error (false negative) considered acceptable. It is imperative that sufficient numbers of subjects be included in the trial so that a negative study is unlikely to be a false negative (29). Sample sizes large enough to achieve a power of 80–95% are generally considered desirable for detecting meaningful differences. Multi-institutional CCTs may increase study accrual, sample size, and the external validity of both the clinical and the economic outcomes.
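The sample-size reasoning above (effect size, variability, alpha, and power) can be sketched with the usual two-sample normal approximation; the dollar figures below are hypothetical:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided, two-sample
    comparison of mean costs, using the normal approximation:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical scenario: detect a $1,000 mean cost difference when the
# per-patient standard deviation is $2,500.
print(n_per_group(1_000, 2_500))              # 80% power
print(n_per_group(1_000, 2_500, power=0.95))  # 95% power needs more subjects
```

As the text notes, skewness and missing data in cost outcomes typically push the real requirement above this normal-theory floor.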
When the primary outcomes represent failure time data (time-to-event), sample size estimation should consider the event (or cost) rate, the anticipated duration of observation, and the expected censoring rate. Failure to consider censoring in an economic analysis may further compromise the power of the study. When the sample size of the trial is appropriately targeted to the economic outcomes, it must be anticipated that a longer duration of accrual or follow-up may be needed. This longer period of observation may not always be justified or even ethical, especially when meaningful differences in clinical outcome are already apparent. In small trials, even relatively large and clinically important differences in outcome may be statistically insignificant because of low study power. In studies with insufficient sample size to address subgroup
analyses, results should be presented descriptively for purposes of hypothesis generation only. Because of greater variability, skewed distributions, and frequent missing data, CCTs with primary economic hypotheses may require larger sample sizes to achieve the desired ability to demonstrate an economic effect. It may be difficult to estimate sample size requirements given the limited information on what constitutes meaningful differences in economic outcomes. Sample size estimates should consider any adjustment needed for multiple testing due to the multiple outcome measures involved. Sample size estimation in economic studies is complicated by the limited efficiency of conventional methods used with such data. As a first-order approximation, sample size requirements can be estimated on the basis of the approximately log-normal distribution of cost data. Sample size estimates based on ratios of cost and efficacy should consider the variance and covariance of both measures and the desired level of precision. Although interim analyses of large trials of expensive technologies might be desirable, such trials are seldom designed with early stopping rules based on secondary outcomes such as cost, which are often not available until later or even after trial completion.

6. Outcome Measures and Analysis

Economic outcomes should include measures of activity level (including time and resources used) and the unit costs of such activity. The economic resource and cost information collected should be objective and comprehensive, yet limited to that needed to address prestated hypotheses, matching clinical measures in style and frequency. Resource utilization measures should be specified in advance and applied equally to each intervention group, ideally by blinding both the patient and the investigator to the treatment assignment. Where this is not feasible or ethical, standardized measurement of economic outcomes should be applied equally to treatment groups.
The quality and completeness of observation, measurement, and data recording are important to minimize bias and random error in a trial. Missing data associated with death, disability, treatment delay, loss to follow-up, or noncompliance may result in either item nonresponse at a specific point in time or unit nonresponse, where most information on a resource component is missing. Missing data, even when missing at random, will reduce the power of the study analysis. Of greater concern, however, is the possibility that missing data, including subject withdrawal, may be missing nonrandomly and bias group comparisons. Although the prospective concurrent collection of outcome data in a CCT generally reduces the potential for missing data, methods for minimizing and dealing with missing data should be discussed in advance and explicitly handled in the analysis. The primary data analysis and any planned subgroup analysis should be described in advance in sufficient detail to provide the reader with a full understanding of the planned analysis. Even the most elegant analysis, however, will not salvage an underpowered or biased clinical trial.
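A small sketch, with invented per-patient costs, of why simple handling of item nonresponse is not free: complete-case analysis loses subjects, and mean imputation preserves the mean while understating variance:

```python
from statistics import mean, variance

# Hypothetical per-patient costs with item nonresponse (None = missing).
costs = [4200.0, 5100.0, None, 3800.0, None, 6400.0, 4900.0]

observed = [c for c in costs if c is not None]

# Complete-case analysis: drop subjects with missing values (loses power).
cc_mean = mean(observed)

# Simple mean imputation: fills the gaps, but the imputed values
# contribute no spread of their own, so variance is understated.
imputed = [c if c is not None else cc_mean for c in costs]

assert mean(imputed) == cc_mean                 # mean unchanged
assert variance(imputed) < variance(observed)   # variance shrinks
```

Under the missing-completely-at-random assumption both estimators are unbiased for the mean; the variance understatement is what motivates the multiple imputation and bootstrap approaches discussed later.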
C. Study Conduct Considerations
1. Resource Utilization Data

Patient monitoring and data collection procedures for conventional clinical outcomes in the conduct of a clinical trial are relatively standardized. Unfortunately, economic data are often considered of secondary importance or are added to a clinical trial as an afterthought and relegated to a low level of importance. Resource utilization data in a CCT are most accurate and complete when collected concurrently with efficacy data. The types of resource utilization generally considered in such studies are summarized in Table 3. Economic analyses in association with CCTs often do not adequately address changes in resource utilization and cost that occur over time. Answers to economic questions depend on resource utilization and cost through the period of full recovery or death, requiring longer patient monitoring than is needed for the estimation of clinical efficacy. It is essential that the same systematic effort and precision be applied to the collection of economic outcomes as are used to measure clinical efficacy. It is also important that resources consumed and unit costs be measured separately, since they vary quite differently: resource utilization depends primarily on the clinical situation, whereas unit costs vary considerably between institutions, regions, and health systems and over time. It is essential to distinguish between resource utilization related to the intervention and that related to the conduct of the trial, including data collection and altered patterns of care and follow-up.

2. Unit Cost Data

Costing methodology varies considerably between studies. When the focus is on internal validity and maintaining the direct association between resource utilization and cost, concurrent and prospective collection of cost information should be considered. Even when concurrent costing is not feasible or desirable, site-specific cost information should be applied to the pooled resource data.
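The separation of resource counts from unit costs described above can be sketched directly; the resource items and cost schedules below are hypothetical:

```python
# Hypothetical per-patient resource utilization recorded in the trial,
# kept separate from unit costs, which vary by institution and over time.
resources = {"hospital_days": 4, "clinic_visits": 6, "chemo_cycles": 3}

# Two hypothetical unit-cost schedules (e.g., two institutions).
site_a = {"hospital_days": 1_500.0, "clinic_visits": 200.0, "chemo_cycles": 2_400.0}
site_b = {"hospital_days": 900.0, "clinic_visits": 150.0, "chemo_cycles": 2_400.0}

def total_cost(resources, unit_costs):
    """Apply a unit-cost schedule to pooled resource-utilization data."""
    return sum(n * unit_costs[item] for item, n in resources.items())

# Identical utilization, different totals under different cost schedules.
print(total_cost(resources, site_a))
print(total_cost(resources, site_b))
```

Keeping the two components separate is what allows the same trial's utilization data to be re-costed for a different institution, region, or year.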
When the focus is on external validity, or when institutional data are not considered representative, the use of more representative external unit cost data may be considered. It must always be kept in mind, however, that resource utilization and unit cost information are generally not independent of one another or of the clinical trial design.

D. Analysis Considerations
1. Type of Study

The type of analysis appropriate for an economic evaluation depends on the study design and the nature of the data (30,31). A survey of published randomized trials including an economic evaluation with cost values suitable for statistical
Table 3 Sources of Resource Utilization in Economic Analyses Associated with Cancer Clinical Trials

Direct: medical
1. Hospitalization*
   Routine vs. intensive care
   Frequency
   Duration
   Physician/nursing services
   Laboratory/radiology services (type and number of tests)
   Pharmacy services (medications, chemotherapy)
   Radiation therapy services
   Drugs/treatments
   Surgical procedures
   Blood bank services (transfusions)
   Other services: support services
2. Ambulatory (clinic)
   Frequency
   Outpatient tests/procedures
   Outpatient treatment (surgery, radiation, chemotherapy, etc.)
3. Nursing home/hospice care
   Visits (M.D., R.N., Social Services, other)
Direct: nonmedical
1. Loss of work time by patient, family, and friends during treatment
   Lost wages, distance traveled, time spent
Indirect: medical
1. Medical/nursing services
   Home visits
   Interim testing
2. Physical therapy
3. Social Services
4. Other medical support services
Indirect: nonmedical
1. Impact on family resources
   Days lost from work
   Transportation costs
   Out-of-pocket expenses

* Direct and indirect institutional expenditures including overhead (utilities, rent, etc.), equipment maintenance and depreciation, and consumables.
analysis, was recently reported (32). Economic outcomes collected in the context of CCTs are often considered of secondary importance, with limited attention given to prestated economic hypotheses, sample size requirements, missing data, or multiple testing issues. The economic data in such studies are often derived from small subsamples or from separate nonsampled sources. The results of such economic studies should be viewed as exploratory or hypothesis generating and should be presented descriptively. When information on variability is available, it is often informative to review the distribution of each outcome measure along with some percentile range such as the interquartile range. Economic measures are often skewed, with frequent outliers and greater variability than most clinical measures. Measurement variability is often greater for indirect costs, where missing or incomplete data are also more likely to be a problem. Calculated mean costs or combined measures of cost and efficacy, such as cost-effectiveness, frequently ignore the inherent variability between subjects, relying instead on sensitivity analyses to assess the robustness of any conclusions. In such analyses, the investigator controls the variation and range of parameters, the potential interaction between parameters is ignored, and robustness is arbitrarily defined. In larger studies incorporating a limited number of a priori economic hypotheses, the same rigor of statistical analysis should be applied as is used for assessing clinical efficacy. Study evaluation should be based on an intention-to-treat analysis and appropriately powered to measure effect sizes of economic importance, carefully considering measurement distributions, missing data, and multiple comparisons. The source of unit cost information and any discounting applied should be justified, and the external validity or generalizability of results should be discussed.

2. Missing Data

Missing data may have an impact not only on study precision, by reducing the number of subjects with complete data, but also on study validity, by biasing outcome estimates if the missing data are associated with outcome measures or treatment group assignment. Missing data can also complicate multivariate modeling, which considers only cases for which data are available on all covariates. The relationship between missing data and treatment group assignment, efficacy and cost outcomes, or important covariates should be studied. If missing data are independent of observed and unobserved data, they are considered missing completely at random and can be dealt with by complete-case analysis (with some loss in power) or by simple imputation of missing values, such as the last observed or mean values (with some underestimation of variance). When missing data are missing at random, that is, related only to the observed data, multiple imputation techniques and bootstrapping can provide more reasonable point estimates and variance estimates. The most difficult situation is that associated with informative missing
data, in which the probability of missingness depends on the unobserved values or on the parameters of interest. It is generally not considered necessary to have unit cost information on all subjects as long as resource utilization data are complete. Unit cost data can be collected on a subset of patients or from an independent data source, which may actually increase external validity, with total costs estimated by regression or multiple imputation techniques. When the variance of unit costs and the covariance between cost and resources used are not available, the robustness of the assumptions may be assessed with a sensitivity analysis. Such an analysis is limited by the potential bias in selecting variables for analysis, the range of values considered, the lack of standard criteria for "robustness," and the inability to address interaction.

3. Statistical Analysis

Economic outcomes of a clinical trial, such as activity level or cost, are seldom equal among the study groups. The observed differences in outcome may represent either true effects or differences due to random error (variability) or systematic error (bias). A true difference is supported by a large treatment effect, small variability, large sample size, a low false-positive rate, and a low false-negative rate. In the analysis of a clinical trial, random error is addressed through statistical inference, including estimation and hypothesis testing. Estimation summarizes the distribution of outcomes, providing measures of central tendency, such as means or proportions, and measures of variability or precision, such as confidence intervals, which represent the upper and lower bounds likely to contain the true value of a parameter. Hypothesis testing involves an assessment of the probability of obtaining the observed difference in outcome under the null hypothesis of no true difference between the groups.
Economic analyses are often faced with multiple outcome measures and repeated measures over time, which increases the chance of observing a statistically significant difference due to chance alone. Appropriate adjustment in significance levels for multiple testing in the analysis is necessary. Although it is sometimes useful to compare cumulative cost distributions between groups using a general nonparametric technique such as the Kolmogorov-Smirnov test, more powerful methods exist for comparing specific distribution parameters such as mean and median costs. Cost Differences. Statistical inference in economic studies is most commonly based on differences in arithmetic mean costs between treatment groups since only these estimates permit ready calculation of the total costs of interest. Inferences on cost differences between treatment groups should be supported by measures of precision (e.g., confidence intervals) of the estimated difference in mean costs or appropriate hypothesis testing considering outcome distributions. Cost data, however, are often highly skewed due to high costs incurred by a few patients. When dealing with very large data sets that are reasonably well behaved,
greater power will be associated with the use of parametric analyses such as Student's t test, which may be reasonably applied with confidence intervals calculated on the mean cost difference. Log transformation of costs will reduce the impact of outliers and may be useful when it results in normal and similarly sized distributions (30,33). However, inference based on log-transformed costs compares geometric means, which does not address the primary issue of importance, namely differences in arithmetic mean costs. Zhou and Gao (34) proposed a Z-score for differences in means when group variances are not equal, exploiting the fact that, for log-normal data, the log of the mean of the untransformed costs equals the mean of the log-transformed costs plus one half of their variance. Alternative methods, including the truncation of outliers, will result in the loss of economically important information and may yield misleading results. When faced with smaller data sets or unresolved skewed distributions, analysis with nonparametric methods such as rank and log-rank tests is more appropriate. The Wilcoxon rank-sum (Mann-Whitney) test is often used in this situation, since it is much more efficient for comparing asymmetric distributions and yet remains relatively efficient when comparing normal distributions.
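The gap between what log-scale inference targets (the geometric mean) and what the economic question targets (the arithmetic mean) is easy to see numerically; the cost figures below are invented to mimic a right-skewed cost distribution:

```python
import math
from statistics import mean

# Hypothetical right-skewed costs: a few patients incur very high costs.
costs = [2_000, 2_500, 3_000, 3_200, 4_000, 5_000, 48_000]

# Arithmetic mean: driven up by the outlier, but it is what determines
# total resource consumption (n * mean = total cost).
arith = mean(costs)

# Geometric mean: what a t test on log-transformed costs compares.
geom = math.exp(mean(math.log(c) for c in costs))

assert geom < arith  # geometric mean understates total consumption
print(f"arithmetic {arith:,.0f} vs geometric {geom:,.0f}")
```

Because payers and planners ultimately care about totals, an analysis that only compares geometric means can reach the right statistical conclusion about the wrong quantity.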
When cost data are censored before death, censoring is informative with regard to costs and survival. The different scales for death and censoring can result in informative censoring even if no deaths are observed (27). When covariate adjustment is needed, the log-rank test related to the proportional hazards regression method of Cox may have advantages. Rank procedures, however, generally assume that group distributions have the same variance and shape, and they replace economically relevant information with ranks. In addition, they compare the medians and the distributions of costs rather than arithmetic mean cost differences. Recently, nonparametric bootstrap methods based on the original data have been proposed that make no assumptions about the shape or equality of the underlying distributions (36). The observed data are treated as an empirical probability distribution that can be sampled repeatedly with replacement, providing a distribution of outcomes from which confidence limits and hypothesis tests can be derived. In addition, Bayesian methods based on subjective prior beliefs have been proposed, but the need to specify prior distributions and the attendant computational complexity limit their applicability.
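The bootstrap approach described above can be sketched in a few lines; the per-patient costs are hypothetical and deliberately skewed:

```python
import random
from statistics import mean

def bootstrap_ci(group_a, group_b, n_boot=5000, alpha=0.05, seed=1):
    """Nonparametric bootstrap confidence interval for the difference in
    arithmetic mean costs (A minus B): resample each arm with
    replacement, recompute the mean difference, and take percentiles of
    the resulting empirical distribution. No assumptions are made about
    the shape or equality of the underlying cost distributions."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(group_a, k=len(group_a)))
        - mean(rng.choices(group_b, k=len(group_b)))
        for _ in range(n_boot)
    )
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical skewed per-patient costs (thousands of dollars), two arms.
arm_a = [4, 5, 6, 5, 7, 5, 6, 30, 4, 5, 6, 5, 7, 5, 6, 28]
arm_b = [3, 4, 4, 3, 5, 4, 3, 4, 22, 3, 4, 4, 3, 5, 4, 3]
lo, hi = bootstrap_ci(arm_a, arm_b)
print(f"95% bootstrap CI for the mean cost difference: ({lo:.2f}, {hi:.2f})")
```

Note that the interval is built directly on the arithmetic-mean scale, sidestepping both the geometric-mean problem of log transformation and the rank-scale problem of the Wilcoxon test.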
Combined Outcome Measures. Statistical inference on combined measures of cost and effectiveness is complicated by the lack of information on the variance and covariance structure of costs and clinical efficacy. This is often dealt with conservatively by presenting variances or confidence limits around the point estimates of efficacy and of resource utilization or cost separately, ignoring the correlation between cost and benefit. A "confidence box" may be defined by estimating confidence limits separately for incremental effect and incremental cost. The resulting confidence limits on the cost-effectiveness plane are generally considered overly conservative. In addition, the confidence limits are problematic when the uncertainty includes different quadrants of the cost-effectiveness plane (37). Several methods for estimating confidence intervals for cost-effectiveness ratios based on the joint variance of cost and efficacy have been proposed, none of which is entirely satisfactory (38). Parametric estimation of the joint density of incremental effect and cost considers the covariance, generally defining an ellipse on the cost-effectiveness plane. Van Hout et al. (39) calculated the probability that the cost-effectiveness ratio falls below a defined maximum acceptable ratio on the cost-effectiveness plane (Fig. 5), which they claim equals the integral of the joint probability distribution f(E, C) over the appropriate regions of the plane around maximum likelihood point estimates for cost-effectiveness, where E and C represent the observed incremental mean effectiveness and mean cost, respectively. Hlatky et al. (40) reported the use of the bootstrap technique to obtain a nonparametric estimate of the joint density and of the probability of results falling below a specified threshold level of cost-effectiveness. Assessing the probabilities associated with varying ceiling cost-effectiveness ratios defines an acceptability curve, on which the 50th percentile defines the point estimate.
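A bootstrap estimate of the probability of falling below a ceiling ratio, in the spirit of the Hlatky et al. approach described above, can be sketched as follows; the evaluation is done on the net-benefit scale to avoid the sign ambiguities of the raw ratio, and all data would come from the trial arms:

```python
import random
from statistics import mean

def prob_acceptable(cost_new, cost_std, eff_new, eff_std,
                    ceiling, n_boot=2000, seed=2):
    """Bootstrap estimate of the probability that a new intervention is
    acceptable at a given ceiling cost-effectiveness ratio: resample the
    four samples with replacement, compute incremental mean cost and
    effectiveness, and count how often the result lands in the
    acceptable region of the CE plane (ceiling * dE - dC > 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        d_cost = (mean(rng.choices(cost_new, k=len(cost_new)))
                  - mean(rng.choices(cost_std, k=len(cost_std))))
        d_eff = (mean(rng.choices(eff_new, k=len(eff_new)))
                 - mean(rng.choices(eff_std, k=len(eff_std))))
        if ceiling * d_eff - d_cost > 0:
            hits += 1
    return hits / n_boot
```

Evaluating `prob_acceptable` over a grid of ceiling ratios traces out exactly the acceptability curve described in the text.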
Acceptability curves can be used to summarize uncertainty in cost-effectiveness studies. Such a curve crosses the probability axis at the one-sided p value for the incremental cost (ΔC) and is asymptotic to 1 minus the one-sided p value for the incremental effectiveness (ΔE). Therefore, confidence limits may be defined for the ceiling cost-effectiveness ratio from the acceptability curve. Confidence limits may also be derived from the net-benefit statistic, where the net benefit (NB) is defined as

NB = CER_ceiling × ΔE − ΔC

The net benefit can be shown to be normally distributed, with variance and confidence limits defined as

Var(NB) = CER_ceiling² × var(ΔE) + var(ΔC) − 2 × CER_ceiling × cov(ΔE, ΔC)

Confidence limits = (NB − z_α/2 √Var(NB), NB + z_α/2 √Var(NB))

The net-benefit statistic offers some advantages for handling uncertainty in cost-effectiveness analysis, including sample size calculation (37). It has also been
suggested that a Bayesian approach allows a more direct method for estimating cost-effectiveness ratios (41).

4. Adjustment

Covariate adjustment is generally undertaken for one of three reasons: to increase precision or tighten confidence intervals on estimates of treatment effect, to increase validity by controlling for confounding bias, and to estimate outcomes in patient subgroups. Covariate adjustment of treatment effect and costs will nearly always increase power. Covariates of particular interest in economic analyses include demographic factors (age, sex, race, marital status), socioeconomic factors (income, education, occupation, employment status, residence, family/caregiver status, type of health insurance and provider organization), and comorbidities (functional status, prior treatment). The outcomes of interest in economic analysis are the absolute cost difference and the absolute treatment effect, which depend on the control survival and the relative survival advantage with treatment. Even when relative treatment effects are the same across subgroups, covariate adjustment is necessary to estimate absolute effects because of the heterogeneity in prognostic factors. Despite efforts to minimize bias in the design and conduct of a clinical trial, the distribution of known prognostic factors within treatment groups should be evaluated. Any covariate found to be associated with both treatment group assignment and the outcome of interest must be considered a possible confounding factor and addressed further in the analysis. If actual confounding has occurred, the apparent relationship between treatment and outcome will be either strengthened or weakened with adjustment through either stratified analysis or multivariate modeling. While multiple regression is commonly used for covariate adjustment, the skewness of cost distributions may result in overestimates of variance and broad confidence limits.
Regression on linear and logarithmic transformations of costs may not yield normal residuals, limiting the interpretation of results. The proportional hazards regression method of Cox has been proposed for skewed resource or cost data, providing estimates of mean cost differences when treatment assignment is included as a covariate (42). Such models are complex, make no assumption about the distribution of costs for an individual, and must satisfy the proportional hazards and linearity assumptions of the model. Nevertheless, they permit cost analyses to address censoring, which might otherwise result in low cost estimates in a severe illness or with an intervention associated with high early mortality or withdrawal.

5. Cost Discounting

It is also important to adjust changes in cost or benefit measures for changes over time and place. Cost discounting considers preferences for immediate over future benefit and for delaying present costs to the future. Price adjustment is necessary
Economic Analysis of Clinical Trials
when observations extend over time (>1 year) or across geographical regions, in order to present economic results in a common framework. In the United States, cost adjustments are generally based on the Consumer Price Index or the Fixed Weight Index. All future and past costs are generally expressed in terms of the present or some fixed point in time. The Cost Discount Rate (CDR) represents the cost discount (future cost − present cost) as a proportion of the present cost. The present cost therefore equals the future cost divided by (1 + CDR)^n when discounting is conducted over n years.

6. Subgroup Analyses

Treatment effects and costs in a clinical trial often differ between subgroups of the study population. Although such differences may represent an interaction between the intervention and a covariate (e.g., a prognostic or predictive factor), the observed differences may also be the result of random error or study bias. Statistically significant treatment effects in one group and not in another may simply reflect variation in power: one group may have a more favorable outcome, with fewer events and therefore lower power to show an effect. Even when the treatment effect is uniform across subsets, multiple testing in subgroup analyses is associated with an increased probability of finding significant differences due to chance alone (type I error). Therefore, multiple subgroup analyses should be discouraged and limited to those of major interest, stated in advance of the trial, for which a difference in efficacy or cost-effectiveness might be anticipated (e.g., stratification factors). Subgroup analyses should include measures of variability in the effect measures, such as confidence limits. Unless there are valid reasons to expect such subgroup differences in treatment effect, strong evidence for such effect modification should be provided.
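Returning to cost discounting for a moment, the present-value relationship just described can be sketched in a few lines of Python. This is illustrative only; the 3% rate and dollar amount are hypothetical:

```python
def present_cost(future_cost, cdr, years):
    """Present value of a cost incurred `years` from now,
    discounted at an annual Cost Discount Rate `cdr`."""
    return future_cost / (1.0 + cdr) ** years

# A $10,000 cost incurred 5 years from now, at a 3% annual discount rate:
pv = present_cost(10000.0, cdr=0.03, years=5)   # about $8,626 today
```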
The best approach to subgroup analyses is to perform a test for interaction to assess the homogeneity of treatment effect across patient subsets, rather than reporting differences in outcomes between subgroups. It is reasonable to view any such differences with considerable skepticism and to apply more restrictive criteria for judging statistical significance.

7. Modeling

Modeling of the relationship between treatment and outcome is used for a variety of purposes: adjustment for known confounding variables, development of clinical prediction models, and decision modeling of clinical and economic outcomes. Adjustment for confounding factors may improve both validity and precision, providing more accurate estimates. Such models may also permit estimation of outcome differences within subgroups when effects are heterogeneous. Clinical prediction models for patient selection may improve the cost efficiency of an intervention. Ideally, such models should be externally validated on an independent data set, and some measure of goodness of fit of the model to the data should be reported. Clinical
decision models represent valuable methods for the economic evaluation of data from comparative studies of intervention strategies, permitting simultaneous consideration of more than one type of outcome measure (e.g., cost-effectiveness analysis or cost-utility analysis). The analysis of decision models requires specification of the model structure, including choices, chance events, and outcomes; the probabilities of all chance events; and outcome values, including benefits and costs. The analysis of decision models is based on calculating the expected value of each choice by a process of folding back, which involves multiplying the estimated outcome value by the probability of that outcome occurring and summing over all branches of the immediately preceding chance event. This weighted sum represents the expected value of the outcome, which then becomes the outcome value for the immediately preceding step. When a decision point is reached, the choice associated with the greatest expected benefit or lowest expected cost represents the preferred choice. Sensitivity analyses based on such models permit an assessment of the robustness of the optimal strategy by assessing how changes in parameter values affect the expected value of the choices and the threshold where expected outcome values are equal. The threshold probability of an event relates to the ratio of benefits and costs reflected in the values or utilities incorporated into the model (see the Appendix). In such models, emphasis on descriptive and graphical displays is often more rewarding than any formal statistical testing. Despite certain limitations, Markov modeling provides a valuable tool for economic evaluation of chronic diseases, with simultaneous assessment of effectiveness and cost, including discounting with disease progression over time (43).
Particular attention must be paid to the Markovian assumption that state transition probabilities are independent of previous health states; modeling the medical history may therefore require a combination of distinct health states.
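The folding-back computation described above can be made concrete with a small recursive sketch. Nothing here comes from the original text; the tree structure, probabilities, and utilities are hypothetical:

```python
def fold_back(node):
    """Expected value of a decision-tree node by folding back.
    A node is a terminal value (number), a chance node
    ("chance", [(prob, child), ...]), or a decision node
    ("decision", [(label, child), ...]), where the best branch wins."""
    if isinstance(node, (int, float)):
        return node
    kind, branches = node
    if kind == "chance":
        # probability-weighted sum over all branches of the chance event
        return sum(p * fold_back(child) for p, child in branches)
    # decision node: choose the branch with the greatest expected value
    return max(fold_back(child) for _, child in branches)

# Hypothetical treat / no-treat decision with 30% disease probability;
# terminal values are utilities.
tree = ("decision", [
    ("treat",    ("chance", [(0.30, 0.85), (0.70, 0.95)])),
    ("no treat", ("chance", [(0.30, 0.60), (0.70, 1.00)])),
])
best_ev = fold_back(tree)   # 0.92, achieved by the "treat" branch
```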
E. Interpretation and Reporting Considerations
1. Individual Studies

The interpretation and reporting of economic analyses should always consider other possible explanations for the observed differences in outcome, including low study power (sample size), measurement variability, differences in study populations, missing data, and multiple comparisons (44–46). The generalizability of the results for patients outside of the context of the individual CCT should be discussed (47). The costs measured and the details of the cost analysis should be presented and discussed. A review of cost analyses associated with randomized clinical trials revealed that only one half of the studies actually reported cost figures, and few reported indirect costs or study-related costs (44). Economic analyses related to CCTs are subject to the same sources of varia-
Table 4 Guidelines for Economic Analyses Associated with Cancer Clinical Trials
Design
1. Study hypotheses: Define, before study initiation, a limited number of testable economic hypotheses along with their relevance.
2. Study rationale: The logical basis and importance of an economic analysis should be laid out, along with the rationale for conducting an economic evaluation in relationship to a CCT.
3. Perspective: The viewpoint from which the study is to be conducted and analyzed should be specified and justified.
4. Study population: Define the source and nature of the study population (treatment and control groups), including eligibility and exclusion criteria.
5. Sample size: Sample size should be sufficient for valid conclusions concerning primary and major secondary hypotheses: effect size, variability, alpha and beta error (power), multiple comparisons, and repeated measures over time.
6. Treatment assignment: Treatment assignment should be randomized or at least standardized, and the rationale presented.
7. Outcome measures: Measures of clinical efficacy and economic cost should be specified in advance.
8. Planned analyses: The type of economic analysis should be specified and justified in advance, including any subgroup analysis planned and any model to be used.
Data collection
1. Outcome measures: All pertinent clinical efficacy, quality of life, and economic outcomes should be measured using valid instruments specified in advance.
2. Activity measures (quantities of resources used): These and unit costs (direct and indirect) should be collected and reported separately.
3. Unit costs: Measure and record unit cost data, including adjustments/conversions.
4. Missing data: Every effort should be made to minimize incomplete data.
Data analysis
1. Separate analysis: Resources used and unit costs should be analyzed separately before any combined analysis.
2. Estimation: After careful examination of distributions, calculate summary estimates and confidence limits for treatment effect, resource quantities, and unit costs.
3. Combined outcomes: Focus subsequent estimation on combined outcomes of incremental cost and incremental efficacy (cost-effectiveness or cost-utility ratios).
4. Hypothesis testing: Apply appropriate methods of statistical inference to group comparisons based on the observed data distributions.
5. Multiple testing: Correct for multiple testing related to multiple outcomes and repeated measures over time.
6. Power: Estimate statistical power for evaluating group comparisons and assessing confidence in reported results.
7. Discounting: Any applied discount rate for inflation/time should be specified and justified.
Table 4 Continued
8. Cost-effectiveness/utility: Treatment groups should be compared on the basis of an incremental analysis, e.g., marginal cost-effectiveness.
9. Quality of life: HRQOL measures should be reported separately and in combined measures with other clinical measures.
10. Modeling: Any model used should be justified, and model parameters, including probabilities and outcome values, should be appropriately estimated and justified.
11. Sensitivity analysis: Sensitivity analyses should be based on valid models, with justification for the range over which variables are varied.
Data interpretation and reporting
1. Methods: Present methods fully, including a priori hypotheses, study population, sample size originally planned, treatment assignment, outcome measures including resource utilization and cost estimates, and data analysis including statistical inference and modeling.
2. Resource utilization: Present resources used and cost estimates separately, utilizing appropriate aggregate or combined measures.
3. Results: Discuss results in the context of the primary and any secondary hypotheses.
4. Limitations: Discuss the limitations of the study design, study population, measurements obtained, and analyses performed.
5. Validity: Discuss the issues of internal and external validity, including generalizability to other settings.
6. Relevance: Discuss the importance of the study question, including relevance to clinical decision making, cost efficiency, and health policy formulation.
tion in results as other clinical investigations. In a recent review of 45 randomized trials that included individual cost data, 25 (56%) presented statistical tests or measures of precision for the cost comparisons between groups, whereas only 9 (20%) reported adequate measures of variability (48). The authors of this study concluded that only 36% provided conclusions justified on the basis of the data presented. Preliminary guidelines for the design, conduct, analysis, and interpretation of economic analyses associated with clinical trials are offered in Table 4.

2. Meta-Analysis

If the existing information already suggests that the intervention in question is efficacious, then it may be reasonable to base an economic analysis on either a systematic review or a formal meta-analysis. Meta-analysis can form the basis of an economic evaluation by systematically summarizing the results of several
studies of a given clinical intervention, providing greater confidence in estimates of treatment effect and resource utilization than individual studies. Such an analysis is limited by the type and quality of economic data collected or reported. Meta-analysis of economic evaluations related to clinical trials must address the same methodological challenges as other meta-analyses. The principal difficulty consists of identifying and accessing all relevant results on a particular issue, considering publication bias due to failure to publish negative study results, studies independent of the primary trial, and studies commissioned for specific administrative purposes. Computerized literature searches are inadequate for identifying unpublished analyses. Clinical trial data banks may identify additional clinical trials with concurrent economic evaluations but will not detect independent economic evaluations. Any economic analysis based on the results of a meta-analysis is constrained by the potential bias from incomplete ascertainment and by the incomplete collection and reporting of resource use in most CCTs. Nevertheless, when properly designed and conducted, economic analyses based on such comprehensive data may provide powerful information on the cost efficiency of an intervention.
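As a minimal illustration of the quantitative pooling step, a fixed-effect (inverse-variance) combination of per-study mean cost differences might look as follows; the three study summaries are invented for the example:

```python
import math

def fixed_effect_pool(estimates, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)

# Invented per-study mean cost differences (dollars) and their variances:
pooled, pooled_var = fixed_effect_pool([1200.0, 800.0, 1000.0],
                                       [40000.0, 90000.0, 60000.0])
se = math.sqrt(pooled_var)
ci = (pooled - 1.96 * se, pooled + 1.96 * se)
```

Precise studies (small variances) dominate the weighted average, which is why the pooled estimate here sits closest to the most precise study.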
V. SUMMARY AND CONCLUSIONS
In conclusion, cancer care is associated with both clinical and economic outcomes of interest. Economic analyses have gained increasing importance in the evaluation of costly cancer treatments in the setting of limited resources (49–59). CCTs appear to represent an excellent source of carefully obtained information for incorporation into economic analyses. In many ways, such trials provide a desirable environment for assessing complementary outcomes such as costs and quality of life in addition to measures of clinical efficacy. Recent reviews of the analysis and interpretation of economic data in randomized controlled trials reveal a lack of awareness about important statistical issues. Guidelines have been provided here for the design and analysis of economic studies in association with CCTs. Clearly, the investigator must first decide whether an economic analysis is needed and whether a CCT is a reasonable framework. When such analyses are warranted, the same methodological rigor in design, conduct, analysis, and reporting should be applied as used for conventional measures of clinical efficacy. Ideally, the economic analysis will be incorporated into a written and approved protocol, including a priori hypotheses, the population to be studied, the clinical and economic measurements to be obtained, and the planned statistical analysis. Attention to the many important issues in planning, conducting, analyzing, or reporting an economic analysis related to a CCT discussed in this chapter will enhance the quality and validity of the study. The investigator must ultimately decide how
to interpret and present the data and what they mean in the broader health care setting. Perhaps of greatest importance is the generalizability of the economic results of a clinical trial to the routine application of such interventions within a larger population. In the years to come, such analyses will play an increasingly important role in clinical decision making, individual patient counseling, evidence-based clinical guideline development, reimbursement, and national and international health policy formulation. The ability to properly measure and analyze such data will greatly aid clinicians and health care planners in providing optimal quality and cost-effective care to patients with cancer (60).
APPENDIX: DECISION MODEL THRESHOLD ANALYSIS BASED ON BENEFITS AND COSTS

Each possible outcome in a realistic clinical situation can be considered to have a certain value or utility (U) and a certain probability of disease (p). The expected values of the treatment and no-treatment strategies are therefore

EV_treatment = p ⋅ U_treat/disease + (1 − p) ⋅ U_treat/no disease
EV_no treatment = p ⋅ U_no treat/disease + (1 − p) ⋅ U_no treat/no disease

The treatment strategy associated with the greatest expected value should be chosen to optimize the likelihood of the best result. The benefits and costs can be derived from utility estimates as shown:

Benefit of treatment = U_treat/disease − U_no treat/disease
Cost of treatment = U_no treat/no disease − U_treat/no disease

A sensitivity analysis could be conducted comparing the expected value functions as the probability of disease is varied. Most often, however, we are interested in determining the threshold probability at which the expected values of the two strategies are equal, i.e., EV_treatment = EV_no treatment:

p ⋅ U_treat/disease + (1 − p) ⋅ U_treat/no disease = p ⋅ U_no treat/disease + (1 − p) ⋅ U_no treat/no disease

Solving for p gives

p_threshold = (U_no treat/no disease − U_treat/no disease) / (U_treat/disease − U_no treat/disease + U_no treat/no disease − U_treat/no disease)
= cost/(benefit + cost)
= 1/[(benefit/cost) + 1]
From such a relationship, it is evident that as the ratio of benefit to cost increases, the threshold probability of disease decreases. Above the threshold
probability of disease, treatment will be associated with a greater expected value and will therefore be the favored strategy. The indications for treatment therefore broaden as the ratio of benefit to cost increases.
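The threshold calculation above is easily mechanized. The utilities in the sketch below are hypothetical, chosen so that treatment carries a large benefit for diseased patients and a small cost for healthy ones:

```python
def threshold_probability(u_td, u_tn, u_nd, u_nn):
    """Disease probability at which treat and no-treat strategies have
    equal expected value; above it, treatment is preferred.
    u_td: treat & disease,      u_tn: treat & no disease,
    u_nd: no treat & disease,   u_nn: no treat & no disease."""
    benefit = u_td - u_nd      # gain from treating diseased patients
    cost = u_nn - u_tn         # loss from treating healthy patients
    return cost / (benefit + cost)

# Hypothetical utilities: benefit = 0.25, cost = 0.05, so threshold = 1/6.
p_threshold = threshold_probability(u_td=0.85, u_tn=0.95, u_nd=0.60, u_nn=1.00)
```

As the benefit-to-cost ratio grows, the threshold falls, matching the qualitative conclusion drawn in the text.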
REFERENCES 1. Brown ML. The national economic burden of cancer: an update. J Natl Cancer Inst 1990; 82:1811–1814. 2. Brown ML, Fintor L. The economic burden of cancer. In: Cancer Prevention and Control. New York: Marcel Dekker. 1995. 3. Gaumer GL, Stavins J. Medicare use in the last 90 days of life. Med Care 1991; 29:725–742. 4. Baker MS, Kessler LC, et al. Site-specific treatment costs. In: Cancer in Cancer Care and Cost. Health Administration Press 1989. 5. Wright JC, Weinstein MC. Gains in life expectancy from medical interventions: standardizing data on outcomes. N Engl J Med 1998; 330:380–404. 6. Cella DF, Bonomi AE. Measuring quality of life: 1995 update. Oncology 1995; 9:47–60. 7. Weeks J. Measurement of utilities and quality-adjusted survival. Oncology 1995; 9:67–70. 8. Gelber RD, Goldhirsch A, Cavelli F. Quality-of-life-adjusted evaluation of adjuvant therapy for operable breast cancer. Ann Intern Med 1991; 114:621–628. 9. Gotay CC, Korn EL, McCabe MS, Moore TD, Cheson BD. Quality-of-life assessment in cancer treatment protocols: research issues in protocol development. J Natl Cancer Inst 1992; 84:575–579. 10. Drummond MF. Resource allocation decisions in health care: a role for quality of life assessments. J Chronic Dis 1987; 40:605–616. 11. Staquet MJ, Hays RD, Fayers PM. Quality of Life Assessment in Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1998. 12. Olschewski M, Schulgen G, Schumacher M, Altman DG. Quality of life assessment in clinical cancer research. Br J Cancer 1994; 70:1–5. 13. Pocock SJ. A perspective on the role of quality-of-life assessment in clinical trials. Controlled Clin Trials 1991; 12:257S–265S. 14. Fayers PM, Hopwood P, Harvey A, Girling DJ, Machin D, Stephens R. Quality of life assessment in clinical trials—guidelines and a checklist for protocol writers: the U.K. Medical Research Council Experience. Eur J Cancer 1997; 33:20–28. 15. Detsky AS, Naglie IG.
A clinician’s guide to cost-effectiveness analysis. Ann Intern Med 1990; 113:147–154. 16. Task Force on Principles for Economic Analyses of Health Care Technology. Economic analyses of health care technology: a report on principles. Ann Intern Med 1995; 122:61–70. 17. American Society of Clinical Oncology. Outcomes of cancer treatment for technology assessment and cancer treatment guidelines. J Clin Oncol 1996; 14:671– 679.
18. Russell LB, Gold MR, Siegel JE, Daniels N, Weinstein MC. The role of costeffectiveness analysis in health and medicine. JAMA 1996; 276:1172–1177. 19. Weinstein MC, Siegel JE, Gold MR, Kamiet MS, Russell LB. Recommendations of the panel on cost-effectiveness in health and medicine. JAMA 1996; 276:1253– 1258. 20. Siegel JE, Weinstein MC, Russell LB, Gold MR. Recommendations for reporting cost-effectiveness analysis. JAMA 1996; 276:1330–1341. 21. Udvarhelyi IS, Colditz GA, Rai A, Epstein AM. Cost-effectiveness and cost benefit analyses in the medical literature: are the methods being used correctly? Ann Intern Med 1992; 116:238–244. 22. Drummond MF, Davies L. Economic analysis alongside clinical trials. Revisiting the methodological issues. Int J Technol Assess Health Care 1991; 7:561–573. 23. Drummond MF, Stoddart GL. Economic analysis and clinical trials. Controlled Clin Trials 1984; 5:115–128. 24. Bennett CL, Golub R, Waters TM, Tallman MS, Rowe JM. Economic analyses of phase III cooperative cancer group clinical trials: are they feasible? Cancer Invest 1997; 15:227–236. 25. Bennett CL, Armitage JL, Buchner D, Gulati S. Economic analysis in phase III clinical cancer trials. Cancer Invest 12:336–342. 26. Bennett CL, Westerman IL. Economic analysis during phase III clinical trials: who, what, when, where, and why? Oncology 1994; 9:169–175. 27. Brown M, Glick HA, Harrell F, et al. Integrating economic analysis into cancer clinical trials: the National Cancer Institute-American Society of Clinical Oncology Economics Workbook. J Natl Cancer Inst Monogr 1998; 24:1–28. 28. Coyle D, Davies L, Drummond MF. Trials and tribulations: emerging issues in designing economic evaluations alongside clinical trials. Int J Technol Assess Health Care 1998; 14:135–144. 29. O’Brien BJ, Drummond MF, Labelle RJ, Willan A. In search of power and significance: issues in the design and analysis of stochastic cost-effectiveness studies in health care. Med Care 1994; 32:150–163. 30. 
Rutten-van Mölken MPMH, Van Doorslaer EKA, Van Vliet RCJA. Statistical analysis of cost outcomes in a randomized controlled clinical trial. Health Econ 1994; 3:333–345. 31. Grieve AP. Issues for statisticians in pharmaco-economic evaluations. Stat Med 1998; 17:1715–1723. 32. Barber JA, Thompson SG. Analysis and interpretation of cost data in randomized controlled trials: review of published studies. Br Med J 1998; 317:1195–1200. 33. Thompson SG, Barber JA. How should cost data in pragmatic randomized trials be analyzed? Brit Med J 2000; 320:1197–1200. 34. Zhou XH, Gao S. Confidence intervals for log-normal means. Stat Med 1997; 16:783–790. 35. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53:457–481. 36. Desgagne A, Castilloux A-M, Angers J-F, LeLorier J. The use of the bootstrap statistical method for the pharmacoeconomic cost analysis of skewed data. Pharmacoeconomics 1998; 13:487–497.
37. Briggs AH, Fenn P. Confidence intervals or surfaces? Uncertainty on the cost effectiveness plane. Health Econ 1998; 7:723–740. 38. Willan AR, O’Brien BJ. Confidence intervals for cost-effectiveness ratios: an application of Fieller’s theorem. Health Econ 1996; 5:297–305. 39. Van Hout BA, Maiwenn JA, Gilad S. Costs, effects and C/E ratios alongside a clinical trial. Health Econ 1994; 3:309–319. 40. Hlatky MA, Boothroyd DB, Johnstone IM, et al. Long-term cost-effectiveness of alternative management strategies for patients with life-threatening ventricular arrhythmias. J Clin Epidemiol 1997; 50:185–193. 41. Heitjan DF, Moskowitz AJ, Whang W. Bayesian estimation of cost-effectiveness ratios from clinical trials. Health Econ 1999; 8:191–201. 42. Dudley RA, Harrell FE, Smith LR, et al. Comparison of analytic models for estimating the effect of clinical factors on the cost of coronary artery bypass graft surgery. J Clin Epidemiol 1993; 46:261–271. 43. Briggs A, Sculpher M. An introduction to Markov modelling for economic evaluation. Pharmacoeconomics 1998; 13:397–409. 44. Balas EA, Rainer ACK, Gnann W, et al. Interpreting cost analysis of clinical interventions. JAMA 1998; 279:54–57. 45. Drummond MF, Jefferson TO, on behalf of the BMJ Economic Evaluation Working Party. Guidelines for authors and peer reviewers of economic submissions to the BMJ. Br Med J 1996; 313:275–283. 46. Torgerson DS, Campbell MK. Cost effectiveness calculations and sample size. Brit Med J 2000; 321:697. 47. Fayers PM, Hand DJ. Generalization from phase III clinical trials: survival, quality of life, and health economics. Lancet 1997; 350:1025–1027. 48. Barber JA, Thompson SG. Analysis and interpretation of cost data in randomised controlled trials: review of published studies. Br Med J 1998; 317:1195–1200. 49. Earle CC, Coyle D, Evans WK. Cost-effectiveness analysis in oncology. Ann Oncol 1998; 9:475–482. 50. Smith TJ, Hillner BE, Desch CE.
Efficacy and cost-effectiveness of cancer treatment: rational allocation of resources based on decision analysis. J Natl Cancer Inst 1993; 85:1460–1474. 51. Smith, TJ, Hillner BE. The efficacy and cost-effectiveness of adjuvant therapy of early breast cancer in pre-menopausal women. J Clin Oncol 1993; 11:771–776. 52. Reeves GAG. Cost effectiveness in oncology. Lancet 1985; 2:1405–1408. 53. Goodwin PJ, Feld R, Evans WK, Pater J. Cost-effectiveness of cancer chemotherapy: an economic evaluation of a randomized trial in small-cell lung cancer. J Clin Oncol 1988; 6:1537–1547. 54. Smith TJ, Hillner BE, Neighbors DM, McSorley PA, LeChevalier T. Economic evaluation of a randomized clinical trial comparing vinorelbine, vinorelbine plus cisplatin, and vindesine plus cisplatin for non-small cell lung cancer. J Clin Oncol 1995; 13:2166–2173. 55. Jaakkimainen L, Goodman PJ, Pater J, et al. Counting the costs of chemotherapy in a National Cancer Institute of Canada randomized trial of non-small cell lung cancer. J Clin Oncol 1990; 8:1301–1309. 56. Hillner BE, Smith TJ, Desch CE. Efficacy and cost-effectiveness of autologous bone
marrow transplantation in metastatic breast cancer: estimates using decision analysis while awaiting clinical trial results. JAMA 1992; 267:2055–2061. 57. Emanuel EJ, Emanuel LL. The economics of dying: the illusion of cost savings at the end of life. N Engl J Med 1994; 330:540–544. 58. Bailes JS. Cost aspects of palliative cancer care. Semin Oncol 1995; 22:64–66. 59. Torgerson DJ, Campbell MK. Use of unequal randomization to aid the economic efficiency of clinical trials. Brit Med J 2000; 321:759. 60. Goldman DP, Schoenbaum ML, Potsky AL, Weeks JC, Berry SH, Escarce JJ, Weidmer BA, Kilore ML, Wagle N, Adams JL, Figlin RA, Lewis JH, Kaplan R, McCabe M. Measuring the incremental cost of clinical cancer research. J Clin Oncol 2000; 19:105–110.
17 Prognostic Factor Studies
Martin Schumacher, Norbert Holländer, Guido Schwarzer, and Willi Sauerbrei
Institute of Medical Biometry and Medical Informatics, University of Freiburg, Freiburg, Germany
I. INTRODUCTION
Besides investigations on etiology, epidemiology, and the evaluation of therapies, the identification and assessment of prognostic factors constitutes one of the major tasks in clinical cancer research. Studies on prognostic factors attempt to determine survival probabilities or, more generally, a prediction of the course of the disease for groups of patients defined by the values of prognostic factors, and to rank the relative importance of various factors. In contrast to therapeutic studies, however, where statistical principles and methods are well developed and generally accepted, this is not the case for the evaluation of prognostic factors. Although some efforts toward an improvement of this situation have been undertaken (1–4), most studies investigating prognostic factors are based on historical data lacking precisely defined selection criteria. Furthermore, sample sizes are often far too small to serve as a basis for reliable results. As far as the statistical analysis is concerned, a proper multivariate analysis considering simultaneously the influence of various potential prognostic factors on overall or event-free survival of the patients is not always attempted. Missing values in some or all prognostic factors constitute a serious problem that is often underestimated. In general, the evaluation of prognostic factors based on historical data has the advantages that follow-up and other basic data of patients might be readily available in a database and that the values of new prognostic factors obtained
Schumacher et al.
from stored tissue or blood samples may be added retrospectively. However, such studies are particularly prone to some of the deficiencies mentioned above, including insufficient quality of data on prognostic factors and follow-up and heterogeneity of the patient population due to different treatment strategies. These issues are often not mentioned in detail in the publications of prognostic studies but might explain, at least to some extent, why prognostic factors remain controversial and why prognostic models derived from such studies are often not accepted for practical use (5). There have been some ‘‘classic’’ articles on statistical aspects of prognostic factors in oncology (6–10) that describe the statistical methods and principles that should be used to analyze prognostic factor studies. These articles, however, do not fully address the problem that statistical methods and principles are often not adequately applied when analyzing and presenting the results of a prognostic factor study (4,5,11,12). It is therefore a general aim of this chapter not only to present updated statistical methodology but also to point out possible pitfalls when applying these methods to prognostic factor studies. Statistical aspects of prognostic factor studies are also discussed in the monograph on prognostic factors in cancer (13) and in some recent textbooks on survival analysis (14,15). To illustrate important statistical aspects in the evaluation of prognostic factors and to examine the problems associated with such an evaluation in more detail, data from three prognostic factor studies in breast cancer serve as illustrative examples. In this disease, the effects of more than 160 potential prognostic factors are currently under debate; more than 500 papers on the subject were published in 1997 alone. This illustrates both the importance of, and the unsatisfactory situation in, prognostic factor research.
A substantial improvement of this situation seems possible with better application of statistical methodology in this area. Throughout this chapter we assume that the reader is familiar with standard statistical methods for survival data to the extent presented in the more practically oriented textbooks (14–19); for a deeper understanding of why these methods work, we refer to the more theoretically oriented textbooks on survival analysis and counting processes (20–22).
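As a small reminder of the most basic of these methods, the Kaplan–Meier product-limit estimator can be sketched in a few lines of Python; the follow-up times below are invented, with a ‘+’ in the comment marking censored observations:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve: returns (time, S(t)) at each distinct
    event time. events[i] is 1 for an observed event, 0 for censoring."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, s, i = [], 1.0, 0
    while i < len(data):
        t, deaths, leaving = data[i][0], 0, 0
        # gather all subjects with this follow-up time (events and censorings)
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            s *= 1.0 - deaths / at_risk
            surv.append((t, s))
        at_risk -= leaving
    return surv

# Six invented follow-up times (months): 3, 5+, 7, 7, 9+, 11  (+ = censored)
curve = kaplan_meier([3, 5, 7, 7, 9, 11], [1, 0, 1, 1, 0, 1])
# S(3) = 5/6, S(7) = 5/12, S(11) = 0
```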
II. ‘‘DESIGN’’ OF PROGNOSTIC FACTOR STUDIES The American Joint Committee on Cancer has established three major criteria for prognostic factors: Factors must be significant, independent, and clinically important (23). According to Hermanek et al. (13), significance implies that the prognostic factor rarely occurs by chance, independent means that the prognostic factor retains its prognostic value despite the addition of other prognostic factors, and clinically important implies clinical relevance, such as being capable (at least in principle) of influencing patient management and thus outcome.
Prognostic Factor Studies
From these criteria it becomes obvious that statistical aspects play an important role in the investigation of prognostic factors (13,24–26). That is also emphasized by Simon and Altman (4), who give a concise and thoughtful review of statistical aspects of prognostic factor studies in oncology. Recognizing that these will be observational studies, the authors argue that they should be carried out in such a way that the same careful design standards are adopted as are used in clinical trials, except for randomization. For confirmatory studies, which may be seen as comparable to phase III studies in therapeutic research, they listed 11 important requirements, given in somewhat shortened form in Table 1. From these requirements it can be deduced that prognostic factors should be investigated in carefully planned prospective studies with sufficient numbers of patients and sufficiently long follow-up to observe the end point of interest (usually event-free or overall survival). Thus, a prospective observational study in which treatment is standardized and everything is planned in advance emerges as the most desirable study design. A slightly different design is represented by a randomized controlled clinical trial in which, in addition to some therapeutic modalities, various prognostic factors are investigated. It is important in such a setting that the prognostic factors of interest are measured either in all patients enrolled in the clinical trial or in those patients belonging to a predefined subset. Both designs, however, usually require enormous resources and especially a long time until results become available. Thus, a third type of ‘‘design,’’ which can be termed a ‘‘retrospectively defined historical cohort,’’ is used in most prognostic factor studies: stored tumor tissue or blood samples are available, and basic and follow-up data of the patients are already documented in a database.
Table 1 Requirements for Confirmatory Prognostic Factor Studies According to Simon and Altman (4)

1. Documentation of intra- and interlaboratory reproducibility of assays
2. Blinded conduct of laboratory assays
3. Definition and description of a clear inception cohort
4. Standardization or randomization of treatment
5. Detailed statement of hypotheses (in advance)
6. Justification of sample size based on power calculations
7. Analysis of additional prognostic value beyond standard prognostic factors
8. Adjustment of analyses for multiple testing
9. Avoidance of outcome-orientated cutoff values
10. Reporting of confidence intervals for effect estimates
11. Demonstration of subset-specific treatment effects by an appropriate statistical test

To meet the requirements listed in Table 1 in such a situation, it is clear that inclusion and exclusion criteria have to be applied carefully. In particular, treatment has to be given in a standardized manner, at least to some sufficient extent; otherwise, patients for whom these requirements are not fulfilled have to be excluded from the study. If the requirements are followed consistently, this will usually lead to a drastic reduction in the number of patients eligible for the study compared with the number of patients originally available in the database. In addition, follow-up data are often not of the quality that would be expected in a well-conducted clinical trial or prospective study. Thus, if this design is applied, special care is necessary to arrive at correct and reproducible results regarding the role of potential prognostic factors. The three types of design described above are also represented by the three prognostic studies in breast cancer that we use as illustrative examples and that are dealt with in more detail in the next section.

It is interesting to note that other types of designs (e.g., nested case-control studies, case-cohort studies, or other study types often used in epidemiology [27]) have only rarely been used for the investigation of prognostic factors. Their role and their potential use for prognostic factor research have not yet been fully explored.

There is one situation where the randomized controlled clinical trial should be the design of choice: the investigation of so-called predictive factors, which indicate whether a specific treatment works in a subgroup of patients defined by the predictive factor but not in another subgroup, where it may even be harmful. Since this is clearly an investigation of treatment-covariate interactions, it should ideally be performed in the setting of a large-scale randomized trial in which information on the potential predictive factor is recorded and analyzed by means of appropriate statistical methods (28–31).
III. EXAMPLES: PROGNOSTIC STUDIES IN BREAST CANCER

A. Freiburg DNA Study
The database of the first study consisted of all patients with primary previously untreated node-positive breast cancer who were operated on between 1982 and 1987 in the Department of Gynecology at the University of Freiburg and whose tumor material was available for DNA investigations. Some exclusion criteria (history of malignoma, T4 and/or M1 tumors according to the TNM classification system of the International Union Against Cancer (13), no adjuvant therapy after primary surgery, older than 80 years, etc.) were defined retrospectively. This left 139 of the 218 patients originally investigated for the analysis. This study is referred to as the Freiburg DNA study.

Eight patient characteristics were investigated. Besides age, number of positive lymph nodes, and size of the primary tumor, the grading score according to Bloom and Richardson (32) and estrogen and progesterone receptor status were recorded. DNA flow cytometry was used to measure the ploidy status of the tumor (using a cutpoint of 1.1 for the DNA index) and the S-phase fraction, which is the percentage of tumor cells in the DNA-synthesizing phase obtained by cell cycle analysis. The distribution of these characteristics in the patient population is shown in Table 2A. The median follow-up was 83 months. At the time of analysis, 76 events were observed for event-free survival, which was defined as the time from surgery to the first of the following events: occurrence of locoregional recurrence, distant metastasis, second malignancy, or death. Event-free survival was estimated as 50% after 5 years. Further details of the study, which we are using solely for illustrative purposes, can be found elsewhere (33).
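The event-free survival probabilities quoted throughout this chapter are Kaplan-Meier (product-limit) estimates. As a minimal sketch (not the authors' code; the five patients below are invented), the estimator multiplies conditional survival probabilities over the observed event times:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns (time, S(time)) pairs at the distinct event times.
    """
    s, curve = 1.0, []
    for t in sorted(set(u for u, e in zip(times, events) if e)):
        n_at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        s *= 1.0 - deaths / n_at_risk  # conditional survival at time t
        curve.append((t, s))
    return curve

# Invented toy data: five patients, censored at times 4 and 10.
km = kaplan_meier([2, 4, 6, 8, 10], [1, 0, 1, 1, 0])
```

Censored observations leave the risk set without contributing a death, which is what distinguishes the estimate from a simple empirical proportion.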
Table 2A Patient Characteristics in the Freiburg DNA Breast Cancer Study

Factor                         Category     n    (%)
Age                            ≤50 yr       52   (37)
                               >50 yr       87   (63)
No. of positive lymph nodes    1–3          66   (48)
                               4–9          42   (30)
                               ≥10          31   (22)
Tumor size                     ≤2 cm        25   (19)
                               2–5 cm       73   (54)
                               >5 cm        36   (27)
                               Missing       5
Tumor grade                    1             3    (2)
                               2            81   (59)
                               3            54   (39)
                               Missing       1
Estrogen receptor              ≤20 fmol     32   (24)
                               >20 fmol     99   (76)
                               Missing       8
Progesterone receptor          ≤20 fmol     34   (26)
                               >20 fmol     98   (74)
                               Missing       7
Ploidy status                  Diploid      61   (44)
                               Aneuploid    78   (56)
S-phase fraction               <3.1         27   (25)
                               3.1–8.4      55   (50)
                               >8.4         27   (25)
                               Missing      30
B. GBSG-2 Study
The second study is a prospective, controlled clinical trial on the treatment of node-positive breast cancer patients conducted by the German Breast Cancer Study Group (GBSG) (34); it is referred to as the GBSG-2 study. The principal eligibility criterion was a histologically verified primary breast cancer of stage T1a-3a N+ M0, that is, with positive regional lymph nodes but no distant metastases. Primary local treatment was by a modified radical mastectomy (Patey) with en bloc axillary dissection with at least six identifiable lymph nodes. Patients were not older than 65 years of age and presented with a Karnofsky index of at least 60. The study was designed as a comprehensive cohort study (35); that is, randomized and nonrandomized patients who fulfilled the entry criteria were included and followed according to the study procedures. The study had a 2 × 2 factorial design with four adjuvant treatment arms: three versus six cycles of chemotherapy, with and without hormonal treatment.

Prognostic factors evaluated in the trial were patient age, menopausal status, tumor size, estrogen and progesterone receptor, tumor grading according to Bloom and Richardson (32), histological tumor type, and number of involved lymph nodes. Histopathological classification was reexamined, and grading was performed centrally by one reference pathologist for all cases. Event-free survival was defined as the time from mastectomy to the first occurrence of either locoregional or distant recurrence, contralateral tumor, secondary tumor, or death. During 6 years, 720 patients were recruited, of whom about two thirds were randomized. Complete data on the seven standard prognostic factors given in Table 2B were available for 686 patients (95.3%), who were taken as the basic patient population for this study. After a median follow-up of nearly 5 years, 299 events for event-free survival and 171 deaths were observed. Event-free survival was about 50% at 5 years. The data of this study as used in this chapter are available from http://www.blackwellpublishers.co.uk/rss/.

Table 2B Patient Characteristics in the GBSG-2 Study

Factor                         Category     n     (%)
Age                            ≤45 yr       153   (22)
                               46–60 yr     345   (50)
                               >60 yr       188   (27)
Menopausal status              Pre          290   (42)
                               Post         396   (58)
Tumor size                     ≤20 mm       180   (26)
                               21–30 mm     287   (42)
                               >30 mm       219   (32)
Tumor grade                    1             81   (12)
                               2            444   (65)
                               3            161   (24)
No. of positive lymph nodes    1–3          376   (55)
                               4–9          207   (30)
                               ≥10          103   (15)
Progesterone receptor          <20 fmol     269   (39)
                               ≥20 fmol     417   (61)
Estrogen receptor              <20 fmol     262   (38)
                               ≥20 fmol     424   (62)

C. GBSG-4 Study

As a third example we use data from a prospective study in node-negative breast cancer conducted by the GBSG (36), referred to as the GBSG-4 study. From 1984 to 1989, 662 patients were enrolled in the study, all having mastectomy
Table 2C Patient Characteristics in the GBSG-4 Study

Factor                  Category                     n     (%)
Age                     ≤40 yr                        62   (10)
                        >40 yr                       541   (90)
Menopausal status       Pre                          215   (36)
                        Post                         388   (64)
Tumor size              ≤10 mm                        45    (7)
                        11–20 mm                     236   (39)
                        21–30 mm                     236   (39)
                        31–50 mm                      74   (12)
                        >50 mm                        12    (2)
Estrogen receptor       <20 fmol                     270   (45)
                        20–49 fmol                    98   (16)
                        50–299 fmol                  181   (30)
                        ≥300 fmol                     54    (9)
Progesterone receptor   <20 fmol                     283   (47)
                        20–49 fmol                    81   (13)
                        50–299 fmol                  175   (29)
                        ≥300 fmol                     64   (11)
Tumor grade             1                            136   (23)
                        2                            325   (54)
                        3                            142   (24)
Histologic tumor type   Solid                        300   (50)
                        Invasive ductal or lobular   124   (21)
                        Others                       179   (30)
and one cycle of chemotherapy given perioperatively as standardized treatment. Age, menopausal status, tumor size, tumor grade, histological tumor type, and estrogen and progesterone receptor were recorded as prognostic factors. We restrict ourselves to the 603 patients with complete data on the seven prognostic factors considered. The distribution of these factors is summarized in Table 2C. Median follow-up is about 5 years; the end point of primary interest is event-free survival, defined as the time from treatment to the first of the following events: locoregional recurrence, distant metastasis, second cancer, or death. There have been 155 events observed so far; the Kaplan-Meier estimate of event-free survival at 5 years is 0.73.
IV. CUTPOINT MODEL

In prognostic factor studies, values of the factors considered are often categorized in two or three categories. This may sometimes be done for medical or biological reasons or may just reflect some consensus in the scientific community. When a "new" prognostic factor is investigated, the choice of such a categorization, represented by one or more cutpoints, is by no means obvious. Thus, an attempt is often made to derive such cutpoints from the data and to take those cutpoints that give the best separation in the data at hand. In the Freiburg DNA breast cancer study we consider the S-phase fraction (SPF) as a "new" prognostic factor, although it had in fact already been under investigation for some years (11).

For simplicity, we restrict ourselves to the problem of selecting only one cutpoint and to a so-called univariate analysis. This means that we consider only one covariate Z (in the Freiburg DNA breast cancer data, the SPF) as a potential prognostic factor. If this covariate has been measured on a quantitative scale, the proportional hazards (37) cutpoint model is defined as

λ(t | Z > µ) = exp(β) λ(t | Z ≤ µ), t > 0,

where λ(t | ·) = lim_{h→0} (1/h) Pr(t ≤ T < t + h | T ≥ t, ·) denotes the hazard function of the event-free survival time random variable T. The parameter θ = exp(β) is referred to as the relative risk of observations with Z > µ with respect to observations with Z ≤ µ and is estimated through θ̂ = exp(β̂) by maximizing the corresponding partial likelihood (37) with given cutpoint µ.

The fact that µ is usually unknown makes this a problem of model selection in which the cutpoint µ has to be estimated from the data as well. A popular approach for such a data-dependent categorization is the so-called minimum p value method, in which the cutpoint µ̂ is chosen, within a certain range of the distribution of Z (the selection interval), such that the p value for the comparison of observations below and above the cutpoint is a minimum. Applying this method to SPF in the Freiburg DNA breast cancer data, we obtain, based on the logrank test, a cutpoint of µ̂ = 10.7 and a minimum p value of p_min = 0.007 when using the range between the 10% and 90% quantiles of the distribution of Z as the selection interval. Figure 1A shows the resulting p values as a function of the possible cutpoints considered; Figure 1B displays the Kaplan-Meier estimates of the event-free survival functions of the groups defined by the estimated cutpoint µ̂ = 10.7. The difference in event-free survival looks rather impressive, and the estimated relative risk with respect to the dichotomized covariate I(Z > µ̂) using the "optimal" cutpoint µ̂ = 10.7, θ̂ = 2.37, is quite large; the corresponding 95% confidence interval is [1.27; 4.44].

Figure 1 p Values of the logrank test as a function of all possible cutpoints for S-phase fraction (A) and Kaplan-Meier estimates of event-free survival probabilities by S-phase fraction (B) in the Freiburg DNA study.

Simulating the null hypothesis of no prognostic relevance of SPF with respect to event-free survival (β = 0), we illustrate that the minimum p value method may lead to a drastic overestimation of the absolute value of the log-relative risk (38). By a random allocation of the observed values of SPF to the observed survival times, we simulate independence of these two variables, which is equivalent to the null hypothesis β = 0. This procedure was repeated 100 times, and in each repetition we selected a cutpoint by the minimum p value method; such a cutpoint is often also referred to as an "optimal" cutpoint. In the 100 repetitions, we obtained 45 significant (p_min < 0.05) results for the logrank test, corresponding well to theoretical results as outlined in Lausen and Schumacher (39). The estimated optimal cutpoints of the 100 repetitions and the corresponding estimates of the log-relative risk are shown in Figure 2A. We obtained no estimates near the null hypothesis β = 0, a result of the optimization process of the minimum p value approach. Because of the well-known problems resulting from multiple testing, it is obvious that the minimum p value method cannot lead to correct results of the logrank test. However, this problem can be solved by using a corrected p value p_cor as proposed in Lausen and Schumacher (39), which has been developed by taking the minimization process into account. The formula reads
p_cor = ϕ(u) (u − 1/u) log[(1 − ε)²/ε²] + 4ϕ(u)/u,
where ϕ denotes the standard normal density function and u is the (1 − p_min/2) quantile of the standard normal distribution. The selection interval is characterized by the proportion ε of smallest and largest values of Z that are not considered as potential cutpoints. It should be mentioned that other approaches to correcting the minimum p value could be applied; a comparison of three approaches can be found in an article by Hilsenbeck and Clark (40). Especially if there are only a few cutpoints, an improved Bonferroni inequality can be applied (41–43). Using the correction formula in the 100 repetitions of our simulation experiment, we obtained four significant results (p_cor < 0.05), corresponding well to the significance level of α = 0.05. Four significant results were also obtained with the usual p value when the median of the empirical distribution of SPF in the original data was used as a fixed cutpoint in all repetitions.

Figure 2 Estimates of cutpoints and log-relative risks in 100 repetitions of randomly allocated observed SPF values to event-free survival times in the Freiburg DNA study before (A) and after (B) correction.

To correct for overestimation, a so-called shrinkage factor has been proposed (44) to shrink the parameter estimates. Considering the cutpoint model, the log-relative risk should then be estimated by β̂_cor = ĉ · β̂, where β̂ is based on the minimum p value method and ĉ is the estimated shrinkage factor. Values of ĉ close to one indicate a minor degree of overestimation, whereas small values of ĉ reflect a substantial overestimation of the log-relative risk. Obviously, with maximum partial likelihood estimation of c in a model λ(t | SPF > µ) = exp(cβ̂) λ(t | SPF ≤ µ) using the original data, we get ĉ = 1, since β̂ is the maximum partial likelihood estimate. Recently, Schumacher et al. (45) compared several methods of estimating the shrinkage factor. In Figure 2B the results of the correction process in the 100 simulated studies are displayed when the heuristic estimate ĉ = (β̂² − var(β̂))/β̂² was applied, where β̂ and var(β̂) result from the minimum p value method (46). This heuristic estimate performed quite well when compared with more elaborate cross-validation and resampling approaches (45).

In general, it has to be recognized that the minimum p value method leads to a dramatic inflation of the type I error rate; the chance of declaring a quantitative factor prognostically relevant when in fact it has no influence on event-free survival is about 50% when a level of 5% was intended.
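Both the minimum p value search and its correction are easy to sketch in code. The following pure-Python illustration (not the authors' software) uses the standard log-rank statistic; the scanned data set is invented, and only the correction function reproduces a number from the text, namely p_min = 0.007 with ε = 0.1 giving p_cor ≈ 0.123:

```python
import math
from statistics import NormalDist

def logrank_p(times, events, group):
    """Two-sided log-rank test p value for two groups (pure Python)."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted(set(u for u, e in zip(times, events) if e)):
        risk = [g for g, u in zip(group, times) if u >= t]
        n, n1 = len(risk), sum(risk)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(1 for u, e, g in zip(times, events, group) if u == t and e and g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    z = o_minus_e / math.sqrt(var)           # assumes var > 0
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation

def minimum_p(times, events, z_values, eps=0.1):
    """Minimum p value over all cutpoints in the central 1 - 2*eps range."""
    zs = sorted(z_values)
    lo, hi = zs[int(eps * len(zs))], zs[int((1 - eps) * len(zs)) - 1]
    best_p, best_mu = 1.0, None
    for mu in (z for z in zs if lo <= z < hi):
        p = logrank_p(times, events, [1 if z > mu else 0 for z in z_values])
        if p < best_p:
            best_p, best_mu = p, mu
    return best_p, best_mu

def p_corrected(p_min, eps):
    """Lausen-Schumacher correction of the minimum p value."""
    u = NormalDist().inv_cdf(1 - p_min / 2)
    phi = math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return phi * (u - 1 / u) * math.log((1 - eps) ** 2 / eps ** 2) + 4 * phi / u

# Invented toy data: survival time increases with Z, so mid-range cutpoints
# separate the groups well and the scanned minimum p value is very small.
times = list(range(1, 21))
events = [1] * 20
p_min, cutpoint = minimum_p(times, events, times)
```

The point of the illustration is that p_min is an extreme-value statistic over many correlated tests, which is exactly why the uncorrected value overstates significance.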
Thus, correction of p values is essential but leaves the problem of overestimation of the relative risk in absolute terms. The latter problem, which is especially relevant when sample sizes and/or effect sizes are of small or moderate magnitude, can at least partially be solved by applying some shrinkage method. It should be noted, however, that the optimal cutpoint approach has further disadvantages. One of these is that in almost every study where this method is applied, a different cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. (11) pointed out this problem for studies of the prognostic relevance of SPF in breast cancer published in the literature. They identified 19 different cutpoints used in the literature, some of which were used solely because they emerged as the optimal cutpoint in a specific data set. Thus, other approaches, such as regression modeling, might be preferred.

In the Freiburg DNA breast cancer data, we obtain a corrected p value of p_cor = 0.123, which provides no clear indication that S-phase is of prognostic relevance for node-positive breast cancer patients. The correction of the relative risk estimate by applying a shrinkage factor leads to a value of θ̂_cor = 2.1 for the heuristic method and to θ̂_cor = 2.0 for the cross-validation and bootstrap approaches. Unfortunately, confidence intervals are not straightforward to obtain; bootstrapping the whole model-building process, including the estimation of a shrinkage factor, would be one possibility. In contrast, taking S-phase as a continuous covariate with an assumed log-linear relationship in a conventional Cox regression model, λ(t | Z) = λ0(t) exp(β̃Z), leads to a p value of p = 0.061 for testing the null hypothesis β̃ = 0. For comparison, the estimated log-relative risks for both approaches are displayed in Figure 3.
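The heuristic shrinkage correction can be reproduced numerically. In the sketch below, var(β̂) is reconstructed from the reported 95% confidence interval [1.27; 4.44], which is our own shortcut rather than the authors' calculation, so the result is only approximate:

```python
import math

# Heuristic shrinkage factor c = (b^2 - var(b)) / b^2 for the Freiburg SPF
# result. var(b) is backed out of the reported 95% confidence interval,
# an approximation introduced here for illustration.
theta_hat = 2.37                  # relative risk at the "optimal" cutpoint
b = math.log(theta_hat)           # estimated log-relative risk
se = (math.log(4.44) - math.log(1.27)) / (2 * 1.96)
c = (b ** 2 - se ** 2) / b ** 2   # estimated shrinkage factor, below 1
theta_cor = math.exp(c * b)       # shrunken relative risk, about 2.1
```

The shrunken estimate stays well above 1, so the correction tempers, rather than removes, the apparent effect.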
Figure 3 Log-relative risk for S-phase fraction in the Freiburg DNA study estimated by the minimum p value method, before and after correction, and by a Cox model assuming a log-linear relationship.
V. REGRESSION MODELING AND RELATED ASPECTS
The standard tool for analyzing the prognostic relevance of various factors (in more technical terms, usually called covariates) is the Cox proportional hazards regression model (37,47). If we denote the prognostic factors under consideration by Z1, Z2, . . . , Zk, then the model is given by

λ(t | Z1, Z2, . . . , Zk) = λ0(t) exp(β1Z1 + β2Z2 + . . . + βkZk),

where λ(t | ·) denotes the hazard function of the event-free or overall survival time random variable T and λ0(t) is the unspecified baseline hazard. The estimated log-relative risks β̂j can then be interpreted as estimated "effects" of the factors Zj (j = 1, . . . , k). If Zj is measured on a quantitative scale, then exp(β̂j) represents the increase or decrease in risk if Zj is increased by one unit; if Zj is a binary covariate, then exp(β̂j) is simply the relative risk of category 1 relative to the reference category (Zj = 0), which is assumed to be constant over the time range considered.

It has to be noted that the "final" multivariate regression model is often the result of a more or less extensive model-building process that may involve the categorization and/or transformation of covariates and the selection of variables in an automatic or a subjective manner. This model-building process should in principle be taken into account when judging the results of a prognostic study; in practice, however, it is often neglected. We come back to this problem on several occasions below, especially in Sections VII and VIII.

We demonstrate the various approaches with the data of the GBSG-2 study. The factors listed in Table 2B are investigated with regard to their prognostic relevance. Since all patients received adjuvant chemotherapy in a standardized manner and no difference appeared between three and six cycles (34), chemotherapy is not considered any further.
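The partial likelihood maximization behind such estimates can be sketched for a single covariate. The following pure-Python Newton-Raphson fit on invented data is an illustration only; real analyses would of course use a statistical package:

```python
import math

def cox_fit_single(times, events, z, iters=30):
    """Newton-Raphson maximization of the Cox partial likelihood for one
    covariate (pure-Python sketch; the toy data below have no tied times)."""
    n = len(times)
    b = 0.0
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue
            risk = [j for j in range(n) if times[j] >= times[i]]
            w = [math.exp(b * z[j]) for j in risk]
            sw = sum(w)
            m1 = sum(wj * z[j] for wj, j in zip(w, risk)) / sw
            m2 = sum(wj * z[j] ** 2 for wj, j in zip(w, risk)) / sw
            score += z[i] - m1      # observed z minus weighted risk-set mean
            info += m2 - m1 ** 2    # weighted risk-set variance of z
        b += max(-1.0, min(1.0, score / info))  # damped Newton step
    return b

# Invented data: subjects with z = 1 tend to fail earlier, so beta > 0.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
z = [1, 1, 0, 1, 0, 1, 0, 0]
beta_hat = cox_fit_single(times, events, z)
relative_risk = math.exp(beta_hat)  # estimated relative risk exp(beta)
```

The baseline hazard λ0(t) drops out of the partial likelihood, which is why the code never needs to specify it.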
Because of the patients' preference in the nonrandomized part and because of a change in the study protocol concerning premenopausal patients, only about a third of the patients received hormonal treatment. Age and menopausal status had a strong influence on whether this therapy was administered. Therefore, all analyses were adjusted for hormonal treatment. Since the impact of hormonal treatment is not of primary interest in this prognostic study, this was done by using a Cox regression model stratified for hormonal treatment; that is, the baseline hazard is allowed to vary between the two strata while the regression coefficients of the other factors are kept constant over strata.

In a first attempt, all quantitative factors are included as continuous covariates assuming a log-linear relationship. Age is taken in years, tumor size in mm, and so on; menopausal status is a binary covariate per se, and we coded "0" for premenopausal and "1" for postmenopausal patients. Grade is considered a quantitative covariate in this approach, that is, the risk between grade
categories 1 and 2 is the same as between grade categories 2 and 3. The results of this Cox regression model are given in Table 3 in terms of estimated relative risks and p values of the corresponding Wald tests under the heading "full model." In a publication, these should at least be accompanied by confidence intervals for the relative risks, which we have omitted here in order not to present too many numbers. From this full model it can be seen that tumor size, tumor grade, number of positive lymph nodes, and the progesterone receptor have a significant impact on event-free survival when a significance level of 5% is used. Age, menopausal status, and the estrogen receptor do not exhibit prognostic relevance.

The full model has the advantage that the regression coefficients of the factors considered can be estimated in an unbiased fashion; it is, however, hampered by the fact that the assumed log-linear relationship for quantitative factors may be in sharp contrast to the real situation and that irrelevant factors are also included that will not be needed in subsequent steps, for example, in the formation of risk groups defined by the prognostic factors. In addition, correlation between various factors may lead to undesirable statistical properties of the estimated regression coefficients, such as inflation of standard errors or problems of instability caused by multicollinearity. It is therefore desirable to arrive at a simple and parsimonious "final model" that contains only those prognostic factors that strongly affect event-free survival (48). The three other columns of Table 3 contain the results of the Cox regression models obtained after backward elimination (BE) for three different selection levels (49).
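The backward elimination scheme itself is generic and can be sketched compactly. In the illustration below the p values are frozen at their full-model values from Table 3, a deliberate simplification: in the real procedure the Cox model is refitted, and the p values change, after each elimination. That refitting is exactly why tumor size (p = 0.049 in the full model) is dropped at the 5% level in Table 3, whereas it survives in this frozen-p sketch; at the 1% level the sketch happens to reproduce the Table 3 selection.

```python
def backward_eliminate(variables, p_value, alpha=0.05):
    """Generic backward elimination: repeatedly drop the least significant
    variable until every remaining p value is below the selection level."""
    current = list(variables)
    while current:
        worst = max(current, key=lambda v: p_value(v, current))
        if p_value(worst, current) < alpha:
            break
        current.remove(worst)
    return current

# Full-model p values from Table 3, frozen across steps (a simplification;
# in the chapter the model is refitted after each elimination).
p_full = {"age": 0.31, "menopausal status": 0.14, "tumor size": 0.049,
          "tumor grade": 0.009, "lymph nodes": 0.0001,
          "progesterone receptor": 0.0001, "estrogen receptor": 0.67}

def lookup(v, current):
    return p_full[v]

kept_05 = backward_eliminate(list(p_full), lookup, alpha=0.05)
kept_01 = backward_eliminate(list(p_full), lookup, alpha=0.01)
```

Passing the p-value rule in as a function keeps the elimination loop independent of the underlying model, so the same skeleton applies whether the Wald tests come from a Cox fit or any other regression.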
For selection of a single factor, backward elimination with a selection level of 15.7% (BE(0.157)) corresponds asymptotically to the well-known Akaike information criterion, whereas selection levels of 5% or even 1% lead to a more stringent selection of factors (50). In general, backward elimination can be recommended because of several advantages compared with other stepwise variable selection procedures (48,51,52). In the GBSG-2 study, tumor grade, lymph nodes, and progesterone receptor are selected for all three selection levels considered; when using 15.7% as the selection level, tumor size is included in addition. Thus, the results of the full model and the three backward elimination procedures do not differ too much in these particular data; this, however, should not be expected in general. One reason might be that there is a relatively clear-cut difference between three strong factors (and tumor size, which seems to have a borderline influence) and the others, which show only a negligible influence on event-free survival in this study.

Table 3 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study; Quantitative Prognostic Factors Are Taken as Continuous Covariates Assuming a Log-Linear Relationship

                         Full model        BE(0.157)         BE(0.05)          BE(0.01)
Factor                   RR     p          RR     p          RR     p          RR     p
Age                      0.991  0.31       —      —          —      —          —      —
Menopausal status        1.310  0.14       —      —          —      —          —      —
Tumor size               1.008  0.049      1.007  0.061      —      —          —      —
Tumor grade              1.321  0.009      1.325  0.008      1.340  0.006      1.340  0.006
Lymph nodes              1.051  <0.001     1.051  <0.001     1.057  <0.001     1.057  <0.001
Progesterone receptor    0.998  <0.001     0.998  <0.001     0.998  <0.001     0.998  <0.001
Estrogen receptor        1.000  0.67       —      —          —      —          —      —

The previous approach implicitly assumes that the influence of a prognostic factor on the hazard function follows a log-linear relationship. Taking lymph nodes as the covariate Z, for example, this means that the risk is increased by the factor exp(β̂) if the number of positive lymph nodes is increased from l to l + 1, for l = 1, 2, . . . . This could be a questionable assumption, at least for large numbers of positive lymph nodes. For other factors even monotonicity of the log-relative risk may be violated, which could result in overlooking an important prognostic factor. Because of this uncertainty, the prognostic factors under consideration are often categorized, and so-called dummy variables for the different categories are defined, so that the categorized factors are not treated as quantitative covariates. In the GBSG-2 study, the categorization presented in Table 2B is used, which was specified independently of the specific data set in accordance with the literature (34). For those factors with three categories, two binary dummy variables were defined contrasting the corresponding category with the reference category, chosen as the one with the lowest values. So, for example, lymph nodes were categorized into 1–3, 4–9, and ≥10 positive nodes; 1–3 positive nodes serves as the reference category. Table 4 displays the results of the Cox regression model for the categorized covariates; again, the results of the full model are supplemented by those obtained after backward elimination with three selection levels. Elimination of only one dummy variable corresponding to a factor with three categories would correspond to an amalgamation of categories (8). In these analyses, where tumor grade, lymph nodes, and progesterone receptor again show the strongest effects, age and menopausal status are also marginally significant and are included in the model by backward elimination with a selection level of 15.7%. For age, there is some indication that linearity or even monotonicity of the log-relative risk may be violated. Grade categories 2 and 3 do not seem as well separated as is suggested by the previous approach presented in Table 3, where grade was treated as an ordinal covariate; the latter would lead to estimated relative risks of 1.321 and 1.745 = (1.321)² for grade 2 and grade 3, respectively, in contrast to values of 1.723 and 1.746 when using dummy variables. The use of dummy variables may also be the reason that grade is no longer included by backward elimination with a selection level of 1%.
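The constraint imposed by treating grade as ordinal, and its release through dummy coding, can be made concrete; the small helper below is our own illustration, not code from the study:

```python
# Treating grade as an ordinal covariate with one coefficient forces a
# multiplicative pattern across categories (full model of Table 3):
rr_per_step = 1.321           # estimated relative risk per grade step
rr_grade2 = rr_per_step       # grade 2 vs. grade 1
rr_grade3 = rr_per_step ** 2  # grade 3 vs. grade 1, forced to be the
                              # square of the grade 2 risk: about 1.745

# Dummy coding removes this constraint: each non-reference category gets
# its own indicator (here the Table 2B lymph node categories, with 1-3
# positive nodes as the reference).
def lymph_node_dummies(positive_nodes):
    """Indicators for the 4-9 and >=10 categories."""
    return (1 if 4 <= positive_nodes <= 9 else 0,
            1 if positive_nodes >= 10 else 0)
```

A patient in the reference category gets (0, 0), so exp of the corresponding coefficients directly yields the category-specific relative risks of Table 4.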
Table 4 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study; Prognostic Factors Are Categorized as in Table 2B

                               Full model       BE(0.157)        BE(0.05)         BE(0.01)
Factor / Category              RR     p         RR     p         RR     p         RR     p
Age ≤45 yr                     1      —         1      —         —      —         —      —
Age 46–60 yr                   0.672  0.026     0.679  0.030     —      —         —      —
Age >60 yr                     0.687  0.103     0.692  0.108     —      —         —      —
Menopausal status: pre         1      —         1      —         —      —         —      —
Menopausal status: post        1.307  0.120     1.304  0.120     —      —         —      —
Tumor size ≤20 mm              1      —         —      —         —      —         —      —
Tumor size 21–30 mm            1.240  0.165     —      —         —      —         —      —
Tumor size >30 mm              1.316  0.089     —      —         —      —         —      —
Tumor grade 1                  1      —         1      —         1      —         —      —
Tumor grade 2                  1.723  0.031     1.718  0.032     1.709  0.033     —      —
Tumor grade 3                  1.746  0.045     1.783  0.036     1.778  0.037     —      —
Lymph nodes 1–3                1      —         1      —         1      —         1      —
Lymph nodes 4–9                1.976  <0.001    2.029  <0.001    2.071  <0.001    2.110  <0.001
Lymph nodes ≥10                3.512  <0.001    3.687  <0.001    3.661  <0.001    3.741  <0.001
Progesterone rec. <20 fmol     1      —         1      —         1      —         1      —
Progesterone rec. ≥20 fmol     0.545  <0.001    0.545  <0.001    0.536  <0.001    0.494  <0.001
Estrogen rec. <20 fmol         1      —         —      —         —      —         —      —
Estrogen rec. ≥20 fmol         0.994  0.97      —      —         —      —         —      —

In Table 4 we give the p values of the Wald tests for the two dummy variables separately; alternatively, we could also test the two-dimensional vector of the corresponding regression coefficients to be zero. In any case this needs two degrees of freedom, whereas when treating grade as a quantitative covariate, one degree of freedom would be sufficient. The data of the GBSG-2 study suggest that grade categories 2 and 3 could be amalgamated into one category (grade 2–3); this would lead to an estimated relative risk of 1.728 and a corresponding p value of 0.019.

The results of the two approaches presented in Tables 3 and 4 show that model building within the framework of a prognostic study has to find a compromise between sufficient flexibility with regard to the functional shape of the underlying log-relative risk functions and simplicity of the derived model, in order to avoid serious overfitting and instability. From this point of view, the first approach, assuming all relationships to be log-linear, may not be flexible enough and may not capture important features of the relationship between various prognostic factors and event-free survival. On the other hand, the categorization used in the second approach can always be criticized because of some degree of arbitrariness and subjectivity concerning the number of categories and the specific cutpoints chosen. In addition, it will not fully exploit the information available and will be associated with some loss in efficiency. For a more flexible modeling of the functional relationship, a larger number of cutpoints and corresponding dummy variables would be needed.

We will therefore sketch a third approach that provides more flexibility while preserving simplicity of the final model to an acceptable degree. The method was originally developed by Royston and Altman (53) and has been termed the "fractional polynomial" (FP) approach. For a quantitative covariate Z it uses functions β0 + β1Z^p + β2Z^q to model the log-relative risk; the powers p and q are taken from the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, and Z^0 is defined as log Z. This simple extension of ordinary polynomials generates a considerable range of curve shapes while still preserving simplicity when compared with smoothing splines or other nonparametric techniques, for example. Sauerbrei and Royston (54) extended the proposed multivariate FP approach to a model-building strategy covering transformation and selection of variables. Without going into the details of this model-building process, which are reported elsewhere (54,55), we summarize the results in Table 5. For age, the powers −2 and −0.5 have been estimated and provide significant contributions to the log-relative risk function. This function is displayed in Figure 4A in comparison with the corresponding functions derived from the two other approaches. It provides some further indication of a nonmonotonic relationship that would be overlooked by the log-linear approach. Grade categories 2 and 3 have been amalgamated, as pointed out above. For lymph nodes a further restriction has been incorporated by assuming that the relationship should be monotone with an asymptote for large numbers of positive nodes.
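Using the age coefficients reported in Table 5, the estimated FP function for age can be evaluated directly. Note that a log-relative risk function is defined only up to an additive constant, so only comparisons between ages are meaningful:

```python
# Estimated FP log-relative risk function for age, built from the Table 5
# coefficients (powers -2 and -0.5, age scaled by 50).
def log_rr_age(age):
    x = age / 50.0
    return 1.742 * x ** -2 - 7.812 * x ** -0.5

# Evaluating at three ages exposes the non-monotone (U-shaped) pattern:
# the estimated risk is higher for young and old patients than in between.
risk_30, risk_50, risk_70 = log_rr_age(30), log_rr_age(50), log_rr_age(70)
```

A single log-linear term in age could never produce this U-shape, which is the point made in the text about the relationship being overlooked by the log-linear approach.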
This was achieved by using the simple primary transformation exp(⫺0.12 ⋅ lymph nodes) where the factor 0.12 was estimated from the data (54). The estimated power for this transformed variable was equal to one and a second power was not needed. Likewise, for progesterone receptor, a power of 0.5 was estimated that gives a significant contri-
Table 5 Estimated Regression Coefficients and Corresponding p Values in the Final Cox Regression Model for the GBSG-2 Study Using the Fractional Polynomial Approach

Factor/function                      Regression coefficient    p Value
(Age/50)^(−2)                         1.742                    <0.001
(Age/50)^(−0.5)                      −7.812                    <0.001
Tumor grade 1                         0                        —
Tumor grade 2–3                       0.517                    0.026
exp(−0.12 · Lymph nodes)             −1.981                    <0.001
(Progesterone receptor + 1)^(0.5)    −0.058                    <0.001
Prognostic Factor Studies
339
Figure 4 Estimated log-relative risk functions for age (A), lymph nodes (B), and progesterone receptor (C) obtained by the FP, categorization and log-linear approach in the GBSG-2 study.
Figure 4 shows these functions for lymph nodes (B) and progesterone receptor (C) in comparison with those derived from the log-linear and the categorization approaches. For lymph nodes, it suggests that the log-linear approach underestimates the increase in risk for small numbers of positive nodes, whereas it substantially overestimates it for very large numbers. The categorization approach seems to provide a reasonable compromise for this factor. At the end of this section, some general comments are in order. First, there is a variety of other flexible methods available that have not been presented here. Sauerbrei and Royston (54) provide a graphical comparison of the FP approach with generalized additive models (56) for the log-relative risk functions displayed in Figure 4, A–C. Various other nonparametric methods could in principle be used. It should be stressed, however, that there must always be a compromise
340
Schumacher et al.
between flexibility and simplicity and that simple models have the additional advantage that they can be interpreted by clinical colleagues more easily. The aspect of model complexity is discussed in more detail by Sauerbrei (48). Second, the analyses presented for the GBSG-2 study concentrated on the Cox regression model, the standard statistical tool for prognostic studies and without any doubt the one most commonly used. This, however, has important consequences: model checking with regard to the assumptions of this model has to be undertaken carefully. Some special aspects have already been addressed above (e.g., the log-linear relationship in a ‘‘standard’’ Cox model); numerous others can be found in textbooks and review articles on survival analysis (14,57,58). If important assumptions appear to be seriously violated, extensions of the Cox model (e.g., with time-varying regression coefficients) or other models should be taken into consideration; some alternative approaches are discussed in Sections VI and VIII. Third, when dealing with prognostic factor studies, features other than the fulfillment of model assumptions become more important. One is stability, which addresses the question of whether the selected final model could be replicated with different data. Bootstrap resampling has been applied to investigate the stability of the selected ‘‘final model’’ (59–61). In each bootstrap sample, the whole model selection or building process is repeated, and the results are summarized over the bootstrap samples. We illustrate this procedure for backward elimination with a selection level of 5% in the Cox regression model with quantitative factors included as continuous covariates (Table 3). For simplicity, we do not consider the selection process including transformation of covariates as used by Sauerbrei and Royston (54). In Table 6, the inclusion frequencies over 1000 bootstrap samples are given for the prognostic factors under consideration.
These frequencies underline that tumor grade, lymph nodes, and progesterone receptor
Table 6 Inclusion Frequencies over 1000 Bootstrap Samples Using the Backward Elimination Method (BE(0.05)) with a Selection Level of 5% in the GBSG-2 Study

Factor                   Inclusion frequency
Age                      18.2%
Menopausal status        28.8%
Tumor size               38.1%
Tumor grade              62.3%
Lymph nodes              100%
Progesterone receptor    98.1%
Estrogen receptor        8.1%
are by far the strongest factors; lymph nodes are always included, progesterone receptor in 98% and tumor grade in 62% of the bootstrap samples, respectively. The percentage of bootstrap samples where exactly this model—containing these three factors only—is selected is 26.1%. In 60.4% of the bootstrap samples a model is selected that contains these three factors, possibly with other selected factors. These figures might be much lower in other studies where more factors with a weaker effect are investigated. Bootstrap resampling of this type also provides insight into the interdependencies between different factors by inspecting the bivariate inclusion frequencies (61).
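The resampling loop behind such inclusion frequencies can be sketched as follows. A least-squares model with Wald-type p values stands in here for the Cox model used in the chapter, the number of bootstrap samples is kept small, and all function names are hypothetical:

```python
import numpy as np
from math import erfc, sqrt

def ols_p_values(X, y):
    """Two-sided Wald-type p values for OLS coefficients (normal approx.)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta / np.sqrt(np.diag(cov))
    return np.array([erfc(abs(v) / sqrt(2.0)) for v in z])

def backward_eliminate(X, y, alpha=0.05):
    """Indices of columns surviving backward elimination at level alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:          # everything left is significant
            break
        keep.pop(worst)                # drop the least significant factor
    return set(keep)

def inclusion_frequencies(X, y, n_boot=200, alpha=0.05, seed=1):
    """Repeat the whole selection process in each bootstrap sample."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n patients with replacement
        for j in backward_eliminate(X[idx], y[idx], alpha):
            counts[j] += 1
    return counts / n_boot
```

A strong factor should then be included in (nearly) every bootstrap sample, as lymph nodes are in Table 6, whereas noise factors should appear only rarely.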
VI. CLASSIFICATION AND REGRESSION TREES

Analysis by building a hierarchical tree is one approach to nonparametric modeling of the relationship between a response variable and several potential prognostic factors. Breiman et al. (62) give a comprehensive description of the method of classification and regression trees (CART), which has been modified and extended in various directions (63). We concentrate solely on the application to survival data (64–68) and use the abbreviation CART as a synonym for different types of tree-based analyses. Briefly, the idea of CART is to construct subgroups that are internally as homogeneous as possible with regard to the outcome and externally as separated as possible. Thus, the method leads directly to prognostic subgroups defined by the potential prognostic factors. This is achieved by a recursive tree-building algorithm. As in Section V, we start with k potential prognostic factors Z_1, Z_2, . . . , Z_k that may have an influence on the survival time random variable T. We define a minimum number of patients within a subgroup, n_min say, and prespecify an upper bound for the p values of the logrank test statistic, p_stop. Then the tree-building algorithm is defined by the following steps (42):

1. The minimal p value of the logrank statistic is computed over all k factors and all allowable splits within the factors. An allowable split is given by a cutpoint of a quantitative or an ordinal factor within a given range of the distribution of the factor or by some bipartition of the classes of a nominal factor.
2. The whole group of patients is split into two subgroups based on the factor and the corresponding cutpoint with the minimal p value, provided the minimal p value is smaller than or equal to p_stop.
3. The partition procedure is stopped if no allowable split exists, if the minimal p value is greater than p_stop, or if the size of the subgroup is smaller than n_min.
4. For each of the two resulting subgroups, the procedure is repeated.
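A minimal sketch of these four steps, with the two-sample logrank test computed from scratch, might look as follows. The p value correction for multiple cutpoints discussed in Section IV is omitted here for brevity, so this simplified version favors factors with many cutpoints; all function names are hypothetical:

```python
import numpy as np
from math import erfc, sqrt

def logrank_p(time, event, in_left):
    """Two-sample logrank test; p value via chi-square with 1 df."""
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & in_left).sum()
        d = ((time == t) & event).sum()
        d1 = ((time == t) & event & in_left).sum()
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var <= 0:
        return 1.0
    return erfc(sqrt(o_minus_e ** 2 / var / 2.0))  # P(chi2_1 >= statistic)

def grow_tree(time, event, X, p_stop=0.05, n_min=20):
    """Steps 1-4: split on the factor/cutpoint with minimal logrank p value.

    Allowable cutpoints are restricted to the 10%-90% quantile range of
    each factor; the p value correction of Section IV is omitted.
    """
    best = None
    for j in range(X.shape[1]):
        lo, hi = np.quantile(X[:, j], [0.1, 0.9])
        for c in np.unique(X[:, j]):
            if not (lo <= c < hi):
                continue
            left = X[:, j] <= c
            if left.sum() < n_min or (~left).sum() < n_min:
                continue
            p = logrank_p(time, event, left)
            if best is None or p < best["p"]:
                best = {"var": j, "cut": c, "p": p, "mask": left}
    if best is None or best["p"] > p_stop:       # stopping rule (step 3)
        return {"leaf": True, "n": len(time)}    # final node
    m = best.pop("mask")
    best["leaf"] = False
    best["left"] = grow_tree(time[m], event[m], X[m], p_stop, n_min)
    best["right"] = grow_tree(time[~m], event[~m], X[~m], p_stop, n_min)
    return best
```

With a single strongly prognostic covariate, the sketch recovers the separating cutpoint at the root and stops in the homogeneous subgroups.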
This tree-building algorithm yields a binary tree with a set of patients, a splitting rule, and the minimal p value at each interior node. For the patients in the resulting final nodes, which may again be combined by some amalgamation, various quantities of interest, such as Kaplan-Meier estimates of event-free survival or relative risks with respect to some reference, can be computed. Since the potential prognostic factors are usually measured on different scales, the number of possible partitions will also differ. This leads to the problems that have already been discussed extensively in Section IV. Thus, correction of p values and/or restriction to a set of few prespecified cutpoints may be useful to overcome the problem that factors allowing more splits have a higher chance of being selected by the tree-building algorithm because of multiple testing and may be preferred to binary factors with prognostic relevance. We illustrate the procedure by means of the GBSG-2 study. If we restrict the possible splits to the range between the 10% and 90% quantiles of the empirical distribution of each factor, then the factor age, for example, will allow 25 splits, whereas the binary factor menopausal status will allow only 1 split. Likewise, tumor size will allow 32 possible splits and tumor grade, 2. Lymph nodes will allow 10 possible splits, and progesterone and estrogen receptor offer 182 and 177 possible cutpoints, respectively. Thus, we decide to use the p value correction as outlined in Section IV, and we define n_min = 20 and p_stop = 0.05. As a splitting criterion we use the test statistic of the logrank test; for simplicity, the logrank test was not stratified for hormonal therapy, although this could have been done in principle. We start with the whole group of 686 patients (the ‘‘root’’), in which a total of 299 events (crude event rate 43.6%) has been observed.
The factor with the smallest corrected p value is lymph nodes, and the whole group is split at an estimated cutpoint of nine positive nodes (p_cor < 0.0001), yielding a subgroup of 583 patients with nine or fewer positive nodes (event rate 38.8%) and a subgroup of 103 patients with more than nine positive nodes (event rate 70.9%). The procedure is then repeated with the left node (patients with nine or fewer positive lymph nodes) and the right node (patients with more than nine positive lymph nodes). At this level, in the left node, lymph nodes again appeared to be the strongest factor. With a cutpoint of three positive nodes (p_cor < 0.0001), this yields a subgroup of 376 patients with three or fewer positive nodes (event rate 31.6%) and a subgroup of 207 patients with four to nine positive nodes (event rate 51.7%). For the right node (patients with more than nine positive nodes), progesterone receptor is associated with the smallest corrected p value, and the cutpoint is obtained as 23 fmol (p = 0.0003). This yields subgroups of 43 patients (progesterone receptor > 23, event rate 51.2%) and 60 patients (progesterone receptor ≤ 23, event rate 85%), respectively. In these two subgroups, no further splits are possible because of the p_stop criterion; thus they are regarded as final nodes.
The subgroups of patients with one to three and four to nine positive nodes allow further splits; again, progesterone receptor is the strongest factor, with cutpoints of 90 fmol (p_cor = 0.006) and 55 fmol (p_cor = 0.0018), respectively. Because of the p_stop criterion, no further splits are possible, and the resulting subgroups are considered final nodes, too. The result of the tree-building procedure is summarized in Figure 5. In this graphical representation, the width of the boxes is proportional to the size of the subgroups, whereas the centers of the boxes correspond to the observed event rates. This presentation gives an immediate visual impression of the resulting prognostic classification obtained by the final nodes of the tree. As already outlined above, a variety of CART-type algorithms exists, usually consisting of tree building, pruning, and amalgamation (62,63,69,70). We present a somewhat different algorithm that concentrates on the tree-building process. To protect against serious overfitting of the data (which in other algorithms is accomplished by tree pruning), we impose various restrictions, such as the p_stop and n_min criteria and the use of corrected p values. Applying these restrictions, we obtain the tree displayed in Figure 5, which is parsimonious in the sense that only the strongest factors, lymph nodes and the progesterone
Figure 5 Classification and regression tree obtained for the GBSG-2 study; p value correction but no prespecification of cutpoints was used.
receptor, are selected for the splits. However, the values of the cutpoints obtained for progesterone receptor (90, 55, and 23 fmol) are somewhat arbitrary and may not be reproducible and/or comparable with those obtained in other studies. Thus, another useful restriction may be the definition of a set of prespecified possible cutpoints for each factor. In the GBSG-2 study we used 35, 40, 45, 50, 55, 60, 65, and 70 years for age; 10, 20, 30, and 40 mm for tumor size; and 5, 10, 20, 100, and 300 fmol for progesterone and estrogen receptors. The resulting tree is displayed in Figure 6A. It differs from the one without this restriction only in that the selected cutpoints for the progesterone receptor are now
Figure 6 Classification and regression trees obtained for the GBSG-2 study. p Value correction and prespecification of cutpoints (A); no p value correction with (B) and without (C) prespecification of cutpoints.
100, 20, and 20 fmol in the final nodes. For comparison, the trees obtained without the p value correction, with and without prespecification of a set of possible cutpoints, are presented in Figure 6, B and C. Since lymph nodes and progesterone receptor are the dominating prognostic factors in this patient population, the resulting trees are identical at the first two levels to those where the p values have been corrected. Without the p value correction, however, the nodes that were final in the corrected trees are split further, leading to a larger number of final nodes. In addition, other factors such as age, tumor size, and estrogen receptor are now also used for the splits at subsequent nodes. A more detailed investigation of the influence of p value correction and prespecification of possible cutpoints on the resulting trees and their stability is given by Sauerbrei (71).
VII. FORMATION AND VALIDATION OF RISK GROUPS

The final nodes of a regression tree define a prognostic classification scheme per se; to be useful in practice, however, some combination of final nodes into prognostic subgroups might be indicated. This is especially important if the number of final nodes is large and/or if the prognosis of patients in different final nodes is comparable. For example, from the regression tree presented in Figure 6A (p value correction and predefined cutpoints), the prognostic classification given in Table 7 can be derived, which is very much in agreement with current knowledge about the prognosis of node-positive breast cancer patients. The definition of subgroups III and IV reflects that patients with more than nine positive lymph nodes can still be further separated by other prognostic factors, in particular by subdivision into progesterone-positive (>20) and -negative patients (72), and that progesterone-negative patients with four to nine positive lymph nodes have a similarly poor prognosis as progesterone-positive patients with more than nine
Table 7 Prognostic Classification Scheme Derived from the Regression Tree (p Value Correction and Predefined Cutpoints) in the GBSG-2 Study

Prognostic subgroup    Definition of subgroup
I                      LN ≤ 3 and PR > 100
II                     (LN ≤ 3 and PR ≤ 100) or (LN 4–9 and PR > 20)
III                    (LN 4–9 and PR ≤ 20) or (LN > 9 and PR > 20)
IV                     LN > 9 and PR ≤ 20

LN, no. of positive lymph nodes; PR, progesterone receptor.
positive lymph nodes. Among the other patients, subgroup I, with a relatively favorable prognosis, can be defined by one to three positive lymph nodes and a ‘‘markedly’’ positive progesterone receptor (>100). The results in terms of estimated event-free survival are displayed in Figure 7A; the Kaplan-Meier curves show a good separation of the four prognostic subgroups. Since in other studies or in clinical practice progesterone receptor may often be recorded only as positive or negative, the prognostic classification scheme in Table 7 may be modified so that the definitions of subgroups I and II are replaced by I*: (LN ≤ 3 and PR > 20) and II*: (LN ≤ 3 and PR ≤ 20) or (LN 4–9 and PR > 20), respectively, where LN is lymph nodes and PR is progesterone receptor, since 20 fmol is a more commonly agreed cutpoint. The resulting Kaplan-Meier estimates of event-free survival are depicted in Figure 7B. For two of the regression approaches outlined in Section V, prognostic subgroups have been formed by dividing the distribution of the so-called prognostic index, β̂_1 Z_1 + ⋅ ⋅ ⋅ + β̂_k Z_k, into quartiles. The results in terms of estimated event-free survival are displayed in Figure 8A (Cox regression model with continuous factors, BE(0.05), Table 3) and in Figure 8B (Cox regression model with categorized covariates, BE(0.05), Table 4). It should be noted that in the definition of the corresponding subgroups, tumor grade enters in addition to lymph nodes and progesterone receptor.
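Forming quartile groups from a fitted prognostic index and estimating event-free survival within each group can be sketched as follows (plain NumPy; function names are hypothetical, and the fitted coefficients β̂ are assumed given):

```python
import numpy as np

def km_estimate(time, event):
    """Kaplan-Meier estimate; returns event times and survival probabilities."""
    times = np.unique(time[event])
    surv, s = [], 1.0
    for t in times:
        n = (time >= t).sum()               # number at risk just before t
        d = ((time == t) & event).sum()     # events at t
        s *= 1.0 - d / n
        surv.append(s)
    return times, np.array(surv)

def quartile_groups(prognostic_index):
    """Assign group 0-3 by the quartiles of the prognostic index."""
    q = np.quantile(prognostic_index, [0.25, 0.5, 0.75])
    return np.searchsorted(q, prognostic_index, side="right")
```

Given a covariate matrix Z and estimated coefficients beta_hat, the prognostic index is simply Z @ beta_hat; km_estimate is then applied within each of the four groups to produce curves like those in Figure 8.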
Figure 7 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from the CART approach (A) and the modified CART approach (B) in the GBSG-2 study.
Figure 8 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from a Cox model with continuous (A) and categorized (B) covariates and according to the Nottingham Prognostic Index (C) in the GBSG-2 study.
For comparison, Figure 8C shows the Kaplan-Meier estimates of event-free survival for the well-known Nottingham Prognostic Index (NPI) (73,74), which is the only prognostic classification scheme based on standard prognostic factors that enjoys widespread acceptance (75). This index is defined as

NPI = 0.02 × size (in mm) + lymph node stage + tumor grade

where lymph node stage is equal to 1 for node-negative patients, 2 for patients with one to three positive lymph nodes, and 3 if four or more lymph nodes were involved. It is usually divided into three prognostic subgroups: NPI-I (NPI < 3.4), NPI-II (3.4 ≤ NPI ≤ 5.4), and NPI-III (NPI > 5.4). Since it was developed for node-negative and node-positive patients, there seems to be room for improvement by taking other factors (e.g., progesterone receptor) into account (76).
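Because the NPI is a closed-form score, its computation and the three-group classification can be transcribed directly from the definition above (function names are hypothetical):

```python
def nottingham_prognostic_index(size_mm, n_positive_nodes, grade):
    """NPI = 0.02 * tumor size (mm) + lymph node stage + tumor grade."""
    if n_positive_nodes == 0:
        stage = 1
    elif n_positive_nodes <= 3:
        stage = 2
    else:
        stage = 3
    return 0.02 * size_mm + stage + grade

def npi_group(npi):
    """Usual three-group classification of the NPI."""
    if npi < 3.4:
        return "NPI-I"
    if npi <= 5.4:
        return "NPI-II"
    return "NPI-III"
```

For example, a node-negative, grade 1 tumor of 10 mm gives NPI = 0.2 + 1 + 1 = 2.2, falling in the favorable NPI-I group.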
Since the NPI has been validated in various other studies (75), we can argue that the degree of separation displayed in Figure 8C could be achieved in general. This, however, is by no means true for the other proposals derived by regression modeling or CART techniques, where some shrinkage has to be expected (46,77,78). We therefore attempted to validate the prognostic classification schemes defined above with the data of an independent study, which, in more technical terms, is often referred to as a ‘‘test set’’ (79). As a test set we take the Freiburg DNA study, which covers the same patient population and in addition comprises the same prognostic factors as the GBSG-2 study. Some complications have to be resolved, however. Only progesterone and estrogen receptor status (positive, >20 fmol; negative, ≤20 fmol) is recorded in the Freiburg DNA study; the original values are not available. Thus, only those classification schemes where progesterone receptor enters as positive or negative can be considered for validation. Furthermore, we restrict ourselves to those patients for whom the required information on prognostic factors is complete. Table 8A shows the estimated relative risks for the prognostic groups derived from the categorized Cox model and from the modified CART classification scheme defined above. The relative risks have been estimated by using dummy variables defining the risk groups and by taking the group with the best prognosis as reference. When applying the classification schemes to the data of the Freiburg DNA study, the definitions and
Table 8A Estimated Relative Risks for Various Prognostic Classification Schemes Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study

                        Estimated relative risks (no. of patients)
Prognostic group        GBSG-2 study        Freiburg DNA study
Cox   I                 1     (52)          1     (33)
      II                2.68  (218)         1.78  (26)
      III               3.95  (236)         3.52  (58)
      IV                9.92  (180)         7.13  (14)
CART  I*                1     (243)         1     (50)
      II*               1.82  (253)         1.99  (38)
      III               3.48  (133)         3.19  (33)
      IV                8.20  (57)          4.34  (11)
NPI   II                1     (367)         1     (46)
      III               2.15  (301)         2.91  (87)
categorization derived in the GBSG-2 study are used. Note that the categorization into quartiles of the prognostic index does not yield groups with equal numbers of patients, since the prognostic index from the categorized Cox model takes only a few different values. From the values given in Table 8A, it can be seen that there is some shrinkage in the relative risks when they are estimated in the Freiburg DNA study, which we used as a test set. This shrinkage is more pronounced in the modified CART classification scheme (reduction by the factor 0.53 in the high-risk group) than in the categorized Cox model (reduction by the factor 0.72 in the high-risk group). To get some idea of the amount of shrinkage to be anticipated in a test set, based on the original data where the classification scheme has been developed, the so-called training set (79), cross-validation or other resampling methods can be used. For classification schemes derived by regression modeling, techniques similar to those outlined in Section IV can be used. These consist essentially of estimating a shrinkage factor for the prognostic index (44,46). The relative risks for the prognostic subgroups are then estimated by categorizing the shrunken prognostic index according to the cutpoints used in the original data. In the GBSG-2 study we obtained an estimated shrinkage factor of ĉ = 0.95 for the prognostic index derived from the categorized Cox model, indicating that we would not expect serious shrinkage of the relative risks between the prognostic subgroups. Compared with the estimated relative risks in the Freiburg DNA study (Table 8A), it is clear that the shrinkage effect in the test set can be predicted only to a limited extent. This deserves at least two comments. First, we used leave-one-out cross-validation, which possibly could be improved by bootstrap or other resampling methods (45); second, we did not take the variable selection process into account.
By doing so, we would expect more realistic estimates of the shrinkage effect in an independent study. Similar techniques can in principle also be applied to classification schemes derived by CART methods. How to do this best, however, is still a matter of ongoing research (71).
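For orientation, a crude alternative to resampling is a well-known heuristic global shrinkage factor computed from the model likelihood-ratio χ² statistic and its degrees of freedom; the shrunken index is then categorized with the original cutpoints. This is a deliberate simplification, not the leave-one-out procedure used in the chapter, and all names are hypothetical:

```python
import numpy as np

def heuristic_shrinkage(model_chi2, df):
    """Heuristic global shrinkage factor c = (chi2 - df) / chi2.

    model_chi2 is the likelihood-ratio statistic of the fitted model and
    df its degrees of freedom; nonpositive values are truncated to 0.
    """
    return max(0.0, (model_chi2 - df) / model_chi2)

def shrunken_risk_groups(prognostic_index, c, cutpoints):
    """Categorize the shrunken index c * PI with the original cutpoints."""
    return np.searchsorted(cutpoints, c * np.asarray(prognostic_index))
```

A model with χ² = 100 on 5 degrees of freedom, for instance, gives c = 0.95; a weak model with χ² close to its degrees of freedom gives c near 0, warning that the prognostic index would separate poorly in new data.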
VIII. ARTIFICIAL NEURAL NETWORKS

In recent years, the application of artificial neural networks (ANNs) for prognostic and diagnostic classification in clinical medicine has attracted growing interest in the medical literature. For example, a ‘‘miniseries’’ on neural networks that appeared in the Lancet contained three more or less enthusiastic review articles (80–82) and an additional commentary expressing some scepticism (83). In particular, feed-forward neural networks have been used extensively, often accompanied by exaggerated statements of their potential. In a recent review article (84), we identified a substantial number of articles applying ANNs to prognostic classification in oncology.
The relationship between ANNs and statistical methods, especially logistic regression models, has been described in several articles (85–90). Briefly, the conditional probability that a binary outcome variable Y is equal to one, given the values of k prognostic factors Z_1, Z_2, . . . , Z_k, is given by a function f(Z, w). In feed-forward neural networks, this function is defined by

f(Z, w) = Λ ( W_0 + Σ_{j=1}^{r} W_j ⋅ Λ ( w_{0j} + Σ_{i=1}^{k} w_{ij} Z_i ) )
where w = (W_0, . . . , W_r, w_{01}, . . . , w_{kr}) are unknown parameters (called ‘‘weights’’) and Λ(⋅) denotes the logistic function (Λ(u) = (1 + exp(−u))^{−1}), called the ‘‘activation function.’’ The weights w can be estimated from the data via maximum likelihood, although other optimization procedures are often used in this framework. The ANN is usually introduced by a graphical representation like that in Figure 9. This figure illustrates a feed-forward neural network with one hidden layer. The network consists of k input units, r hidden units, and one output unit and corresponds to the ANN with f(Z, w) defined above. The arrows indicate the ‘‘flow of information.’’ If there is no hidden layer (r = 0), the ANN reduces to a common logistic regression model, which is also called the ‘‘logistic perceptron.’’ In general, feed-forward neural networks with one hidden layer are universal approximators (91) and thus can approximate any function defined by the conditional probability that Y is equal to one given Z with arbitrary precision by
Figure 9 Graphical representation of an artificial neural network with one input, one hidden, and one output layer.
increasing the number of hidden units. This flexibility can lead to serious overfitting, which can in turn be compensated by introducing some weight decay (79,92), that is, by adding a penalty term
−λ ( Σ_{j=1}^{r} W_j^2 + Σ_{j=1}^{r} Σ_{i=1}^{k} w_{ij}^2 )
to the log-likelihood. The smoothness of the resulting function is then controlled by the decay parameter λ. It is interesting to note that in our literature review of articles published between 1991 and 1995, we did not find any application in oncology where weight decay had been used (84). Extension to survival data with censored observations is associated with various problems. Although there is a relatively straightforward extension of ANNs to handle grouped survival data (93), several naive proposals can be found in the literature. To predict outcome (death or recurrence) of individual breast cancer patients, Ravdin and Clark (94) and Ravdin et al. (95) used a network with only one output unit but with the number j of the time interval as an additional input. Moreover, they consider the unconditional probability of dying before t_j, rather than the conditional one, as output. Their underlying model then reads
P(T < t_j | Z) = Λ ( w_0 + Σ_{i=1}^{k} w_i Z_i + w_{k+1} ⋅ j )
for j = 1, . . . , J. T denotes again the survival time random variable, and the time intervals are defined through t_{j−1} ≤ t < t_j, 0 = t_0 < t_1 < ⋅ ⋅ ⋅ < t_J < ∞. This parameterization ensures monotonicity of the survival probabilities but also implies a rather stringent and unusual shape of the survival distribution, since in the case that no covariates are considered it reduces to P(T < t_j) = Λ(w_0 + w_{k+1} ⋅ j) for j = 1, . . . , J. Obviously, the survival probabilities do not depend on the length of the time intervals, which is a rather strange and undesirable feature. Including a hidden layer in this expression is a straightforward extension retaining all the features summarized above. De Laurentiis and Ravdin (96) call this type of neural network a ‘‘time-coded model.’’ Another form of neural network that has been applied to survival data is the so-called single time point model (96). Since it is identical to a logistic perceptron or a feed-forward neural network with a hidden layer, it corresponds to fitting logistic regression models or their generalizations to survival data. In practice, a single time point t* is fixed
and the network is trained to predict the survival probability. The corresponding model is given by
P(T < t* | Z) = Λ ( w_0 + Σ_{i=1}^{k} w_i Z_i )
or its generalization when a hidden layer is introduced. This approach is used by Burke (97) to predict 10-year survival of breast cancer patients based on various patient and tumor characteristics at the time of primary diagnosis. McGuire et al. (98) used this approach to predict 5-year event-free survival of patients with axillary node-negative breast cancer based on seven potentially prognostic variables. Kappen and Neijt (99) used it to predict 2-year survival of patients with advanced ovarian cancer from 17 pretreatment characteristics. The neural network they actually used reduced to a logistic perceptron. Of course, such a procedure can be applied repeatedly for the prediction of survival probabilities at fixed time points t_1 < t_2 < ⋅ ⋅ ⋅ < t_J. For example, Kappen and Neijt (99) trained several (J = 6) neural networks to predict survival of patients with ovarian cancer after 1, 2, . . . , 6 years. The corresponding model reads
P(T < t_j | Z) = Λ ( w_{0j} + Σ_{i=1}^{k} w_{ij} Z_i )
in the case that no hidden layer is introduced. Note that without restrictions on the parameters, such an approach does not guarantee that the probabilities P(T < t_j | Z) increase with j and hence may result in life-table estimators suggesting a nonmonotone survival function. Closely related to such an approach are the so-called multiple time point models (96), where one neural network with J output units, with or without a hidden layer, is used. The common drawback of these naive approaches is that they do not allow one to incorporate censored observations in a straightforward manner, which is closely related to the fact that they are based on unconditional survival probabilities instead of conditional survival probabilities, as in the Cox model. Neither omission of the censored observations (as suggested by Burke (97)) nor treating censored observations as uncensored is a valid approach; both are a serious source of bias, which is well known in the statistical literature. De Laurentiis and Ravdin (96) propose imputing estimated conditional survival probabilities for the censored cases from a Cox regression model; that is, they use a well-established statistical
Figure 10 Estimated event-free survival probabilities at 2 years vs. at 1 year for various artificial neural networks in the GBSG-2 study.
procedure just to make an artificial neural network work. The latter approach is also used by Ripley (92), who emphasized that the resulting bias may be negligible. We illustrate some of the points made above with data from the GBSG-2 study. First, we used the approach of single time point models for the prediction of event-free survival at 1, 2, 3, 4, and 5 years, respectively. All seven prognostic factors were considered as quantitative covariates, except menopausal status, and were scaled to the interval [0, 1]; together with hormonal therapy, they were used as inputs for the neural nets. Censored observations occurring before the corresponding time points were omitted. Figure 10 shows the results of various ANNs in terms of estimated event-free survival at 2 years versus estimated event-free survival at 1 year for those 623 patients who were not censored before 2 years. The ANN with no hidden unit corresponds to an ordinary logistic regression model for event-free survival at 1 and 2 years, respectively. It can be seen that in this model the estimated event-free survival probabilities are still monotone. We then increased the number of hidden units to 2 and 5 and varied the degree of weight decay, resulting in severe violations of monotonicity of the estimated event-free survival probabilities for a considerable number of patients. For a decay parameter of λ = 0.1, we then obtain nearly the same results as for the logistic regression model (9 parameters), although five hidden units and a corresponding number of 42 additional parameters were introduced. In a second stage, we illustrate the impact of insufficient handling of censored observations. Figure 11, A and B, shows the estimated event-free survival probabilities at 5 years when censored observations are omitted or replaced by imputed values from a Cox model, respectively. For this imputation, we used the Cox model presented in Table 4 (full model).
Both ways of handling censored observations are contrasted with the estimated event-free survival probabilities at 5 years derived from the final FP Cox model (Sec. V, Table 5). The figures demonstrate the possible bias resulting from the omission of censored observations; the bias is smaller for the imputation method, but this still may not be considered a fully satisfactory method. We therefore come to a third approach, originally suggested by Faraggi and Simon (100) and extended by others (101). The idea is to replace the function exp(β_1 Z_1 + ⋅ ⋅ ⋅ + β_k Z_k) in the definition of the Cox model by a more flexible function motivated by the function f(Z, w) used in ANNs. This leads to a neural network generalization of the Cox regression model defined by
Figure 11 Estimated event-free survival probabilities at 5 years when censored observations are omitted (A) and replaced by imputed values (B) vs. estimated event-free survival probabilities at 5 years obtained from the final FP Cox model in the GBSG-2 study.
Prognostic Factor Studies
357
λ(t | Z1, . . . , Zk) = λ0(t) exp( fFS(Z, w) )

where

fFS(Z, w) = Σ_{j=1}^{r} Wj Λ( w0j + Σ_{i=1}^{k} wij Zi )
Note that the constant W0 is omitted in the framework of the Cox model. Estimation of the weights is then done by maximizing the partial likelihood, which includes the correct and usual handling of censored observations in that these patients contribute to the partial likelihood as long as they are at risk. Although the problem of censoring is satisfactorily solved in this approach, problems remain with potentially serious overfitting of the data, especially if the number r of hidden units is large. For illustration, we again used the data of the GBSG-2 study, where we applied some preselection of variables in that we took those factors that were included in the final FP model (Sec. V, Table 5). Thus, we used the four factors age, grade, lymph nodes, and progesterone receptor (all scaled to the interval [0, 1]), and hormone therapy as inputs for the Faraggi and Simon (F&S) network. Figure 12 shows the results for various F&S networks compared with the FP approach in terms of Kaplan-Meier estimates of event-free survival in the prognostic subgroups defined by the quartiles of the corresponding prognostic indices. It should be noted that the F&S network contains 5 + (6 × 5) = 35 parameters when 5 hidden units are used and 20 + (6 × 20) = 140 when 20 hidden units are used. The latter must be suspected of serious overfitting, with a high chance that the degree of separation achieved could never be reproduced in other studies. To highlight this phenomenon, we trained a slightly different F&S network where, in addition to age, tumor size, tumor grade, and number of lymph nodes, estrogen and progesterone receptor status (positive, >20 fmol; negative, ≤20 fmol) were used as inputs. This network contained 20 hidden units (20 + (7 × 20) = 160 parameters) and showed a separation similar to the one where estrogen and progesterone receptor entered as quantitative inputs.
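To make the bookkeeping concrete, here is a small Python sketch of the F&S predictor and its parameter count. This is our own illustration, not code from the chapter; in particular, the activation Λ is assumed to be the logistic function, and all identifier names are ours.

```python
import math

def f_fs(z, W, w0, w):
    """Predictor f_FS(Z, w) = sum_{j=1}^{r} W_j * L(w_0j + sum_{i=1}^{k} w_ij z_i).
    The activation L is taken as the logistic function (our assumption for
    the unspecified Lambda in the text)."""
    logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
    return sum(
        Wj * logistic(w0j + sum(wij * zi for wij, zi in zip(wj, z)))
        for Wj, w0j, wj in zip(W, w0, w)
    )

def n_parameters(k, r):
    """Weight count of an F&S network with k inputs and r hidden units:
    r output weights W_j plus r * (k + 1) hidden-layer weights w_0j, ..., w_kj."""
    return r + r * (k + 1)
```

The counts quoted above follow directly: n_parameters(5, 5) gives 35, n_parameters(5, 20) gives 140, and n_parameters(6, 20) gives 160.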
Table 8B contrasts the results from the GBSG-2 study used as training set and the Freiburg DNA study used as test set in terms of estimated relative risks, where the predicted event-free survival probabilities are categorized in quartiles. In the training set, we observe a 20-fold increase in risk between the high-risk and the low-risk group, whereas the F&S network turns out to yield a completely useless prognostic clas-
Figure 12 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from various Faraggi & Simon networks and from the FP approach in the GBSG-2 study.
358
Schumacher et al.
Table 8B  Estimated Relative Risks for Various Prognostic Classification Schemes Based on F&S Neural Networks Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study

                       Estimated relative risks (no. of patients)
Prognostic groups    GBSG-2 study        Freiburg DNA study
F&S*
  I                  1      (179)        1      (37)
  II                 3.24   (178)        0.34   (16)
  III                7.00   (159)        0.98   (38)
  IV                 22.03  (170)        1.39   (35)
F&S†
  I                  1      (171)        1      (23)
  II                 1.45   (172)        1.57   (25)
  III                2.62   (171)        3.09   (32)
  IV                 5.75   (172)        4.27   (46)
F&S‡
  I                  1      (172)        1      (20)
  II                 1.64   (171)        1.03   (31)
  III                3.27   (171)        1.89   (28)
  IV                 8.49   (172)        2.72   (47)
F&S§
  I                  1      (172)        1      (23)
  II                 1.46   (171)        1.57   (25)
  III                2.62   (171)        3.22   (33)
  IV                 5.77   (172)        4.14   (45)

* Twenty hidden units, weight decay = 0.
† Twenty hidden units, weight decay = 0.1.
‡ Five hidden units, weight decay = 0.
§ Five hidden units, weight decay = 0.1.
sification scheme in the test set, where the estimated relative risks are not even monotone increasing. It is obvious that some restrictions, either in terms of a maximum number of parameters or by using some weight decay, are absolutely necessary to avoid the degree of overfitting observed in the two prognostic classification schemes based on F&S networks where weight decay was not applied. The results for an F&S network with five hidden units are very much comparable with the FP approach, especially when some weight decay is introduced. It should be noted that the FP approach contains at most eight parameters if we ignore the preselection of the four factors.
Summarizing our experience with ANNs for prognostic classification, it should be emphasized that they have to be regarded as very flexible nonlinear regression models deserving the same careful model building as other statistical models of similar flexibility. In particular, the dangers of serious overfitting have to be taken into account. When applying ANNs to survival data, one has to take care that the standard requirements for such data, such as the proper incorporation of censored observations or the modeling of conditional survival probabilities, are met. In our literature survey we did not find a satisfactory application (84), although some progress has been made in recent methodological contributions (92,101,102).
IX. ASSESSMENT OF PROGNOSTIC CLASSIFICATION SCHEMES

Once a prognostic classification scheme is developed and defined, the question arises how its predictive ability can be assessed and how its performance can be compared with that of competitors. It is interesting to note that there is no commonly agreed approach available in the statistical literature, and most measures that are used have some ad hoc character. Suppose that a prognostic classification scheme consists of g prognostic groups (called risk strata or risk groups); then one common approach is to present the Kaplan-Meier estimates for event-free or overall survival in the g groups. This is the way in which we also presented the results of prognostic classification schemes derived by various statistical methods in previous sections. The resulting figures are often accompanied by p values of the logrank test for the null hypothesis that the survival functions in the g risk strata are equal. It is clear that a significant result is a necessary but not a sufficient condition for good predictive ability. Sometimes, a Cox model using dummy variates for the risk strata is fitted and the log-likelihood and/or estimated relative risks of risk strata with respect to a reference are given. Recently, we proposed a summary measure of separation (36) defined as
SEP = exp[ Σ_{j=1}^{g} (nj / n) |β̂j| ]
where nj denotes the number of patients in risk stratum j and β̂j is the estimated log-hazard ratio or log-relative risk of patients in risk stratum j with respect to a baseline reference. In particular, we used the baseline reference estimated in a Cox model where the dummy variates for risk strata were centered to have mean zero. SEP is the weighted geometric mean of "absolute" relative risks between strata and baseline, "absolute" meaning that 1/RR replaces RR for relative risks
360
Schumacher et al.
RR < 1 (103). Often this model-based baseline reference turns out to be very similar to the estimated marginal distribution of T, i.e., to the pooled Kaplan-Meier estimate Ŝ(t). Therefore, SEP essentially compares risks within strata with the risk in the entire population. In fact, the pooled Kaplan-Meier estimate has been used previously as baseline reference (36), although the model-based approach may be preferable for formal reasons. In this section, we use the GBSG-4 study for illustration. In the node-negative breast cancer patients we compare the NPI (73,74), which has already been defined in Section VII, and two classification schemes that have been derived from a Cox regression model and a CART approach, respectively. The first scheme is defined through a simplified prognostic index

COX = I(age ≤ 40 years) + I(size > 20 mm) + grade

where I(⋅) denotes the indicator function, equal to 1 if "⋅" holds true and 0 otherwise. Three risk groups I, II, and III (with numbers of patients in the GBSG-4 study) are defined through COX = 1, 2 (n1 = 277), COX = 3 (n2 = 205), and COX = 4, 5 (n3 = 121), respectively. For the second scheme, four risk groups have been obtained (36), given as

CART I: grade 1 and age ≤ 60 years (n1 = 78),
CART II: size ≤ 20 mm and [(grade 2–3 and age > 40 years) or (grade 1 and age > 60 years)] (n2 = 222),
CART III: (age ≤ 40 years and grade 2–3) or (size > 20 mm and age > 60 years and grade 1) or (size > 20 mm and grade 2–3 and estrogen receptor ≤ 300 fmol) (n3 = 284),
CART IV: size > 20 mm and grade 2–3 and estrogen receptor > 300 fmol (n4 = 19).

As can be seen from the numbers of patients, this leads to two relatively large medium-risk strata, a smaller low-risk stratum, and a very small high-risk stratum in the GBSG-4 study. Figure 13, A–C, shows the Kaplan-Meier estimates in the various risk strata corresponding to the three prognostic classification schemes.
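Both the SEP measure and the simplified COX index are easy to compute. The following Python sketch is our own illustration; the function names are ours, and the numbers used in any example call are hypothetical, not study data.

```python
import math

def sep(n_per_stratum, beta_hat):
    """SEP = exp( sum_j (n_j / n) * |beta_hat_j| ): the weighted geometric
    mean of 'absolute' relative risks between risk strata and the baseline."""
    n = sum(n_per_stratum)
    return math.exp(sum(nj / n * abs(bj)
                        for nj, bj in zip(n_per_stratum, beta_hat)))

def cox_index(age, size_mm, grade):
    """Simplified prognostic index COX = I(age <= 40) + I(size > 20 mm) + grade,
    with grade in {1, 2, 3}."""
    return int(age <= 40) + int(size_mm > 20) + grade

def cox_risk_group(age, size_mm, grade):
    """Risk groups of the simplified COX index: {1, 2} -> I, 3 -> II, {4, 5} -> III."""
    return {1: "I", 2: "I", 3: "II", 4: "III", 5: "III"}[cox_index(age, size_mm, grade)]
```

For two equally sized strata with log-relative risks ±log 2, SEP equals 2, the common "absolute" relative risk; a single stratum at baseline gives SEP = 1.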
Table 9A summarizes the results of some ad hoc measures applied to the data of the GBSG-4 study. For all three prognostic classification schemes considered, the p values of the logrank test are highly significant (p < 0.0001). Thus, this measure does not prove to be particularly useful. There is some improvement in the log-likelihood for the NPI and some further improvement for the simplified COX index. The CART classification scheme shows the best result. A formal comparison, however, is hampered by the fact that the corresponding regression models are not nested. The summary measure SEP yields an average absolute risk with respect to baseline of about 1.56 for the NPI and 1.55 for the simplified COX index. With a value of 1.71, the CART classification scheme again shows the best performance.
Figure 13 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups according to the Nottingham Prognostic Index (A), derived by a Cox model (B), and a CART approach (C) in the GBSG-4 study in node-negative breast cancer.
Table 9A  Ad Hoc Measures for Predictive Ability of Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Ad hoc measure    Pooled Kaplan-Meier    NPI        COX index    CART index
p value           —                      <0.0001    <0.0001      <0.0001
−2 log L          1856.1                 1827.9     1817.3       1802.9
SEP               1                      1.559      1.550        1.710
Since these ad hoc measures are only of limited value, we now briefly outline some recent developments; a detailed description can be found elsewhere (103). First, it is of central importance to recognize that the time-to-event itself cannot adequately be predicted (104–108). The best one can do at t = 0 is to try to estimate the probability that the event of interest will not occur until a prespecified time horizon represented by some time point t*, given the available covariate information for a particular patient at t = 0. Consequently, a measure of inaccuracy that aims to assess the value of a given prognostic classification scheme should compare the estimated event-free probabilities with the observed individual outcome. Thus, we consider an approach based directly on the estimates of event-free probabilities S(t* | Z = z) for patients with Z = z. As outlined above, it is the aim of a prognostic classification scheme to provide estimated event-free probabilities Ŝ(t* | j) for patients in risk stratum j (j = 1, . . . , g). These estimated probabilities may be used as predictions of the event status Y = I(T > t*). To determine the mean square error of prediction in this case, the observed survival or event status at t*, Y = I(T > t*), has to be compared with the estimated probability, Ŝ(t* | j), leading to

BS(t*) = (1/n) Σ_{i=1}^{n} ( I(Ti > t*) − Ŝ(t* | ji) )²
where the sum goes over all n patients. This quantity is known as the quadratic score. Multiplied by a factor of 2 (omitted here for simplicity), it is equal to the Brier score, which was originally developed for judging the inaccuracy of probabilistic weather forecasts (109–111). The expected value of the Brier score may be interpreted as a mean square error of prediction if the event status at t* is predicted by the estimated event-free probabilities Ŝ(t* | j). In the extreme case where the estimated event-free probabilities are 0 or 1 for all patients (this corresponds to the assertion that the event-free status at t* can be predicted without error), BS(t*) will be zero if Ŝ(t* | j) coincides with the observed event status. It will attain its maximum value of 1 only if the estimated event-free probabilities happen to be equal to 1 minus the observed event status for all patients. In the absence of any knowledge about the disease under study, a trivial constant prediction Ŝ(t*) = 0.5 for all patients would be the most plausible approach. This yields a Brier score equal to 0.25. If some closer relationship to the likelihood is intended, the so-called logarithmic score may be preferred. This is given by

LS(t*) = −(1/n) Σ_{i=1}^{n} { I(Ti > t*) log Ŝ(t* | ji) + I(Ti ≤ t*) log(1 − Ŝ(t* | ji)) }
where we adopt the conventions "0 ⋅ log 0 = 0" and "1 ⋅ log 0 = −∞." Hence LS(t*) is equal to zero in the extreme situation where the estimated event-free probabilities Ŝ(t* | ji) are 0 or 1 for all patients and coincide with their observed event status I(Ti > t*). It will attain infinity if the estimated event-free probability happens to be equal to I(Ti ≤ t*) for at least one patient (111,112). If we do not wish to restrict ourselves to one fixed time point t*, we can consider both the Brier score and the logarithmic score as functions of time for 0 ≤ t ≤ t*. Such a function can also be averaged over time, i.e., for t ∈ [0, t*], by integrating it with respect to some weight function W(t) (103). So far, censoring has not been taken into account in the definition of either measure of inaccuracy. How to do that in such a way that the resulting measures are still consistent estimates of the population quantities (in the case of the Brier score, the mean square error of prediction) is not a trivial problem. It can, however, be solved by reweighting the individual contributions in a similar way as in the calculation of the Kaplan-Meier estimator. Uncensored observations and observations censored after t* are reweighted by the reciprocal of the Kaplan-Meier estimate of the censoring distribution, whereas observations censored before t* get weight zero. With this weighting scheme, a Brier or a logarithmic score under random censorship can be defined that enjoys the desirable statistical properties (103). Using these scores, R²-type measures (113–116) can also be readily defined by relating the Brier or the logarithmic score for a prognostic classification scheme to the score obtained when the pooled Kaplan-Meier estimate is used as a "universal" prediction for all patients. We calculated the Brier and the logarithmic score for the data of the GBSG-4 study in node-negative breast cancer.
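The scores just defined can be sketched in a few lines of Python. This is our own illustrative code, not the authors' implementation; the function names are ours, and the censoring-weighted version is a simplification that handles ties and left limits only approximately.

```python
import math

def brier_score(t_star, times, surv_pred):
    """Quadratic score BS(t*) = (1/n) * sum_i (I(T_i > t*) - S_hat(t*|j_i))^2,
    ignoring censoring (the uncensored definition in the text)."""
    n = len(times)
    return sum((float(t > t_star) - s) ** 2 for t, s in zip(times, surv_pred)) / n

def logarithmic_score(t_star, times, surv_pred, eps=1e-12):
    """LS(t*) = -(1/n) * sum_i [I(T_i > t*) log S_hat + I(T_i <= t*) log(1 - S_hat)];
    eps guards log(0), mirroring the conventions 0*log 0 = 0 and 1*log 0 = -inf."""
    total = 0.0
    for t, s in zip(times, surv_pred):
        total += math.log(max(s, eps)) if t > t_star else math.log(max(1.0 - s, eps))
    return -total / len(times)

def km_censoring(times, events):
    """Kaplan-Meier estimate G_hat of the censoring distribution:
    censored observations (event == 0) are treated as the 'events'."""
    pairs = sorted(zip(times, events))
    def G(t):
        surv = 1.0
        for ti, di in pairs:
            if ti > t:
                break
            if di == 0:
                n_risk = sum(1 for tj in times if tj >= ti)
                surv *= 1.0 - 1.0 / n_risk
        return surv
    return G

def brier_score_censored(t_star, times, events, surv_pred):
    """Censoring-weighted Brier score: events before t* are weighted by
    1/G_hat(T_i), observations still event-free at t* by 1/G_hat(t*), and
    observations censored before t* get weight zero."""
    G = km_censoring(times, events)
    total = 0.0
    for t, d, s in zip(times, events, surv_pred):
        if t <= t_star and d == 1:
            total += (0.0 - s) ** 2 / max(G(t), 1e-12)
        elif t > t_star:
            total += (1.0 - s) ** 2 / max(G(t_star), 1e-12)
    return total / len(times)
```

With no censoring, brier_score_censored reduces to the plain quadratic score, and the trivial constant prediction 0.5 indeed yields a score of 0.25.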
For the NPI, the Brier score at t* = 5 is equal to 0.184, which is not very much below 0.196, the value reached by the pooled Kaplan-Meier prediction Ŝ(t*) = 0.733 for all patients; remember that the Brier score is equal to 0.25 when the trivial prediction Ŝ(t*) = 0.5 is made for all patients. Table 9B summarizes the results of various measures of inaccuracy for the NPI, the simplified COX index, and the CART index. For all measures, the simplified COX index performs better than the NPI, and some further
Table 9B  Measures of Inaccuracy for Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Measure of inaccuracy    Pooled Kaplan-Meier    NPI      COX index    CART index
BS (t* = 5)              0.196                  0.184    0.179        0.175
LS (t* = 5)              0.580                  0.549    0.538        0.529
R²                       0.0%                   6.1%     8.7%         10.4%
improvement is achieved by the CART index. Relative to the prediction with the pooled Kaplan-Meier estimate for all patients, there is only a moderate gain in accuracy; the R²-measure of explained residual variation based on the Brier score just reaches 10.4% for the best prognostic classification scheme. In general, it has to be acknowledged that measures of inaccuracy tend to be large or, the other way round, R²-type values tend to be small, reflecting that predictions are far from perfect (117). In addition, it has to be mentioned that some overoptimism may still be present when a measure of inaccuracy is calculated on the same data from which the prognostic classification scheme was derived. We have assumed that the estimated probabilities Ŝ(t | j) of being event free up to time t have emerged from external sources. For the GBSG-4 study in node-negative breast cancer, this is by no means true and only pretended for illustrative purposes. Actually, the COX and the CART indices have been derived from the same data set. Even for the NPI, which has been proposed in the literature (73,74), we have estimated the event-free probabilities from our data set and used them as predictions. To reduce the resulting overoptimism, cross-validation and resampling techniques may be used in a similar way as for the estimation of error rates (118,119) or for the reduction of bias of effect estimates as outlined in Section IV. For definitive conclusions, however, the determination of measures of inaccuracy in an independent test data set is absolutely necessary (79).
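The R²-type measure is simply the relative reduction in score achieved over the pooled Kaplan-Meier prediction. A one-line Python sketch (our own naming), applied to the rounded values of Table 9B, recovers the published 6.1% and 8.7%; the CART value comes out near 10.7% rather than 10.4%, presumably because the published percentages were computed from unrounded scores.

```python
def r2_explained(score_model, score_pooled):
    """R2-type measure of explained residual variation: relative gain of a
    prognostic scheme over the pooled Kaplan-Meier prediction,
    1 - score_model / score_pooled."""
    return 1.0 - score_model / score_pooled
```

For example, r2_explained(0.184, 0.196) gives about 0.061 for the NPI Brier score.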
X. SAMPLE SIZE CONSIDERATIONS
If the role of a new prognostic factor is to be investigated, careful planning of an appropriate study is required. This includes an assessment of the power of the study in terms of sample size. An adequate analysis of the independent prognostic effect of a new factor has to be adjusted for the existing standard factors (4,120). With survival or event-free survival as the end point, this will often be done with the Cox proportional hazards model. Sample size and power formulae in survival analysis have been developed for randomized treatment comparisons. In the analysis of prognostic factors, however, the covariates included are expected to be correlated with the factor of primary interest. In this situation, the existing sample size and power formulae are not valid and may not be applied. In this section we give an extension of Schoenfeld's formula (121) to the situation in which a correlated factor is included in the analysis. We consider the situation in which we wish to study the prognostic relevance of a certain factor, denoted by Z1, in the presence of a second factor Z2, which can also be a score based on several other factors. The criterion of interest is survival or event-free survival of the patients. We assume that the analysis of
the main effects of Z1 and Z2 is performed with the Cox proportional hazards model given by

λ(t | Z1, Z2) = λ0(t) exp(β1 Z1 + β2 Z2)

where λ0(t) denotes an unspecified baseline hazard function and β1 and β2 are the unknown regression coefficients representing the effects of Z1 and Z2. For the sake of simplicity we assume that Z1 and Z2 are binary, with p = P(Z1 = 1) denoting the prevalence of Z1 = 1. The relative risk between the groups defined by Z1 is then given by θ1 = exp(β1). Assume that the effect of Z1 shall be tested by an appropriate two-sided test based on the partial likelihood derived from the Cox model, with significance level α and power 1 − β to detect an effect that is given by a relative risk of θ1. For independent Z1 and Z2, it was shown by Schoenfeld (121) that the total number of patients required is given by the following expression:

N = (u_{1−α/2} + u_{1−β})² / [ (log θ1)² ψ p(1 − p) ]
where ψ is the probability of an uncensored observation and u_γ denotes the γ quantile of the standard normal distribution. This is the same formula as that used for a comparison of two populations, as developed by George and Desu (122) for an unstratified and by Bernstein and Lagakos (123) for a stratified comparison of exponentially distributed survival times, and by Schoenfeld (124) for the unstratified logrank test. The sample size formula depends on p, the prevalence of Z1 = 1, which has to be taken into account. Obviously, the expected number of events (often also called the "effective sample size") required to achieve a prespecified power is minimal for p = 0.5, the situation of a randomized clinical trial with equal probabilities for treatment allocation. By using the same approximations as Schoenfeld (121), one can derive a formula also for the case when Z1 and Z2 are correlated with correlation coefficient ρ; for details we refer to Schmoor et al. (125). This formula reads

N = (u_{1−α/2} + u_{1−β})² / [ (log θ1)² ψ p(1 − p) ] ⋅ 1/(1 − ρ²)

The factor 1/(1 − ρ²) is usually called the variance inflation factor (VIF). This formula is identical to a formula derived by Lui (126) for the exponential regression model in the case of no censoring. Table 10 gives, for some situations, the value of the VIF and the effective sample size Nψ, that is, the number of events required to obtain a power of 0.8 to detect an effect of Z1 of magnitude θ1 as calculated by the formula given above. It shows that the required number
Table 10  Variance Inflation Factors and Effective Sample Size Nψ Required to Detect an Effect of Z1 of Magnitude θ1 with Power 0.8, as Calculated by the Approximate Sample Size Formula for Various Values of p, ρ, θ1 (α = 0.05)

                          Nψ
p      ρ      VIF     θ1 = 1.5    θ1 = 2    θ1 = 4
0.5    0      1       191         65        16
       0.2    1.04    199         68        17
       0.4    1.19    227         78        19
       0.6    1.56    298         102       26
0.3    0      1       227         78        19
       0.2    1.04    237         81        20
       0.4    1.19    271         93        23
       0.6    1.56    355         122       30
of events for the case of two correlated factors may increase by up to 50% in situations realistic in practice. Note that the formula given above is identical to that developed by Palta and Amini (127) for the situation in which the effect of Z1 is analyzed by a stratified logrank test where Z2 = 0 and Z2 = 1 define the two strata. The sample size formulae given above will now be illustrated by means of the GBSG-2 study. Suppose we want to investigate the influence of the progesterone receptor in the presence of tumor grade. The Spearman correlation coefficient of these two factors is ρ = −0.377; if they are categorized as binary variables, we find ρ = −0.248 from Table 11. Taking the prevalence of progesterone-positive tumors (p = 60%) into account, a number of 213 events is required to detect a relative risk of 0.67 and of 74 events to detect a relative risk of 0.5
Table 11  Distribution of Progesterone Receptor by Tumor Grade and Estrogen Receptor in the GBSG-2 Study

                            Tumor grade          Estrogen receptor
Progesterone receptor       1       2+3          <20      ≥20
<20                         5       264          190      79
≥20                         76      341          72       345
Correlation coefficient     −0.248               0.536
with a power of 80% (significance level α = 5%). In this situation, the variance inflation factor is equal to 1.07, indicating that the correlation between the two factors has only little influence on power and required sample sizes. If we want to investigate the prognostic relevance of progesterone receptor in the presence of estrogen receptor, a higher correlation has to be considered. The Spearman correlation coefficient is equal to ρ = 0.598 if both factors are measured on a quantitative scale and ρ = 0.536 if they are categorized into positive (>20 fmol) and negative (≤20 fmol), as given in Table 11. This leads to a variance inflation factor of 1.41 and required numbers of events of 284 and 97 to detect a relative risk of 0.67 and of 0.5, respectively (power = 80%, significance level α = 5%). This has to be contrasted with the situation in which both factors under consideration are uncorrelated; in this case the required number of events is 201 to detect a relative risk of 0.67 and 69 to detect a relative risk of 0.5, both with a power of 80% at a significance level of 5%. So from this aspect, the GBSG-2 study with 299 events does not seem too small to investigate the relevance of prognostic factors that exhibit at least a moderate effect (relative risk of 0.67 or 1.5). The question is whether it is large enough to permit the investigation of several prognostic factors. There have been some recommendations in the literature, based on practical experience or on results from simulation studies, regarding the events-per-variable relationship (58,128–131). More precisely, it is the number of events per model parameter that matters, a point which is often overlooked. These recommendations range from 10 to 25 events per model parameter to ensure stability of the selected model and of corresponding parameter estimates and to avoid serious overfitting. The sample size formula given above addresses the situation of two binary factors.
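The two sample size formulae above can be checked numerically. The sketch below is our own helper (stdlib only; the function name is ours); it reproduces the entries of Table 10, where ρ = 0 gives Schoenfeld's uncorrelated case.

```python
import math
from statistics import NormalDist

def n_events(theta1, p, rho=0.0, alpha=0.05, power=0.80):
    """Expected number of events ('effective sample size' N * psi) needed to
    detect a relative risk theta1 of a binary factor with prevalence p when a
    second binary factor, correlated with coefficient rho, is included:
        (u_{1-alpha/2} + u_{1-beta})^2 / ((log theta1)^2 p (1 - p)) * 1/(1 - rho^2).
    Dividing by psi, the probability of an uncensored observation, gives the
    total number of patients N."""
    u = NormalDist().inv_cdf  # standard normal quantile function
    base = (u(1 - alpha / 2) + u(power)) ** 2 / (math.log(theta1) ** 2 * p * (1 - p))
    return base / (1.0 - rho ** 2)
```

For example, round(n_events(1.5, 0.5)) gives the 191 events in the first row of Table 10, and round(n_events(1.5, 0.5, rho=0.6)) gives 298.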
For more general situations (i.e., factors occurring on several levels or factors with continuous distributions), the required sample size may be calculated using a more general formula that can be developed along the lines of Lubin and Gail (132). The anticipated situation then has to be specified in terms of the joint distribution of the factors under study and the size of the corresponding effects on survival. It may be more difficult to pose the necessary assumptions than in the situation of only two binary factors, but in principle it is possible to base the sample size calculation on that formula. Numerical integration techniques are then required to perform the necessary calculations. If several factors should be included in the analysis, one practical solution is to prespecify a prognostic score based on the existing standard factors and to consider this score as the second covariate to be adjusted for. Another possibility would be to adjust for the prognostic factor for which the largest effect on survival and the highest correlation are anticipated. Finally, it should be mentioned that a sample size formula for the investigation of interactive effects of two prognostic factors is also available (125,133,134).
XI. CONCLUDING REMARKS

In this chapter we consider statistical aspects of the evaluation of prognostic factors. In particular, we highlight the situation in which historical data might be available in a database. In contrast to therapeutic studies, where the comparison of treatments based on historical data is almost always a totally useless exercise (135,136), such data might provide a valuable source for studying the role of prognostic factors under certain conditions. The problems when dealing with historical data range from the probably insufficient quality of the data and the completeness of baseline and follow-up data to the heterogeneity of patients with respect to prognostic factors and therapy. Problems with completeness and quality of data can usually not be solved retrospectively; the only possibility is then a prospective collection of such data, including a regular follow-up. Problems with the heterogeneity of the patient population can at least partially be avoided by the definition of suitable inclusion and exclusion criteria, preferably in a study protocol similar to that which is common practice in prospective therapeutic studies. This has, for example, been done in a study on the role of DNA content in advanced ovarian cancer (137). Here it is desirable to define a relatively homogeneous study population and to avoid a nontransparent mixture of patients from different stages of a disease and with substantially different treatments. As far as the statistical analysis is concerned, a multivariate approach is absolutely essential. In prognostic factor studies in oncology where the end point is survival or event-free survival, the Cox regression model provides a flexible tool for such an analysis, but here, too, various problems have to be considered, requiring expert knowledge in medical statistics. By using data from three prognostic factor studies in breast cancer, some of these problems have been demonstrated in this chapter.
We have shown that various approaches for such a model-building process exist, including the method of classification and regression trees and ANNs. The most important requirements one should keep in mind are to arrive at models that are as simple and parsimonious as possible and to avoid serious overfitting. Only if these requirements are acknowledged can generalizability for future patients be achieved. Thus, validation in an independent study is an essential step in establishing a prognostic factor or a prognostic classification scheme. Some insight into the stability and generalizability of the derived models can be gained by cross-validation and resampling methods, which, however, cannot be regarded as a complete replacement for an independent validation study. In addition, further methodological research is needed on the use of such methods under various circumstances. The lack of appropriate validation studies, in combination with insufficient design considerations and inadequate statistical analyses resulting in serious overfitting, has led to conflicting results and failures to establish many
"new" and even "old" prognostic factors (138,139). Thus, careful planning of prognostic factor studies and a proper and thoughtful statistical analysis are essential prerequisites for achieving an improvement of the current situation. In illustrating various approaches, we were not consistent in the sense that we have not attempted a complete analysis of a particular study according to some generally accepted guidelines. We have rather shown the strengths and weaknesses of various approaches to protect against potential pitfalls. For a concrete study, the statistical analysis should be carefully planned step by step, and the model-building process should at least in principle be fixed in advance in a statistical analysis plan, as is required in much more detail for clinical trials according to international guidelines (140,141). The problem of adequate sample sizes for prognostic factor studies has not been fully appreciated in the past. In contrast to therapeutic studies, where one might argue that even very small differences associated with relative risks close to 1 are relevant for a comparison of therapies (142–144), one might accept the requirement that established prognostic factors should exhibit large relative risks. Thus, at first glance, studies on prognostic factors seem to require smaller numbers of patients. Three points have to be recognized, however, that are essential for the calculation of sample sizes. First, it is the expected number of events, and not the total number of patients, that constitutes the quantity of central importance, and this depends on the length and completeness of follow-up. Second, the distribution of the factor can be described by its prevalence, which might differ substantially from the optimal value of 0.5 usually arising in a randomized clinical trial.
For small values of the prevalence, a study has to be larger than a comparable therapeutic trial using the same value of the relative risk as a clinically relevant difference. Third, it is the number of factors under consideration, or better the number of model parameters, that should affect the size of a study. Practical experience and results from simulation studies suggest, as a rule of thumb, that studies with fewer than 10 to 25 events per factor (or parameter) cannot be considered an informative and reliable basis for the evaluation of prognostic factors. It also has to be recognized that the number of patients or events suitable for the final analysis might differ substantially from the number of patients available in a database when rigorous inclusion and exclusion criteria are applied. From these considerations it can be concluded that small or premature studies are not informative and cannot lead to an adequate assessment of prognostic factors. They can only create hypotheses in an explorative sense and might even lead to more confusion about the role of prognostic factors because of various sources of bias, including publication bias (145–148). To reach definitive conclusions, close cooperation between different centers or study groups, possibly leading to a meta-analysis type evaluation of prognostic factors (149), might be necessary. This approach would surely be associated with many additional
370
Schumacher et al.
difficulties, but it would help to avoid some of the problems due to publication bias (146,149,150). It would also encourage the use of standard prognostic models for particular entities or stages of cancer. In addition, the size of an independent validation study should be large enough to allow valid and definitive conclusions. Thus, the Freiburg DNA study that we used as a validation study several times throughout this contribution was far too small from this point of view and should serve for illustrative purposes only. A number of important topics have not been mentioned in this chapter, or only in passing. One of these topics is the handling of missing values in prognostic factors. We have always confined ourselves to a so-called complete case analysis, which leads to consistent estimates of the regression coefficients if some assumptions are met (151,152). However, this may not be a very efficient approach, especially if the missing rates are higher than in the three prognostic studies that we used for illustration. Thus, more sophisticated methods for dealing with missing values in some prognostic factors might be useful (152–154). When applying prognostic factors or prognostic classification schemes to future patients, one also has to be prepared that some factors may not have been measured. To arrive at a prediction of survival probabilities for such patients, surrogate definitions for the corresponding prognostic classification schemes are required. Throughout this chapter, we have also assumed that the effects of prognostic factors are constant over time and that prognostic factors are recorded and known at the time of diagnosis. These assumptions do not cover the situation of time-varying effects and of time-dependent covariates. If multiple end points or different events are of interest, the use of competing risk and multistate models may be indicated.
For these topics, which are also of importance for prognostic factor studies, we refer to more advanced textbooks in survival analysis (14,19,20,22) and current research papers. In general, the methods and approaches presented here have at least in part been selected and assessed according to the subjective views of the authors. Thus, other approaches might also be seen as useful and adequate. What should not be a matter of controversy, however, is the need for careful planning, conduct, and analysis of prognostic factor studies in order to arrive at generalizable and reproducible results that could contribute to a better understanding and possibly to an improvement of the prognosis of cancer patients.
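As a quick illustration of the sample size considerations above, the events-per-variable rule of thumb (10 to 25 events per candidate parameter) can be turned into a simple adequacy check. This is our own minimal sketch in plain Python; the function names and the example numbers are invented for illustration, only the threshold comes from the text.

```python
def events_per_variable(n_events, n_parameters):
    """Events per candidate model parameter (EPV); note that the
    relevant count is observed events, not enrolled patients."""
    return n_events / n_parameters

def study_is_informative(n_events, n_parameters, min_epv=10):
    # Rule of thumb from the text: fewer than 10 to 25 events per
    # factor (or parameter) is not a reliable basis for evaluating
    # prognostic factors; min_epv=10 is the lenient end of that range.
    return events_per_variable(n_events, n_parameters) >= min_epv

# Hypothetical study: 120 observed events, 15 candidate parameters.
print(events_per_variable(120, 15))       # 8.0
print(study_is_informative(120, 15))      # False
print(study_is_informative(120, 15, 25))  # False (strict end of range)
print(study_is_informative(250, 10))      # True (EPV = 25)
```

Note that after applying rigorous inclusion and exclusion criteria, `n_events` can be far smaller than the database suggests, which is exactly the situation the rule of thumb is meant to flag.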
ACKNOWLEDGMENTS

We thank our colleagues Erika Graf and Claudia Schmoor for valuable contributions and Regina Gsellinger for her assistance in preparing the manuscript.
Prognostic Factor Studies
371
REFERENCES 1. McGuire WL. Breast cancer prognostic factors: evaluation guidelines. J Natl Cancer Inst 1991; 83:154–155. 2. Infante-Rivard C, Villeneuve J-P, Esnaola S. A framework for evaluating and conducting prognostic studies: an application to cirrhosis of the liver. J Clin Epidemiol 1989; 42:791–805. 3. Clark GM, ed. Prognostic factor integration. Special issue. Breast Cancer Res Treat 1992; 22:185–293. 4. Simon R, Altman DG. Statistical aspects of prognostic factor studies in oncology. Br J Cancer 1994; 69:979–985. 5. Wyatt JC, Altman DG. Prognostic models: clinically useful or quickly forgotten? Commentary, Br Med J 1995; 311:1539–1541. 6. Armitage P, Gehan EA. Statistical methods for the identification and use of prognostic factors. Int J Cancer 1974; 13:16–36. 7. Byar DP. Analysis of survival data: Cox and Weibull models with covariates. In: Mike V, Stanley KE, eds. Statistics in Medical Research. New York: Wiley, 1982, pp. 365–401. 8. Byar DP. Identification of prognostic factors. In: Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1984, pp. 423–443. 9. Simon R. Use of regression models: statistical aspects. In Buyse ME, Staquet MJ, Sylvester RJ, eds. Cancer Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1984, pp. 444–466. 10. George SL. Identification and assessment of prognostic factors. Semin Oncol 1988; 5:462–471. 11. Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using ‘‘optimal’’ cutpoints in the evaluation of prognostic factors. Commentary. J Nat Cancer Inst 1994; 86:829–835. 12. Altman DG, De Stavola BL, Love SB, Stepniewska KA. Review of survival analyses published in cancer journals. Br J Cancer 1995; 72:511–518. 13. Hermanek P, Gospodarowicz MK, Henson DE, Hutter RVP, Sobin LH. Prognostic Factors in Cancer. Heidelberg–New York: Springer, 1995. 14. Marubini E, Valsecchi MG. 
Analysing Survival Data from Clinical Trials and Observational Studies. Chichester: Wiley, 1995. 15. Parmar MKB, Machin D. Survival Analysis: A Practical Approach. Chichester: Wiley, 1995. 16. Harris EK, Albert A. Survivorship Analysis for Clinical Studies. New York: Marcel Dekker, 1991. 17. Lee ET. Statistical Methods for Survival Data Analysis. 2nd ed. New York: Wiley, 1992. 18. Collett D. Modelling Survival Data in Medical Research. London: Chapman & Hall, 1994. 19. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer, 1997.
20. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: Wiley, 1980. 21. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York: Wiley, 1991. 22. Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Methods Based on Counting Processes. New York: Springer, 1992. 23. Burke HB, Henson DE. Criteria for prognostic factors and for an enhanced prognostic system. Cancer 1993; 72:3131–3135. 24. Henson DE. Future directions for the American Joint Committee on Cancer. Cancer 1992; 69:1639–1644. 25. Fielding LP, Fenoglio-Preiser CM, Freedman LS. The future of prognostic factors in outcome prediction for patients with cancer. Cancer 1992; 70:2367–2377. 26. Fielding LP, Henson DE. Multiple prognostic factors and outcome analysis in patients with cancer. Cancer 1993; 71:2426–2429. 27. Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Philadelphia: Lippincott-Raven, 1998. 28. Simon R. Patients subsets and variation in therapeutic efficacy. Br J Cancer 1982; 14:473–482. 29. Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41:361–372. 30. Byar DP. Assessing apparent treatment covariate interactions in randomized clinical trials. Stat Med 1985; 4:255–263. 31. Schmoor C, Ulm K, Schumacher M. Comparison of the Cox model and the regression tree procedure in analyzing a randomized clinical trial. Stat Med 1993; 12: 2351–2366. 32. Bloom HJG, Richardson WW. Histological grading and prognosis in primary breast cancer. Br J Cancer 1957; 2:359–377. 33. Pfisterer J, Kommoss F, Sauerbrei W, Menzel D, Kiechle M, Giese E, Hilgarth M, Pfleiderer A. DNA flow cytometry in node positive breast cancer: prognostic value and correlation to morphological and clinical factors. Anal Quant Cytol Histol 1995; 17:406–412. 34. 
Schumacher M, Bastert G, Bojar H, Hübner K, Olschewski M, Sauerbrei W, Schmoor C, Beyerle C, Neumann RLA, Rauschecker HF for the German Breast Cancer Study Group. Randomized 2 × 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. J Clin Oncol 1994; 12:2086–2093. 35. Schmoor C, Olschewski M, Schumacher M. Randomized and non-randomized patients in clinical trials: experiences with comprehensive cohort studies. Stat Med 1996; 15:263–271. 36. Sauerbrei W, Hübner K, Schmoor C, Schumacher M for the German Breast Cancer Study Group. Validation of existing and development of new prognostic classification schemes in node negative breast cancer. Breast Cancer Res Treat 1997; 42:149–163. [Correction. Breast Cancer Res Treat 1998; 48:191–192.] 37. Cox DR. Regression models and life tables (with discussion). J R Stat Soc Ser B 1972; 34:187–220. 38. Schumacher M, Holländer N, Sauerbrei W. Reduction of bias caused by model
building. Proceedings of the Statistical Computing Section, American Statistical Association, 1996, pp. 1–7. 39. Lausen B, Schumacher M. Maximally selected rank statistics. Biometrics 1992; 48:73–85. 40. Hilsenbeck SG, Clark GM. Practical P-value adjustment for optimally selected cutpoints. Stat Med 1996; 15:103–112. 41. Worsley KJ. An improved Bonferroni inequality and applications. Biometrika 1982; 69:297–302. 42. Lausen B, Sauerbrei W, Schumacher M. Classification and regression trees (CART) used for the exploration of prognostic factors measured on different scales. In: Dirschedl P, Ostermann R, eds. Computational Statistics. Heidelberg: Physica-Verlag, 1994, pp. 1483–1496. 43. Lausen B, Schumacher M. Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comput Stat Data Analysis 1996; 21:307–326. 44. Verweij P, Van Houwelingen HC. Cross-validation in survival analysis. Stat Med 1993; 12:2305–2314. 45. Schumacher M, Holländer N, Sauerbrei W. Resampling and cross-validation techniques: a tool to reduce bias caused by model building. Stat Med 1997; 16:2813–2827. 46. Van Houwelingen HC, Le Cessie S. Predictive value of statistical models. Stat Med 1990; 9:1303–1325. 47. Andersen PK. Survival analysis 1982–1991: the second decade of the proportional hazards regression model. Stat Med 1991; 10:1931–1941. 48. Sauerbrei W. The use of resampling methods to simplify regression models in medical statistics. Appl Stat 1999; 48:313–329. 49. Miller AJ. Subset Selection in Regression. London: Chapman and Hall, 1990. 50. Teräsvirta T, Mellin I. Model selection criteria and model selection tests in regression models. Scand J Stat 1986; 13:159–171. 51. Mantel N. Why stepdown procedures in variable selection? Technometrics 1970; 12:621–625. 52. Sauerbrei W. Comparison of variable selection procedures in regression models—a simulation study and practical examples. In: Michaelis J, Hommel G, Wellek S, eds. Europäische Perspektiven der Medizinischen Informatik, Biometrie und Epidemiologie. München: MMV, 1993, pp. 108–113. 53. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat 1994; 43:429–467. 54. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A 1999; 162:71–94. 55. Sauerbrei W, Royston P, Bojar H, Schmoor C, Schumacher M, and the German Breast Cancer Study Group (GBSG). Modelling the effects of standard prognostic factors in node positive breast cancer. Br J Cancer 1999; 79:1752–1760. 56. Hastie TJ, Tibshirani RJ. Generalized Additive Models. New York: Chapman and Hall, 1990. 57. Valsecchi MG, Silvestri D. Evaluation of long-term survival: use of diagnostics
and robust estimators with Cox's proportional hazards model. Stat Med 1996; 15:2763–2780. 58. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996; 15:361–387. 59. Chen CH, George SL. The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model. Stat Med 1985; 4:39–46. 60. Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med 1989; 8:771–783. 61. Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med 1992; 11:2093–2109. 62. Breiman L, Friedman JH, Olshen R, Stone CJ. Classification and Regression Trees. Monterey: Wadsworth, 1984. 63. Zhang H, Crowley J, Sox HC, Olshen R. Tree-structured statistical methods. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Chichester: Wiley, 1998, pp. 4561–4573. 64. Gordon L, Olshen R. Tree-structured survival analysis. Cancer Treat Rep 1985; 69:1065–1069. 65. LeBlanc M, Crowley J. Relative risk regression trees for censored survival data. Biometrics 1992; 48:411–425. 66. LeBlanc M, Crowley J. Survival trees by goodness of split. JASA 1993; 88:457–467. 67. Segal MR. Regression trees for censored data. Biometrics 1988; 44:35–47. 68. Segal MR. Tree-structured survival analysis in medical research. In: Everitt BS, Dunn G, eds. Statistical Analysis of Medical Data: New Developments. London: Arnold, 1998, pp. 101–125. 69. Ciampi A, Hendricks L, Lou Z. Tree-growing for the multivariate model: the RECPAM approach. In: Dodge Y, Whittaker J, eds. Computational Statistics. Vol 1. Heidelberg: Physica-Verlag, 1992. 70. Tibshirani R, LeBlanc M. A strategy for binary description and classification. J Comput Graph Stat 1992; 1:3–20. 71. Sauerbrei W. On the development and validation of classification schemes in survival data. In: Klar R, Opitz O, eds. Classification and Knowledge Organization. Berlin, Heidelberg, New York: Springer, 1997, pp. 509–518. 72. Schmoor C, Schumacher M. Methodological arguments for the necessity of randomized trials in high-dose chemotherapy for breast cancer. Breast Cancer Res Treat 1999; 54:31–38. 73. Haybittle JL, Blamey RW, Elston CW, Johnson J, Doyle PJ, Campbell FC, Nicholson RI, Griffiths K. A prognostic index in primary breast cancer. Br J Cancer 1982; 45:361–366. 74. Galea MH, Blamey RW, Elston CE, Ellis IO. The Nottingham Prognostic Index in primary breast cancer. Breast Cancer Res Treat 1992; 22:207–219. 75. Balslev I, Axelsson CK, Zedeler K, Rasmussen BB, Carstensen B, Mouridsen HT. The Nottingham Prognostic Index applied to 9,149 patients from the studies of the Danish Breast Cancer Cooperative Group (DBCG). Breast Cancer Res Treat 1994; 32:281–290.
76. Collett K, Skjaerven R, Machle BO. The prognostic contribution of estrogen and progesterone receptor status to a modified version of the Nottingham Prognostic Index. Breast Cancer Res Treat 1998; 48:1–9. 77. Copas JB. Using regression models for prediction: shrinkage and regression to the mean. Stat Methods Med Res 1997; 6:167–183. 78. Vach W. On the relation between the shrinkage effect and a shrinkage method. Comput Stat 1997; 12:279–292. 79. Ripley BD. Pattern Recognition and Neural Networks. Cambridge: University Press, 1996. 80. Baxt WG. Application of artificial neural networks to clinical medicine. Lancet 1995; 346:1135–1138. 81. Cross SS, Harrison RF, Kennedy RL. Introduction to neural networks. Lancet 1995; 346:1075–1079. 82. Dybowski R, Gant V. Artificial neural networks in pathology and medical laboratories. Lancet 1995; 346:1203–1207. 83. Wyatt J. Nervous about artificial neural networks? Lancet 1995; 346:1175– 1177. 84. Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Stat Med 2000; 19:541– 561. 85. Penny W, Frost D. Neural networks in clinical medicine. Med Decis Making 1996; 16:386–398. 86. Cheng B, Titterington DM. Neural networks: a review from a statistical perspective (with discussion). Stat Sci 1994; 9:2–54. 87. Ripley BD. Statistical aspects of neural networks. In: Barndorff Nielsen OE, Jensen JL, eds. Networks and Chaos—Statistical and Probabilistic Aspects. London: Chapman and Hall, 1993. 88. Schumacher M, Rossner R, Vach W. Neural networks and logistic regression. Part I. Comput Stat Data Anal 1996; 21:661–682. 89. Stem HS. Neural networks in applied statistics (with discussion). Technometrics 1996; 38:205–220. 90. Warner B, Misra M. Understanding neural networks as statistical tools. Am Stat 1996; 50:284–293. 91. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks 1989; 2:359–366. 92. 
Ripley RM. Neural network models for breast cancer prognosis. Ph.D. dissertation, University of Oxford, Dept. of Engineering Science, Oxford, 1998. 93. Liestøl K, Andersen PK, Andersen U. Survival analysis and neural nets. Stat Med 1994; 13:1189–1200. 94. Ravdin PM, Clark GM. A practical application of neural network analysis for predicting outcome on individual breast cancer patients. Breast Cancer Res Treat 1992; 22:285–293. 95. Ravdin PM, Clark GM, Hilsenbeck SG, Owens MA, Vendely P, Pandian MR, McGuire WL. A demonstration that breast cancer recurrence can be predicted by neural network analysis. Breast Cancer Res Treat 1992; 21:47–53. 96. De Laurentiis M, Ravdin PM. Survival analysis of censored data: neural network
analysis detection of complex interactions between variables. Breast Cancer Res Treat 1994; 32:113–118. 97. Burke HB. Artificial neural networks for cancer research: outcome prediction. Semin Surg Oncol 1994; 10:73–79. 98. McGuire WL, Tandon AK, Allred DC, Chamness GC, Ravdin PM, Clark GM. Treatment decisions in axillary node-negative breast cancer patients. Monogr Nat Cancer Inst 1992; 11:173–180. 99. Kappen HJ, Neijt JP. Neural network analysis to predict treatment outcome. Ann Oncol 1993; 4(suppl):31–34. 100. Faraggi D, Simon R. A neural network model for survival data. Stat Med 1995; 14:73–82. 101. Biganzoli E, Boracchi P, Mariani L, Marubini E. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. Stat Med 1998; 17:1169–1186. 102. Ripley BD, Ripley RM. Neural networks as statistical methods in survival analysis. In: Dybowski R, Gant V, eds. Clinical Applications of Artificial Neural Networks. New York: Cambridge University Press, 2001 (in press). 103. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med 1999; 18:2529–2545. 104. Parkes MC. Accuracy of predictions of survival in later stages of cancer. Br Med J 1972; 264:29–31. 105. Forster LA, Lynn J. Predicting life span for applicants to inpatient hospice. Arch Intern Med 1988; 148:2540–2543. 106. Maltoni M, Pirovano M, Scarpi E, Marinari M, Indelli M, Arnoldi E, Galluci M, Frontini L, Piva L, Amadori D. Prediction of survival of patients terminally ill with cancer. Cancer 1995; 75:2613–2622. 107. Henderson R, Jones M. Prediction in survival analysis: model or medic. In: Jewell NP, Kimber AC, Ting Lee ML, Withmore GA, eds. Lifetime Data: Models in Reliability and Survival Analysis. Dordrecht: Kluwer Academic Publishers, 1995. 108. Henderson R. Problems and prediction in survival-data analysis. Stat Med 1995; 3:143–152. 109. Brier GW. Verification of forecasts expressed in terms of probability. Monthly Weather Rev 1950; 78:1–3. 110. Hilden J, Habbema JDF, Bjerregard B. The measurement of performance in probabilistic diagnosis III: methods based on continuous functions of the diagnostic probabilities. Methods Inform Med 1978; 17:238–246. 111. Hand DJ. Construction and Assessment of Classification Rules. Chichester: Wiley, 1997. 112. Shapiro AR. The evaluation of clinical predictions. N Engl J Med 1977; 296:1509–1514. 113. Korn EJ, Simon R. Explained residual variation, explained risk and goodness of fit. Am Stat 1991; 45:201–206. 114. Schemper M. The explained variation in proportional hazards regression. Biometrika 1990; 77:216–218. [Correction. Biometrika 1994; 81:631.]
115. Graf E, Schumacher M. An investigation on measures of explained variation in survival analysis. Statistician 1995; 44:497–507. 116. Schemper M, Stare J. Explained variation in survival analysis. Stat Med 1996; 15:1999–2012. 117. Ash A, Schwartz M. R2: a useful measure of model performance when predicting a dichotomous outcome. Stat Med 1999; 18:375–384. 118. Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. JASA 1983; 78:316–330. 119. Efron B, Tibshirani R. Improvement on cross-validation: the .632+ bootstrap method. JASA 1997; 92:548–560. 120. Fayers PM, Machin D. Sample size: how many patients are necessary? Br J Cancer 1995; 72:1–9. 121. Schoenfeld DA. Sample size formula for the proportional-hazard regression model. Biometrics 1983; 39:499–503. 122. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis 1974; 27:15–24. 123. Bernstein D, Lagakos SW. Sample size and power determination for stratified clinical trials. J Stat Comput Sim 1978; 8:65–73. 124. Schoenfeld DA. The asymptotic properties of nonparametric tests for comparing survival distribution. Biometrika 1981; 68:316–319. 125. Schmoor C, Sauerbrei W, Schumacher M. Sample size considerations for the evaluation of prognostic factors in survival analysis. Stat Med 2000; 19:441–452. 126. Lui K-J. Sample size determination under an exponential model in the presence of a confounder and type I censoring. Control Clin Trials 1992; 13:446–458. 127. Palta M, Amini SB. Consideration of covariates and stratification in sample size determination for survival time studies. J Chronic Dis 1985; 38:801–809. 128. Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis. I. Background, goals and general strategy. J Clin Epidemiol 1995; 48:1495–1501. 129. Peduzzi P, Concato J, Feinstein AR, Holford TR.
Importance of events per independent variable in proportional hazards analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48:1503–1510. 130. Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modeling strategies for improved prognostic prediction. Stat Med 1984; 3:143–152. 131. Harrell FE, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: advantages, problems and suggested solutions. Cancer Treat Rep 1985; 69:1071–1077. 132. Lubin JH, Gail MJ. On power and sample size for studying features of the relative odds of disease. Am J Epidemiol 1990; 131:551–566. 133. Peterson B, George SL. Sample size requirements and length of study for testing interaction in a 2 × k factorial design when time-to-failure is the outcome. Control Clin Trials 1993; 14:511–522. 134. Olschewski M, Schumacher M, Davis K. Analysis of randomized and nonrandomized patients in clinical trials using the comprehensive cohort follow-up study design. Control Clin Trials 1992; 13:226–239.
135. Dambrosia JM, Ellenberg JH. Statistical considerations for a medical data base. Biometrics 1980; 36:323–332. 136. Green SB, Byar DP. Using observational data from registries to compare treatments: the fallacy of omnimetrics. Stat Med 1984; 3:361–370. 137. Pfisterer J, Kommoss F, Sauerbrei W, Renz H, duBois A, Kiechle-Schwarz M, Pfleiderer A. Cellular DNA content and survival in advanced ovarian cancer. Cancer 1994; 74:2509–2515. 138. Hilsenbeck SG, Clark GM, McGuire WL. Why do so many prognostic factors fail to pan out? Breast Cancer Res Treat 1992; 22:197–206. 139. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000; 19:453–473. 140. European Community, CPMP Working Party on Efficacy of Medicinal Products. Biostatistical methodology in clinical trials in applications for marketing authorisations for medicinal products. Note for Guidance III/3630/92-EN, December 1994. Stat Med 1995; 14:1659–1682. 141. International Conference of Harmonisation and Technical Requirements for Registration of Pharmaceuticals for Human Drugs. ICH Harmonised Tripartite Guideline: Guideline for Good Clinical Practice. Recommended for Adoption at Step 4 of the ICH Process on 1 May 1996 by the ICH Steering Committee. 142. Peto R. Clinical trial methodology. Biomed 1978; 28:24–36. 143. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? (with discussion). Stat Med 1984; 3:402–422. 144. Lubsen J, Tijssen GP. Large trials with simple protocols. Indications and contraindications. Control Clin Trials 1987; 10:151–160. 145. Simes RJ. Publication bias: the case for an international registry of clinical trials. J Clin Oncol 1986; 4:1529–1541. 146. Begg CB, Berlin A. Publication bias: a problem of interpreting medical data. J R Stat Soc Ser A 1988; 151:419–463. 147. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet 1991; 337:867–872. 148. Stern JM, Simes RJ. 
Publication bias: evidence of delayed publication in a cohort study of clinical research projects. Br Med J 1997; 315:640–645. 149. Simon R. Meta-analysis and cancer clinical trials. Principles Pract Oncol 1991; 5: 1–9. 150. Felson DT. Bias in meta-analytic research. J Clin Epidemiol 1992; 45:885–892. 151. Vach W. Logistic Regression with Missing Values in the Covariates. Lecture Notes in Statistics 86. New York: Springer 1994. 152. Vach W. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med 1997; 16:57–72. 153. Robins JM, Rotnitzky A, Zhao LD. Estimation of regression coefficients when some regressors are not always observed. JASA 1994; 89:846–866. 154. Lipsitz SR, Ibrahim JG. Estimating equations with incomplete categorical covariates in the Cox model. Biometrics 1998; 54:1002–1013.
18
Statistical Methods to Identify Prognostic Factors

Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger
Technical University of Munich, Munich, Germany
I. INTRODUCTION
In recent years the search for prognostic factors has attracted increasing attention in medicine, especially in oncology and cardiology. Two reasons are mainly responsible for this trend. First, one is interested in gaining more insight into the development of the disease (e.g., into the tumor biology). Second, there is a tendency away from a more or less uniform therapy toward an individualized therapy. An improved estimate of prognosis can also be used to inform a patient more precisely about the further course of the disease. Further reasons to explore prognostic factors are discussed by Byar (1). Until now, risk stratification in oncology has been based mainly on conventional factors of tumor staging (TNM classification: UICC [2]) such as local tumor invasion or size, lymph node status, and metastasis status. But the capacity of these factors to answer the questions mentioned above is limited. For a better stratification of patients for prognosis and therapy, the staging system needs to be more sophisticated and new factors have to be identified. For example, in breast cancer, presumably one of the leading fields in this research area, about 100 new factors are under discussion (3). One of the related topics is the question of adjuvant therapy, especially who should be treated. The problems are not restricted to breast cancer; the same problems arise for other tumor sites and in other medical disciplines (e.g., in stomach cancer [4,5] or in cardiology [6]). Stomach cancer belongs to the category of tumors for which the benefit of adjuvant chemotherapy has not been proven. The search for important factors is a great challenge in medicine. The appropriate analysis, on the other hand, is by far not a routine task for statisticians. Just as traditional factors have mainly been used in medicine, classic statistical methods like logistic or Cox regression have been used in statistics for decades. Parallel to the development of new prognostic factors in medicine, new statistical tools have been described. The aim of this chapter is to summarize and highlight some of these new developments. Data on patients with stomach cancer are used to illustrate these new methods. Recently, Harrell et al. (7) proposed a system that can be used to identify important factors. The proposals made here contain some other features and are focused on answering the two questions mentioned at the beginning.
II. METHODS

A. Classic Method
The example used to illustrate the new developments contains censored data. For this type of data, the Cox model is mostly used in the literature. If the time of follow-up is denoted by t and the potential prognostic factors by Z = (Z_1, . . . , Z_p), the model is usually given in the following form (8):

λ(t | Z) = λ(t | Z = 0) · exp(∑ β_j Z_j) = λ_0(t) · exp(∑ β_j Z_j)    (1)

with λ(t | Z) being the hazard rate at time t given the factors Z. Throughout the chapter, the ratio of the hazard functions, or its logarithm, is considered very often:

ln [λ(t | Z) / λ_0(t)] = ∑_{j=1}^{p} β_j Z_j    (2)
which is also denoted the relative hazard and can be interpreted as the logarithm of the relative risk (ln RR).
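The linear predictor in Eq. (2) can be sketched in a few lines of plain Python: given estimated coefficients β and a covariate vector Z, their inner product is the log relative risk, and its exponential is the hazard ratio relative to the baseline Z = 0 in Eq. (1). The function names and coefficient values below are our own, purely illustrative choices, not estimates from the stomach cancer example.

```python
import math

def log_relative_risk(beta, z):
    """ln RR = sum_j beta_j * z_j, the linear predictor of Eq. (2)."""
    return sum(b * x for b, x in zip(beta, z))

def hazard_ratio(beta, z):
    """lambda(t | Z) / lambda_0(t) under the Cox model, Eq. (1);
    constant over t because of the proportional hazards assumption."""
    return math.exp(log_relative_risk(beta, z))

# Hypothetical fit with two factors: a binary factor (coefficient 0.7)
# and a continuous factor (coefficient -0.2 per unit).
beta = [0.7, -0.2]
patient = [1, 2.5]                        # factor 1 present, factor 2 = 2.5
print(log_relative_risk(beta, patient))   # ln RR ~ 0.2
print(hazard_ratio(beta, patient))        # hazard ratio ~ 1.22
```

Note that the baseline hazard λ_0(t) cancels out of the ratio, which is why the hazard ratio can be computed from the coefficients alone.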
B. Linearity Assumption

In the simplest form, each continuous factor Z is assumed, maybe after some transformation, to be linearly related to the outcome. There are several proposals in the literature on how to check the assumption of linearity. One approach is to replace the linear relationship β · Z by a functional form β(Z). We want to mention two methods to estimate β(Z). The first is not to specify the function β(Z) at all; the only assumption is that β(Z) has to be smooth. This leads to smoothing splines (9). The estimation of β(Z) can be described in the context of a penalized log-likelihood function L_p(β):

L_p(β) = 2 · l(β) − λ · P(β)    (3)
where l(β) is the usual log-likelihood function, P(β) is a roughness penalty, penalizing deviations from smooth functions, and λ is the weight of the penalty. Using the integrated squared second derivative as roughness penalty, P(β) = ∫ (β″(Z))² dZ, maximization of relation (3) leads to natural cubic splines (10). The problem is the appropriate choice of λ. The main obstacles to wider application of this approach are the need for special software like S-PLUS and the lack of a simple statistic to test whether β(Z) is equal to β · Z or different. A second option is the use of fractional polynomials (11). The idea is simply to construct a function β(Z) consisting of up to a few polynomials of the form Z^{p_i}, with p_i ∈ {−2, −1.5, −1, −0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3}. Either one or two components seem to be sufficient for most practical applications, i.e., i ∈ {1, 2}. The advantage of this approach is the representation of β(Z) in a functional form. If two polynomials are used, a variety of functional relationships can be described. In Figure 1 the results of both methods, smoothing splines and fractional polynomials, are shown for one of the factors, PAI-1, of the example used in Section III. There is only a slight difference between the results of the two methods. The deviation from linearity is obvious. The advantages of fractional polynomials are that standard software packages, like SPSS or SAS, and common test statistics can be used for their determination.

C. Proportional Hazards Assumption

One of the basic assumptions in using the Cox model is that of proportional hazards. The effect of a certain factor is assumed to be constant over the total follow-up period. On the other hand, it is more natural to assume a change (e.g., a decrease in the effect as time is prolonged). Considering one factor Z, the idea
Figure 1 Influence of a prognostic factor (PAI-1) on the relative risk, plotted on an ln scale. The results of assuming a linear relationship (· · · ·), a smoothing spline (——) including the 95% CI (---), and fractional polynomials (— —) are shown.
is to extend the linear term β · Z to γ(t) · Z. Now the influence of a certain factor Z on the hazard ratio can be described as a function of time. One way to simplify the analysis is to restrict this form of relationship to binary variables; otherwise, some form of relationship between γ(t) and Z has to be assumed. The hypothesis of interest is whether γ(t) is constant or not (H_0: γ(t) = γ_0). There is a long history of extending the classical Cox model. The first approach, by Cox himself, was based on using some predefined functions, e.g., a linear function (γ(t) = t) or a log-function (γ(t) = ln t). Over the years several proposals have been made to extend this approach. One approach is again the use of smoothing splines (12) to analyze the time-varying effect of a certain factor Z. As an alternative, fractional polynomials can also be used, by defining γ(t) = ∑ β_i · t^{p_i}. The advantages of fractional polynomials compared with smoothing splines are again the use of standard software packages and the direct availability of a simple test statistic. For the analysis, estimation methods for regression models
Statistical Methods to Identify Prognostic Factors
Figure 2 The time-varying effect of age (≤65 years vs. older) using fractional polynomials (γ(t) = −1.28 + 4.94/√t) is shown.
with time-dependent covariates can be applied after performing the transformations Xᵢ(t) = Z·t^pi. According to the selected γ(t), the values Xᵢ(t) have to be calculated at all observed failure times. Figure 2 shows an example with a time-varying effect. In this example the influence of age on the survival rate is considered (for details, see Sect. III); age is divided into two groups at the median of 65 years. In all the analyses, a decrease of the effect during extended follow-up can be seen.

D. Combination of β(Z) and γ(t) into One Model

In the context of a regression model, both extensions can be combined into a model of the form
λ(t|Z, Z*) = λ₀(t) · exp{∑ βⱼ(Zⱼ) + ∑ γⱼ(t)·Z*ⱼ}    (4)
where Z denotes all the continuous factors and Z* all the binary covariates, either binary in a natural way, like gender, or coded. Within this model the influence of the factors on the event rate can be investigated in greater detail than with the classical Cox model.

E. Selection Procedure
We briefly describe one option for the selection of factors in the context of fractional polynomials. In a first step, the "optimal" choices for β(Z) and γ(t) are identified in univariate analyses. For the division of a continuous factor Z into a binary variable Z*, two options are possible: the use of predefined cutpoints, or the selection of "optimal" cutpoints based on maximally selected test statistics. The second choice is associated with an inflated p value; however, there is no proof of whether "optimal" binary coding has any influence on the selection of γ(t). In a stepwise forward procedure, the factor Zⱼ or Z*ⱼ that provides the best fit is selected, based on the likelihood ratio statistic, taking into account the degrees of freedom for β(Z) or γ(t). An alternative is the criterion proposed by Akaike, AIC = Dev + 2·ν, with Dev the deviance and ν the number of parameters used to describe β(Z) or γ(t). Selection in connection with smoothing splines is described elsewhere (13); we concentrate here on fractional polynomials. However, model (4) does not directly answer how to classify a patient into a certain risk group. One approach is to divide the prognostic index PI = ∑ βⱼ(Zⱼ) + ∑ γⱼ(t)·Z*ⱼ into certain intervals; the problem is of course the choice of cutpoints. Therefore another approach, the CART method, has attracted great attention, at least in the medical community (14).

F. CART Method
The idea of this method is simply to split the whole data set into the two subsamples with the greatest difference in outcome (e.g., the survival rate). For this split, all factors with all possible divisions into two groups are considered. Once a split is performed, both subsamples are further analyzed independently in the same way, until no further split is warranted, for example because the difference is too small or the number of patients in a subsample is too low. For performing a split, a certain test statistic has to be selected; for failure time data, the log-rank test is often used. The CART method results in a set of so-called terminal nodes, containing subgroups of patients at different risks. Clinicians can then identify the subgroups where different therapies should be applied. There
are several proposals on how to define the "optimal" tree (e.g., the one with the lowest misclassification rate). One way to get an optimal tree is to prune it (15): a large tree is constructed and afterward cut back. Another problem is related to the selection of the optimal splits. Usually there is a mixture of continuous and discrete factors, and it is well known that the test statistic is inflated when a continuous factor is split at the best of all possible cutpoints, the so-called maximally selected test statistic. To make a fair comparison, the p value has to be adjusted, either by a permutation test or by correction formulas (16).
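The maximally selected test statistic and its permutation-based adjustment can be sketched in a few lines. The following is illustrative numpy code, not the authors' implementation: it scans all cutpoints of a continuous factor, takes the largest two-sample log-rank chi-square, and adjusts the p value by repeating the scan on permuted covariates.

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-sample log-rank chi-square statistic (the split criterion
    commonly used in CART for failure time data)."""
    time, event, group = map(np.asarray, (time, event, group))
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        o_minus_e += d1 - d * n1 / n          # observed minus expected in group 1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e**2 / var if var > 0 else 0.0

def max_selected_split(time, event, z, n_perm=100, seed=0):
    """Best cutpoint of continuous z by the maximally selected log-rank
    statistic, with a permutation-adjusted p value."""
    rng = np.random.default_rng(seed)
    cuts = np.unique(z)[:-1]                  # candidate splits: z <= c vs. z > c
    def best(zv):
        return max(logrank_stat(time, event, (zv > c).astype(int)) for c in cuts)
    observed = best(np.asarray(z))
    perm = [best(rng.permutation(z)) for _ in range(n_perm)]
    p_adjusted = float(np.mean([s >= observed for s in perm]))
    return observed, p_adjusted
```

The adjusted p value is simply the fraction of permuted data sets whose best split beats the observed best split, which accounts for the multiplicity of candidate cutpoints.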
III. RESULTS

A. Description of the Data

The study contains data on 295 completely resected patients with stomach cancer who underwent curative surgery between 1987 and 1996 (17). The follow-up period ranges between 3 months and 11 years (median, 41 months). So far, 108 patients have died. Figure 3 shows the survival curve for the whole sample. In addition to the traditional factors (TNM classification), new factors such as uPA and PAI-1 were investigated (17). Table 1 gives a short description of the prognostic factors used in the analysis. The classic Cox model gives the results shown in Table 2; the continuous factors were not divided into categories. In the multivariate analysis, the percentage of positive lymph nodes (NODOS.PR) and the local tumor invasion (T.SUB) (18) turned out to be statistically significantly correlated with the survival rate.

B. Results Using Fractional Polynomials

1. Univariate Analysis

The first step was used to identify "optimal" functions for β(Z) and γ(t). The results for β(Z) can be seen in Table 3; only the continuous factors are included in this analysis. Three of the five continuous factors (NODOS.PR, uPA, and PAI-1) show an association with the event rate, comparable with the classic Cox model; however, the form of the relationship is better described in a nonlinear way. Two factors (AGE and NODOS.GE) show no association with the log hazard ratio, even with a more flexible form of relationship. For the analysis of time-varying effects, all continuous factors Z were changed into binary variables Z*. The transformation was based either on predefined cutpoints (AGE, NODOS.PR, and NODOS.GE) or on optimized cutpoints
Figure 3 Survival curve for the total sample (n = 295, 108 deaths) of patients with stomach cancer.
(uPA and PAI-1). The results of the univariate analysis of the time-varying effects γ(t) can be seen in Table 4. For AGE, T.SUB, METAS, DIFF, and NODOS.GE there is a significant change in the effect during follow-up.

2. Multivariate Analysis

In the multivariate analysis, model (4) has been considered. In a stepwise forward procedure, all the results from the univariate analyses, whether for β(Z) or for γ(t), have been included (Table 5). This means the functional forms remained unchanged; only the parameters are newly estimated in the multivariate analysis. The selection procedure is based on the likelihood ratio statistic, taking into account the degrees of freedom. The percentage of positive lymph nodes is the most important factor, showing a constant effect over time. The value of the likelihood function (−2·ln L)
Table 1 Prognostic Factors Analyzed in the Stomach Cancer Study

Factor      Range        Coding                        Interpretation
AGE         28–90        0: ≤65 (median); 1: >65       Age at surgery
NODOS.PR    0–97         0: <20; 1: ≥20                Percentage of positive lymph nodes
T.SUB       1–7          0: ≤4 (cutoff); 1: >4         Local tumor invasion (Japanese staging system [17]); cutoff = lamina subserosa
METAS       yes/no       0: no; 1: yes                 Lymph node metastasis (no. 12, 13 of comp. III [17])
DIFF        1–4          0: 1, 2 (cutoff); 1: 3, 4     Grading
NODOS.GE    6–105        0: ≤42 (median); 1: >42       Total number of removed lymph nodes
uPA         0.02–20.57   0: ≤5.94 (cutoff); 1: >5.94   Urokinase-type plasminogen activator
PAI-1       0.02–264.62  0: <4.13 (cutoff); 1: ≥4.13   Plasminogen activator inhibitor type 1
Table 2 Result of the Analysis of the Stomach Cancer Study Using the Classic Cox Model

                         Univariate            Multivariate
Factor                   e^β     p Value       e^β     p Value
AGE                      1.01    0.41          1.01    0.34
NODOS.PR                 50.6    <0.001        19.2    <0.001
T.SUB (5–7 vs. 1–4)      4.7     <0.001        2.7     <0.001
METAS (yes vs. no)       3.9     <0.001        1.3     0.30
DIFF (3,4 vs. 1,2)       1.7     0.02          1.1     0.64
NODOS.GE                 1.0     0.48          1.0     0.77
uPA                      1.1     <0.001        1.1     0.11
PAI-1                    1.1     0.004         1.0     0.43
Table 3 "Optimal" Choices for the Fractional Polynomials β(Z) in the Univariate Analysis (Only Continuous Factors Are Included)

Factor      β(Z)                       p(H₀: β(Z) = 0)
AGE         0.01·Z                     0.35
NODOS.PR    −0.12·Z⁻¹ + 1.97·Z²        <0.001
NODOS.GE    −113·Z⁻²                   0.12
uPA         −0.17·Z⁻² + 0.01·Z²        <0.001
PAI-1       −1.99/√Z                   <0.001
Table 4 "Optimal" Choices for the Fractional Polynomials γ(t) Analyzing the Time-Varying Effect of All Dichotomized Factors (Univariate Analysis)

Factor      γ(t)                                  p(H₀: γ(t) = γ₀)
AGE         −1.3 + 4.9/√t                         0.001
NODOS.PR    Constant                              —
T.SUB       1.0 + 0.01·t² − 0.003·t²·ln t         0.04
METAS       1.8 + 84.5/t² − 68.7·(ln t)/t²        0.01
DIFF        −0.09 − 0.09/t² + 37.8·(ln t)/t²      0.01
NODOS.GE    0.7 − 0.001·t²                        0.001
uPA         Constant                              —
PAI-1       Constant                              —
Table 5 Multivariate Analysis: Result of the Stepwise Selection Procedure

Step    Factor      β(Z) or γ(t)*    Likelihood ratio statistic R    p
1       NODOS.PR    β₁(Z)            115                             <0.001
2       T.SUB       γ₂(t)            20                              <0.001
3       AGE         γ₃(t)            13                              0.002
4       NODOS.GE    γ₄(t)            10                              0.007
5       PAI-1       β₅(Z)            7                               0.008
Total                                165

* β₁(Z) = −0.09/Z + 1.84·Z²; γ₂(t) = −0.09 + 0.01·t² − 0.004·t²·ln t; γ₃(t) = −0.96 + 4.44/√t; γ₄(t) = 0.62 − 0.001·t²; β₅(Z) = −1.05/√Z.
Figure 4 Results of the multivariate analysis: (a) time-varying effects γ(t) for T.SUB, AGE, and NODOS.GE; (b) β(Z) for NODOS.PR (b1) and PAI-1 (b2).
is increased by R = 115. The second factor selected is the local tumor invasion (T.SUB), showing a strong time-varying effect (R = 20); the influence of T.SUB increases within the first 3 years and then decreases. The next factor selected is age, with a dynamic effect (R = 13). Shortly after surgery the older patients (65 years and older) had a higher mortality rate, but the difference declines, and after about 2 years of follow-up the situation reverses and the younger patients seem to have the higher risk. The fourth factor selected is the total number of lymph nodes removed (R = 10). There is again a change of the effect over time: the patients with 42 or more lymph nodes removed have the higher risk at the beginning, but about 2 years after surgery the risk in the group with fewer lymph nodes removed is increasing. Finally, the effect of PAI-1 is considered important (R = 7); its influence is constant over time, but it enters the model nonlinearly. In contrast to the result of the classical Cox model, additional effects of AGE, NODOS.GE, and PAI-1 could be identified. The results can be seen in Figure 4.
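The reversal of the age effect at about 2 years can be read directly off the selected fractional polynomial γ₃(t) = −0.96 + 4.44/√t from Table 5. A small check, assuming that t is measured in months (an assumption consistent with the reported median follow-up of 41 months):

```python
import math

# gamma_3(t) from Table 5: time-varying log hazard ratio for AGE (>65 vs. <=65).
def gamma3(t):
    return -0.96 + 4.44 / math.sqrt(t)

# The effect changes sign where gamma_3(t) = 0, i.e., sqrt(t) = 4.44 / 0.96.
t_cross = (4.44 / 0.96) ** 2
print(round(t_cross, 1))   # 21.4 months, i.e., roughly 2 years
```

So the older group carries the excess risk early on (γ₃(t) > 0 before the crossing) and the sign reverses thereafter, matching the description in the text.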
C. CART Method

The most important factor was the percentage of positive lymph nodes. This factor was first divided into two categories (≤20% vs. greater) based on clinical experience (19). Seventy-three patients had more than 20% positive lymph nodes, of whom 54 have died so far; among the remaining 222 patients with less than 20% positive lymph nodes, 54 have also died. The analysis of the continuous factor, percentage of positive lymph nodes, showed that the predefined cutpoint of 20% was close to the optimal cutpoint (Fig. 5): the cutpoint with the highest value of the test statistic was 12% (χ²_LR = 124.5), followed by 21% (χ²_LR = 121.3). In the next step, the subsample with the high mortality rate was further divided on the same factor, using a cutpoint of 70%, into a group of 12 patients who all died and another group in which 42 of 61 have died. In the subsample with less than 20% positive lymph nodes, the factor T.SUB shows the best discrimination; the next split is performed with uPA. Altogether six subgroups are
Figure 5 Log-rank test statistic for selecting the optimal cutoff value for the split of the continuous factor NODOS.PR into two groups.
Figure 6 Result of the CART analysis after pruning. At each split, the value of the log-rank test statistic, the total number of patients (= n), and the number of deaths (= +) are given. For each terminal node, the relative risk (RR) compared with the total sample is additionally given.
identified in the optimal tree after pruning, given in Figure 6. There is a great difference in the mortality rates, ranging from 13 of 126 to 12 of 12. The relative risks given in Figure 6 for each of the terminal nodes represent the risk compared with the total sample; therefore some of the values are below 1 and some are above 1. Four of the six subgroups have a higher mortality rate than the total sample; the two remaining subgroups, containing about half of all patients, have a low mortality rate.
IV. DISCUSSION

Within the extended regression model, the nonlinear influence of PAI-1 and the significant change in the effects of AGE, T.SUB, and NODOS.GE during follow-up could additionally be detected.
Based on these results, a more detailed prognosis for an individual patient can be made. A further consequence of this analysis can be a different schedule for follow-up: patients with an increased risk at the beginning should be medically examined more frequently shortly after surgery. The results can also be used to investigate the factors in greater detail. There is a discussion in the literature regarding adjuvant treatment in gastric cancer (20); based on the result of CART, all patients except those in the two groups with the lowest risks (RR = 0.23 and RR = 0.47) seem to be candidates for some sort of adjuvant therapy. Fractional polynomials can be applied with standard software packages like SPSS or SAS. In particular, the time-varying effect can be analyzed in an easy way: the idea is simply to use available estimation methods for time-dependent covariates after applying suitable transformations. Therefore, at each observed failure time, the transformed value of the covariate, assuming a time-varying effect, has to be calculated. To simplify the analysis, the form of the relationship can be estimated in a univariate model; this form is then used in the multivariate analysis. A next step can be to extend this model to analyze interactions as well. On the other hand, treatment decisions based on the results of a regression analysis should be viewed with caution; it seems more appropriate to use the result of CART for the identification of subgroups of patients in which certain strategies should be applied. In any case it seems justified to use these extensions of the classic models to get more insight into the data and the disease. For the selection of variables in the classic Cox model, a stepwise procedure, either forward or backward, is mostly used. However, the effects of the selected variables are sometimes overestimated. In recent years some procedures to correct these estimates have been proposed. One method is called shrinkage (21).
A shrinkage factor λ (λ < 1), depending on the total number of regression parameters and the likelihood ratio statistic of the particular model, is calculated, and the estimated regression parameters β are multiplied by λ to give adequate values. The shrinkage factor can also be estimated by bootstrapping or cross-validation (7). Another approach, recently published by Tibshirani (22), is called the lasso. The idea there is to estimate the parameters β under the constraint ∑|βⱼ| ≤ c: the sum of the absolute standardized regression parameters should be less than some predefined value c. The effect is that some of the factors Zⱼ that are only borderline significant are dropped from the model. The estimate of β depends on the choice of c; a small value of c corresponds to a model with only a few parameters. Breiman (23) made a proposal on how to improve prediction with CART, called "bagging." The idea is to construct new samples by bootstrap techniques and to apply the CART method to each sample. For each sample a
new tree or decision rule is obtained. The result is a whole set of different decision rules. To classify a new patient, all these decision rules are applied and the average is calculated. It can be shown that the misclassification rate can be reduced in this way; the drawback of the method, however, is that no simple decision rule is available. In summary, the extended Cox model gives more insight into the further development of the disease. The result can be used to give better information about prognosis and to define an appropriate schedule for further medical examinations. The CART method helps to identify risk groups and can be used directly for treatment decisions.
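The global shrinkage correction mentioned above can be sketched with a common heuristic version of the factor, in the spirit of reference (21): the model chi-square minus the number of parameters, divided by the model chi-square. The parameter count below is a hypothetical value chosen for illustration, not one reported in this chapter.

```python
def heuristic_shrinkage(lr_chi2, n_params):
    """Heuristic global shrinkage factor:
    lambda = (LR chi-square - p) / LR chi-square, truncated at 0."""
    return max(0.0, (lr_chi2 - n_params) / lr_chi2)

# Illustration with the total likelihood ratio statistic R = 165 from Table 5;
# the count of 10 parameters is an assumed value for the selected model.
lam = heuristic_shrinkage(165, 10)    # about 0.94: mild shrinkage
beta5_shrunk = lam * (-1.05)          # shrink the coefficient of beta_5(Z) = -1.05/sqrt(Z)
```

With a large likelihood ratio statistic relative to the number of parameters, the factor stays close to one; for weakly fitting models it shrinks the coefficients heavily toward zero, which is exactly the intended correction for overestimated effects after stepwise selection.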
REFERENCES

1. Byar D. Identification of prognostic factors. In: Buyse M, et al., eds. Cancer Clinical Trials. Oxford University Press, 1984.
2. Hermanek P, Henson DE, Hutter RUP, Sobin LH. UICC TNM Supplement. A Commentary on Uniform Use. Berlin: Springer, 1993.
3. McGuire WL, Clark GM. Prognostic factors and treatment decisions in axillary-node-negative breast cancer. N Engl J Med 1992; 326:1756–1761.
4. Siewert JR, Fink U, et al. Gastric cancer. Curr Probl Surg 1997; 34:838–928.
5. Allgayer H, Heiss MM, Schildberg FW. Prognostic factors in gastric cancer. Br J Surg 1997; 84:1651–1664.
6. Schmidt G, Malik M, Ulm K, et al. Heart-rate turbulence after ventricular premature beats as a predictor of mortality after acute myocardial infarction. Lancet 1999; 353:1390–1396.
7. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996; 15:361–387.
8. Cox DR. Regression models and life tables. J R Stat Soc B 1972; 34:187–220.
9. Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall, 1990.
10. Green P, Silverman B. Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall, 1994.
11. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat 1994; 43:429–467.
12. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc B 1993; 55:757–796.
13. Ulm K, Klinger A, Dannegger F. Identifying and modeling prognostic factors with censored data. Stat Med 1998.
14. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. New York: Chapman and Hall, 1984.
15. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993; 88:457–467.
16. Hilsenbeck SG, Clark GM. Practical p-value adjustment for optimally selected cutpoints. Stat Med 1996; 15:103–112.
17. Nekarda H, Schmitt M, Ulm K, Wenninger A, Vogelsang H, et al. Prognostic impact of urokinase-type plasminogen activator and its inhibitor PAI-1 in completely resected gastric cancer. Cancer Res 1994; 54:2900–2907.
18. Japanese Research Committee for Gastric Cancer. The general rules for gastric cancer study in surgery and pathology. Jpn J Surg 1981; 11:127–138.
19. Roder JD, Bottcher K, Busch R, Wittekind C, Hermanek P, Siewert JR. Classification of regional lymph node metastasis from gastric carcinoma. German Gastric Cancer Study Group. Cancer 1998; 82:621–631.
20. Bleiberg H, Sahmoud T, Di Leo A, Cunningham D, Rougier P. Adequate number of patients are needed to evaluate adjuvant treatment in gastric cancer. J Clin Oncol 1998; 16:3714.
21. Van Houwelingen JC, le Cessie S. Predictive value of statistical models. Stat Med 1990; 9:1303–1325.
22. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996; 58:267–288.
23. Breiman L. Bagging predictors. Machine Learning 1996; 26:123–140.
19 Explained Variation in Proportional Hazards Regression

John O'Quigley
University of California at San Diego, La Jolla, California

Ronghui Xu
Harvard School of Public Health and Dana-Farber Cancer Institute, Boston, Massachusetts

I. EXPLAINED VARIATION IN SURVIVAL
A. Motivation

For many survival studies based on a regression model, in addition to the usual model fitting and diagnostic tools and the evaluation of relative and combined predictive effects, it is desirable to present summary measures estimating the percentage of explained variation. Making precise the notion of explained variation in the particular context of proportional hazards regression requires some thought. But before considering the specifics of the model more closely, roughly speaking we know that any suitable measure should reflect the relative importance of the covariates. This relative importance applies to the data set in hand, but in addition any measure should estimate some meaningful population counterpart, a population value that can be given a concrete and intuitively useful interpretation. To give the ideas a more tangible framework, consider a study of 2174 breast cancer patients, followed over a period of 15 years at the Institut Curie in Paris, France. A large number of potential and known prognostic factors were
O’Quigley and Xu
recorded. Detailed analyses of these data have been the subject of a number of communications. Suppose that we focus here on a subset of prognostic factors: age at diagnosis, histology grade, stage, progesterone receptor status, and tumor size. We would like to be able to say, for example, that stage explains some 20% of survival, but that once we have taken account of progesterone status, age, and grade, this figure drops to 5%. Or that by adding tumor size to a model in which the main prognostic factors are already included, the explained variation increases by a negligible amount, say from 32% to 33%. Or, given that some variable can explain so much variation, to what extent do we lose (or gain), in terms of these percentages, by recoding a continuous prognostic variable, age at diagnosis for example, into discrete classes on the basis of cutpoints? Note that for this latter problem the models are nonnested, so the problem is inherently more involved.
B. Explained Variation in Regression Models
Consider the pair of random variables (T, Z). Denote the marginal distribution functions by F(t) and G(z) and the conditional distribution functions by F(t|z) and G(z|t). A question of interest might relate to the reduction in variance of the random variable T obtained by conditioning on Z. The conditional variance of T given Z translates into predictability, for a normal model directly in terms of prediction intervals and for other models at least by virtue of the Chebyshev inequality. Quite generally, i.e., independently of any model, we have that

Var(T) = E{Var(T|Z)} + Var{E(T|Z)}    (1)
The above identity enables us to write down an expression for the proportion of explained variation Ω² as

Ω²(T|Z) = [Var(T) − E{Var(T|Z)}]/Var(T) = Var{E(T|Z)}/Var(T)    (2)

the notation Ω²(T|Z) reminding us which way round we are conditioning. We may also be interested in Ω²(Z|T), the two quantities coinciding for bivariate normal models. The above expression does not lean on any model. When there is no reduction in variance from conditioning on Z, then Var(T) = E{Var(T|Z)} and Ω² = 0. When knowledge of Z makes T deterministic, then Var(T|Z) = 0 and Ω² = 1. Intermediate values of Ω² have a precise interpretation in terms of percentages of explained variation as a consequence of Eq. (2). Apart from the marginal variance Var(T), the relevant quantity we need to define Ω² can be expressed as
E{Var(T|Z)} = ∫_𝒵 ∫_𝒯 {t − ∫_𝒯 u dF(u|z)}² dF(t|z) dG(z)    (3)
This elementary definition is helpful in highlighting two important and related points: first, the values t and z enter the equation only as dummy variables; second, consistent estimation of Ω² follows if we can consistently estimate F(t|z) and G(z). Given pairs of i.i.d. observations {(tᵢ, zᵢ); i = 1, . . . , n} and the empirical distribution functions Fₙ(t), Gₙ(z), and Fₙ(t|z), it is only necessary to replace F(t), G(z), and F(t|z) in Eq. (2) by Fₙ(t), Gₙ(z), and Fₙ(t|z), respectively, to obtain R² as a consistent estimator of Ω². It will also be helpful to consider an equivalent expression for E{Var(T|Z)} arising from a simple application of Bayes' theorem. Instead of Eq. (3) write
E{Var(T|Z)} = ∫_𝒯 ∫_𝒵 {t − [∫_𝒯 u g(z|u) dF(u)] / [∫_𝒯 g(z|u) dF(u)]}² dG(z|t) dF(t)    (4)
E{Var(T|Z; β = 0)} = E{Var(T)} = Var(T)    (5)
leading to an expression for Ω²(β₀) in which the role of β is readily understood:

Ω²(T|Z; β₀) = [E{Var(T|Z; β = 0)} − E{Var(T|Z; β = β₀)}]/E{Var(T|Z; β = 0)} = Var{E(T|Z)}/Var(T)    (6)
One of the variables, most often Z, may have been assigned certain values by design, and we model the conditional distribution of T given Z. This is the case with the proportional hazards model, where Z represents the covariate and T the elapsed time. It may seem more natural to work with Ω²(T|Z); however, this may not be the way to proceed, the definition Ω²(Z|T) having some advantage in this context. The reason is outlined in the following section.
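The variance decomposition behind Eqs. (1) and (2) is easy to check numerically. The sketch below is purely illustrative: it simulates a bivariate normal pair, for which the explained variation Ω²(T|Z) equals the squared correlation ρ², and estimates it by conditioning on quantile bins of Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200_000, 0.6
z = rng.standard_normal(n)
t = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # Corr(T, Z) = rho

# Empirical version of Eq. (2): condition on quantile bins of Z, then
# Omega^2 ~ 1 - E{Var(T|Z)} / Var(T) = Var{E(T|Z)} / Var(T).
n_bins = 100
edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
idx = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)    # bin index per subject
cond_var = np.array([t[idx == k].var() for k in range(n_bins)])
omega2 = 1.0 - cond_var.mean() / t.var()                     # equal-probability bins
print(round(omega2, 2))   # close to rho**2 = 0.36
```

Because the bins are equal-probability quantile bins, the unweighted average of the within-bin variances estimates E{Var(T|Z)} up to a small binning error, and the estimate lands near ρ² as the theory predicts.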
C. Schoenfeld Residuals and Explained Variation in Proportional Hazards Models
Inference in the proportional hazards model remains invariant under monotonic increasing transformations of the time scale. This is a fundamental property, expressed via a model including an unknown baseline hazard function: only the observed ranks of the failures matter, the actual values of the failure times themselves having no impact on parameter estimates and their associated variances. It could be argued that such a property ought to be maintained for an appropriate Ω²(β) measure and its sample-based estimate R²(β̂). For the definition Ω²(Z|T; β) failure rank invariance is respected, whereas for the definition Ω²(T|Z; β) it fails. More importantly, if we wish to accommodate time-dependent covariates, a fundamental feature of the Cox model, then Ω²(Z|T; β) can be readily generalized, maintaining an analogous interpretation, whereas Ω²(T|Z; β) is no longer even well defined. It can then be argued that a suitable measure of explained variation for the Cox model should relate to the predictability of the failure ranks rather than of the actual times: absence of effect should translate as 0%, perfect prediction of the survival ranks as 100%, and intermediate values should be interpretable. The measure introduced by O'Quigley and Flandre (1994) comes under this heading and corresponds to Ω²(Z|T; β). Xu (1996) shows that a reduction in the conditional variance of Z given T translates into greater predictability of the failure rankings given Z. These considerations lead to an Ω² of the form Ω²(Z|T; β), where
Ω²(Z|T; β) = [E{Var(Z(t)|T = t; β = 0)} − E{Var(Z(t)|T = t; β)}]/E{Var(Z(t)|T = t; β = 0)}    (7)
When talking about proportional hazards regression, this form is assumed unless indicated otherwise; we therefore suppress the notation Z|T in the definition of Ω², although the dependence on β may be indicated. We can write the above as

Ω²(β) = 1 − [∫ E_β{[Z(t) − E_β(Z(t)|t)]² | t} dF(t)] / [∫ E_β{[Z(t) − E₀(Z(t)|t)]² | t} dF(t)]    (8)
where E_β denotes expectation assuming the model to be true at the value β and E₀ is the expected value under the null model. We return to this below, but note that if we can consistently estimate all the quantities in Eq. (8), then our problem is solved. As it turns out (O'Quigley and Flandre 1994), the usual Schoenfeld residuals play a key role here, and it can be shown (Xu 1996) that

R²(β̂) = 1 − [∑_{δᵢ=1} rᵢ²(β̂) wᵢ] / [∑_{δᵢ=1} rᵢ²(0) wᵢ]    (9)
where the rᵢ(·) are the Schoenfeld (1982) residuals, evaluated at β̂ and at 0, is consistent for Ω²(β). In this expression the weights wᵢ are the decrements of the marginal Kaplan-Meier estimate. In many practical cases, ignoring the wᵢ by setting them all equal to one has little impact, providing a particularly simple expression in terms of the Schoenfeld residuals, which are typically an ingredient of any standard analysis. For ordinary linear regression, one minus the usual R² is the ratio of the average of the squared residuals to the average squared deviation about the overall mean. The estimate R² here has the same form: the squared residuals of linear regression are replaced by the squared Schoenfeld residuals, and the squared deviations about the overall mean are replaced by the squared deviations of Z about its mean values computed sequentially over the risk sets. In the absence of censoring, the quantity ∑ᵢ rᵢ²(β̂)/n can be viewed as the average discrepancy between the observed covariate and its expected value under the model, whereas ∑ᵢ rᵢ²(0)/n is the average discrepancy without a model. Censoring is dealt with by appropriately weighting the squared residuals.
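A numpy sketch of the estimator in Eq. (9) for a single fixed covariate. For simplicity the Kaplan-Meier weights wᵢ are all set to one (which, as noted above, often has little impact), so this is an illustration of the unweighted version rather than the exact weighted estimator.

```python
import numpy as np

def schoenfeld_r2(time, event, z, beta_hat):
    """R^2 of Eq. (9) with w_i = 1: one minus the sum of squared Schoenfeld
    residuals at beta_hat over the same sum at beta = 0."""
    time, event, z = map(np.asarray, (time, event, z))
    num = den = 0.0
    for i in np.where(event == 1)[0]:
        risk = time >= time[i]                    # risk set at this failure time
        w = np.exp(beta_hat * z[risk])
        e_beta = np.sum(z[risk] * w) / np.sum(w)  # model-based mean over the risk set
        e_null = z[risk].mean()                   # null-model mean over the risk set
        num += (z[i] - e_beta) ** 2               # squared Schoenfeld residual at beta_hat
        den += (z[i] - e_null) ** 2               # squared Schoenfeld residual at 0
    return 1.0 - num / den
```

At β̂ = 0 the two sums coincide and R² = 0, matching the requirement that absence of effect translates as 0%; under a strong covariate effect, the residuals at β̂ shrink and R² moves toward 1.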
II. ESTIMATION UNDER PROPORTIONAL HAZARDS

A. Model and Notation

Let T₁, T₂, . . . , Tₙ be the failure times and C₁, C₂, . . . , Cₙ the censoring times for individuals i = 1, 2, . . . , n. For each i we observe Xᵢ = min(Tᵢ, Cᵢ) and δᵢ = I(Tᵢ ≤ Cᵢ), where I(·) is the indicator function. Define the "at risk"
indicator Yᵢ(t) = I(Xᵢ ≥ t). We also use counting process notation: let Nᵢ(t) = I{Tᵢ ≤ t, Tᵢ ≤ Cᵢ} and N(t) = ∑ᵢ₌₁ⁿ Nᵢ(t). The left-continuous version of the Kaplan-Meier estimate of survival is denoted Ŝ(t), and the Kaplan-Meier estimate of the distribution function is F̂(t) = 1 − Ŝ(t). Usually we are interested in the situation where each subject has related covariates, or explanatory variables, Zᵢ (i = 1, 2, . . . , n). All the results given here hold under an independent censorship model. Mostly, for ease of exposition, we assume the covariate Z to be one dimensional. Z in general could be time dependent, in which case it is assumed to be a predictable stochastic process and we use the notation Z(t), Zᵢ(t), etc. The Cox (1972) proportional hazards model assumes that the hazard functions λᵢ(t) (i = 1, . . . , n) for individuals with different covariates Zᵢ(t) can be written

λᵢ(t) = λ₀(t) exp{βZᵢ(t)}    (10)
where λ 0 (t) is a fixed unknown ‘‘baseline’’ hazard function and β is a relative risk parameter to be estimated. B.
Basis for Inference
First some basic definitions. Let

π_i(β,t) = K(t)Y_i(t) exp{βZ_i(t)}   (11)

where K^{−1}(t) = Σ_{ℓ=1}^n Y_ℓ(t) exp{βZ_ℓ(t)}. Under the model, π_i(β,t) is exactly the conditional probability that at time t it is precisely individual i who is selected to fail, given all the individuals at risk and given that one failure occurs. Let 𝒵(t) be a step function of t with discontinuities at the points X_i, i = 1, . . . , n, at which the function takes the value Z_i(X_i). Also, for fixed t, define the expectation of Z(t) under the distribution {π_i(β,t)}_{i=1}^n by

ε_β(Z|t) = Σ_{ℓ=1}^n Z_ℓ(t) π_ℓ(β,t) = Σ_{ℓ=1}^n K(t) Y_ℓ(t) Z_ℓ(t) exp{βZ_ℓ(t)}   (12)
Statistical inference on β is usually carried out by maximizing Cox's (1975) partial likelihood, which is equivalent to obtaining the value of β satisfying

U_1(β) = ∫ {𝒵(t) − ε_β(Z|t)} dN(t) = 0   (13)

An alternative to the partial likelihood estimator, useful when β may not be constant with time, is given by the solution to (Xu and O'Quigley 1998)

U_2(β) = ∫ W(t){𝒵(t) − ε_β(Z|t)} dN(t) = 0   (14)

where W(t) = Ŝ(t){Σ_{i=1}^n Y_i(t)}^{−1}. Our purpose here is consistent estimation of Ω², not robust estimation of β, but the two estimates of Ω² in the following section are related in a way not dissimilar to the relationship between the above two estimators. For practical calculation, note that W(X_i) = F̂(X_i+) − F̂(X_i) = w_i at each observed failure time X_i, i.e., the jump of the Kaplan–Meier curve. This is also of theoretical interest since, under departures from proportional hazards, an estimate based on U_2(β) retains a solid interpretation as an average effect, whereas the estimate based on U_1(β) cannot be so interpreted in the presence of censoring (Xu 1996; Xu and O'Quigley 1998). This can be anticipated from the definition of W(X_i), whereby we can write

U_2(β) = ∫ {𝒵(t) − ε_β(Z|t)} dF̂(t) = 0   (15)
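For a small data set, the quantities in Eqs. (13)–(15) can be computed directly. The sketch below (invented data, one binary covariate, no tied event times) solves the score equation U_1(β) = 0 by bisection, which is valid because the partial likelihood is concave so U_1 is decreasing in β, and evaluates the weights W(X_i) as the jumps of the Kaplan–Meier curve:

```python
# Toy illustration of Eqs. (13)-(15): the Cox partial likelihood score
# equation and the Kaplan-Meier jump weights w_i = W(X_i).
# Assumes no tied event times; data are invented for illustration.

import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # observed times X_i
d = [1,   1,   0,   1,   1,   0,   1,   1  ]   # event indicators delta_i
z = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]   # one binary covariate Z_i

def eps_beta(beta, t):
    """epsilon_beta(Z | t): risk-set expectation of Z at time t, Eq. (12)."""
    num = sum(zi * math.exp(beta * zi) for xi, zi in zip(x, z) if xi >= t)
    den = sum(math.exp(beta * zi) for xi, zi in zip(x, z) if xi >= t)
    return num / den

def score(beta):
    """U_1(beta) of Eq. (13): sum of Schoenfeld residuals over event times."""
    return sum(zi - eps_beta(beta, xi)
               for xi, di, zi in zip(x, d, z) if di == 1)

def solve_score(lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection for U_1(beta) = 0; U_1 is decreasing in beta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def km_jumps():
    """w_i = Fhat(X_i+) - Fhat(X_i): the Kaplan-Meier jump at each event."""
    s = 1.0  # current value of the left-continuous KM survival estimate
    w = {}
    for xi, di in sorted(zip(x, d)):
        n_risk = sum(1 for xj in x if xj >= xi)
        if di == 1:
            w[xi] = s / n_risk          # jump = S(t-) * (1/n_risk), no ties
            s *= 1.0 - 1.0 / n_risk
    return w

beta_hat = solve_score()
weights = km_jumps()
```

Note that km_jumps() computes exactly W(X_i) = Ŝ(X_i){Σ Y_ℓ(X_i)}^{−1} of Eq. (14) at the event times.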
C. Estimating Ω²

Our basic task is accomplished in this section via a main theorem and a series of corollaries. The proofs are not given here; they can be found in Xu (1996), where proofs of the statements of the following section can also be found.

Theorem 1. Under model (10), an independent censoring mechanism, and where β̂ is any consistent estimate of β, the conditional distribution function of Z(t) given T = t is consistently estimated by

F̂(z|t) = P̂(Z(t) ≤ z | T = t) = Σ_{ℓ: Z_ℓ(t) ≤ z} π_ℓ(β̂,t)

Corollary 1. Defining

ε_β(Z^k|t) = Σ_{i=1}^n Z_i^k(t) π_i(β,t) = Σ_{i=1}^n K(t) Y_i(t) Z_i^k(t) exp{βZ_i(t)},  k = 1, 2, . . .   (16)

then ε_β̂(Z^k|t) provides consistent estimates of E_β(Z^k(t)|T = t) under the model.

In addition we have the following two results. Let

𝒥(β,b) = ∫_0^∞ W(t) ε_β{[Z(t) − ε_b(Z(t)|t)]² | t} dN(t)   (17)

then

Corollary 2. 𝒥(β,β) converges in probability to ∫ E_β{[Z(t) − E_β(Z(t)|t)]² | t} dF(t)   (18)

Corollary 3. 𝒥(β,0) converges in probability to ∫ E_β{[Z(t) − E_0(Z(t)|t)]² | t} dF(t)   (19)

Theorem 2. Define

R²_ε(β) = 1 − 𝒥(β,β)/𝒥(β,0)   (20)

Then R²_ε(β̂) converges in probability to Ω²(β) in Eq. (8).

Theorem 3. Let ℐ(b), for b = 0, β, be defined by

ℐ(b) = Σ_{i=1}^n ∫_0^∞ W(t){Z_i(t) − ε_b(Z|t)}² dN_i(t)   (21)

then

R²(β̂) = 1 − ℐ(β̂)/ℐ(0)   (22)

is a consistent estimate of Ω²(β₀) in Eq. (8).

Notice that the above-defined R²(β̂) is the same as Eq. (9). Our experience has been that when the proportional hazards model correctly generates the data, R²_ε will be very close in value to R². Indeed, we can show that, under the model, |R²(β̂) − R²_ε(β̂)| converges to zero in probability. When discrepancies arise, this would seem to be indicative of a failure in the model assumptions. Although R²_ε(β̂) is of interest in its own right, our main purpose in studying it has been to develop certain statistical properties and to provide a simple way to construct confidence intervals for the population quantity Ω²(β).

The coefficients R² and R²_ε and their population counterpart Ω² have a number of useful properties. We have R²(0) = 0, R²_ε(0) = 0, R²_ε(β) ≤ 1, and R²(β) ≤ 1. Although R²_ε and Ω² are nonnegative, we cannot guarantee the same for R². A negative value would nonetheless be unusual, corresponding to the case in which the best-fitting model, in a least-squares sense, provides a poorer fit than the null model. Our experience is that R²(β̂) will be at most slightly negative in finite samples, and only when β̂ is very close to zero. Both R² and R²_ε are invariant under linear transformations of Z and monotonically increasing transformations of T. Viewed as a function of β, R²(β) reaches its maximum close to β̂. In contrast, R²_ε(β) → 1 as |β| → ∞ and, as a function of β, R²_ε(β) increases monotonically with |β|. This last property (as well as all the stated properties of R²_ε) also applies to Ω²(β) and enables us to construct confidence intervals for Ω² from one for β, as illustrated in the example. Finally, we can show that R²(β̂) and R²_ε(β̂) are asymptotically normal.
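The two estimators can be written out directly for a small data set. The sketch below uses invented data; β̂ is fixed at a plausible illustrative value rather than obtained from a partial likelihood fit, and J_func is our label for the two-argument functional behind R²_ε in Eq. (17). The assertions check only properties guaranteed for any β: R²_ε lies in [0, 1] (since the risk-set variance minimizes the expected squared deviation) and R² never exceeds 1.

```python
# Toy computation of R^2 (Eq. 22) and R^2_eps (Eq. 20) for one covariate.
# Data are invented; no tied event times. beta_hat below is an illustrative
# value standing in for the partial likelihood estimate.

import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # observed times
d = [1, 1, 0, 1, 1, 0, 1, 1]                    # event indicators
z = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]    # covariate

def pi_weights(beta, t):
    """{pi_l(beta,t)}: conditional failure probabilities over the risk set."""
    w = [math.exp(beta * zl) if xl >= t else 0.0 for xl, zl in zip(x, z)]
    tot = sum(w)
    return [wl / tot for wl in w]

def eps(beta, t, f=lambda v: v):
    """epsilon_beta(f(Z) | t): risk-set expectation of f(Z) at time t."""
    return sum(f(zl) * pl for zl, pl in zip(z, pi_weights(beta, t)))

def km_jump_weights():
    """w_i = W(X_i), the Kaplan-Meier jumps at the event times."""
    s, w = 1.0, {}
    for xi, di in sorted(zip(x, d)):
        n_risk = sum(1 for xj in x if xj >= xi)
        if di:
            w[xi] = s / n_risk
            s *= 1.0 - 1.0 / n_risk
    return w

w = km_jump_weights()
beta_hat = 0.33   # illustrative; in practice, solve the score equation

def I_func(b):
    """I(b), Eq. (21): weighted sum of squared Schoenfeld residuals."""
    return sum(w[xi] * (zi - eps(b, xi)) ** 2
               for xi, di, zi in zip(x, d, z) if di)

def J_func(beta, b):
    """J(beta,b), Eq. (17): weighted expected squared deviations."""
    return sum(w[xi] * eps(beta, xi, lambda v: (v - eps(b, xi)) ** 2)
               for xi, di in zip(x, d) if di)

r2 = 1.0 - I_func(beta_hat) / I_func(0.0)            # Eq. (22)
r2_eps = 1.0 - J_func(beta_hat, beta_hat) / J_func(beta_hat, 0.0)  # Eq. (20)
```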
III. SUMS OF SQUARES INTERPRETATION

We have the following sums of squares decomposition for R²_ε(β):

ε_β{[Z − ε_0(Z|X_i)]² | X_i} = ε_β{[Z − ε_β(Z|X_i)]² | X_i} + {ε_β(Z|X_i) − ε_0(Z|X_i)}²   (23)

on the basis of which we can rearrange Eq. (20) so that

R²_ε(β) = Σ_{i=1}^n δ_i W(X_i){ε_β(Z|X_i) − ε_0(Z|X_i)}² / Σ_{i=1}^n δ_i W(X_i) ε_β{[Z − ε_0(Z|X_i)]² | X_i}   (24)

Furthermore, we can take Σ_{i=1}^n δ_i W(X_i) r_i²(β) to be a residual sum of squares analogous to that from linear regression, whereas Σ_{i=1}^n δ_i W(X_i) r_i²(0) corresponds to the total sum of squares. Then

Σ_{i=1}^n δ_i W(X_i){Z_i(X_i) − ε_0(Z|X_i)}²
  = Σ_{i=1}^n δ_i W(X_i){Z_i(X_i) − ε_β(Z|X_i)}² + Σ_{i=1}^n δ_i W(X_i){ε_β(Z|X_i) − ε_0(Z|X_i)}²
  + 2 Σ_{i=1}^n δ_i W(X_i){ε_β(Z|X_i) − ε_0(Z|X_i)}{Z_i(X_i) − ε_β(Z|X_i)}

Now the last term above is a weighted score that, according to Proposition 1 of Xu (1996), is asymptotically zero when β = β̂. So defining

SS_tot = Σ_{i=1}^n δ_i W(X_i) r_i²(0)
SS_res = Σ_{i=1}^n δ_i W(X_i) r_i²(β̂)
SS_reg = Σ_{i=1}^n δ_i W(X_i){ε_β̂(Z|X_i) − ε_0(Z|X_i)}²

we obtain an asymptotic decomposition of the total sum of squares into the residual sum of squares and the regression sum of squares; i.e.,

SS_tot = SS_res + SS_reg   (25)

holds asymptotically. So R² is asymptotically equivalent to the ratio of the regression sum of squares to the total sum of squares.
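The expansion of SS_tot into SS_res, SS_reg, and twice the cross term is an exact algebraic identity for any data; only the vanishing of the cross term is asymptotic (and only at β = β̂). A sketch with invented numbers standing in for the quantities at the event times:

```python
# Exact check of the sums-of-squares expansion behind Eq. (25). The arrays
# are invented stand-ins for: weights delta_i * W(X_i), observed covariates
# Z_i(X_i), fitted risk-set means eps_betahat(Z|X_i), and null means
# eps_0(Z|X_i).

wts  = [0.125, 0.118, 0.112, 0.160, 0.190, 0.295]
zobs = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
e_b  = [0.55, 0.38, 0.35, 0.56, 0.10, 0.05]   # eps_betahat(Z | X_i)
e_0  = [0.50, 0.43, 0.40, 0.50, 0.20, 0.10]   # eps_0(Z | X_i)

ss_tot = sum(w * (zi - e0) ** 2 for w, zi, e0 in zip(wts, zobs, e_0))
ss_res = sum(w * (zi - eb) ** 2 for w, zi, eb in zip(wts, zobs, e_b))
ss_reg = sum(w * (eb - e0) ** 2 for w, eb, e0 in zip(wts, e_b, e_0))
cross  = sum(w * (eb - e0) * (zi - eb)
             for w, zi, eb, e0 in zip(wts, zobs, e_b, e_0))
```

With real data and β = β̂, cross shrinks toward zero as n grows, recovering Eq. (25).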
IV. MULTIVARIATE EXTENSION

Most often we are interested in explanatory variables Z of dimension greater than one. Common classes of regression models consider the impact on T of a linear combination η = β′Z, where β is a vector of the same dimension as Z and a′b denotes the usual inner product of a with b. For the multivariate normal model, F(t|Z = z) and F(t|η = β′z) are the same, and so it is only necessary to consider η and not the actual values of z themselves. For other models we may not have such a result, but inasmuch as we consider the effect as essentially summarized via β′z, it makes sense to define the multiple coefficient of explained variation, known as the coefficient of determination in linear regression, as Ω²(T|η;β). Such a quantity would not be invariant to monotonic transformations of T and, assuming we deem this a requirement, we should instead consider Ω²(η|T;β). Everything now follows through exactly as for the univariate case, in which we work with the fitted Schoenfeld residuals and the null residuals. The exact linear combination we would use for η is of course unknown, and in practice we replace the vector β by the vector β̂. The multiple coefficient is then

R²(β̂) = 1 − Σ_{δ_i=1} [β̂′r_i(β̂)]² w_i / Σ_{δ_i=1} [β̂′r_i(0)]² w_i   (26)
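Eq. (26) works with the scalar projections β̂′r_i of the multivariate residual vectors. A small sketch with invented two-dimensional residuals and weights; note that rescaling β̂ leaves the ratio unchanged, in line with the invariance of R² under linear transformations:

```python
# Sketch of the multiple coefficient of Eq. (26): project the fitted and
# null Schoenfeld residual vectors onto beta_hat, then form the weighted
# ratio of sums of squares. All numbers are invented for illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def r2_multi(beta_hat, r_fit, r_null, w):
    """1 - sum w_i (b'r_i(bhat))^2 / sum w_i (b'r_i(0))^2 over events."""
    num = sum(wi * dot(beta_hat, ri) ** 2 for wi, ri in zip(w, r_fit))
    den = sum(wi * dot(beta_hat, ri) ** 2 for wi, ri in zip(w, r_null))
    return 1.0 - num / den

beta_hat = [0.8, -0.4]                            # fitted coefficient vector
w = [0.2, 0.15, 0.25, 0.4]                        # Kaplan-Meier jumps w_i
r_fit  = [[0.3, -0.1], [-0.2, 0.2], [0.1, 0.0], [-0.1, -0.1]]  # r_i(beta_hat)
r_null = [[0.6, -0.3], [-0.5, 0.4], [0.4, -0.2], [-0.4, 0.1]]  # r_i(0)

r2 = r2_multi(beta_hat, r_fit, r_null, w)
```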
V. OTHER SUGGESTED MEASURES
There have been other suggestions for suitable measures of explained variation under the proportional hazards model. The earliest suggestion dates back to Harrell (1986). His measure depends heavily on censoring, and this effectively rules it out for practical use. Kent and O'Quigley (1988) developed a measure based on the Kullback–Leibler information gain, which can be interpreted as the proportion of randomness in the observed survival times explained by the covariates. The principal difficulty with the Kent and O'Quigley measure was its complexity of calculation, although a very simple approximation was suggested and appeared to work well. Their measure was not able to accommodate time-dependent covariates. Korn and Simon (1990) suggested a class of potential functionals of interest, such as the conditional median, and evaluated the explained variation via an appropriate distance measuring the ratio of average dispersions with the model to those without a model. Their measures are not invariant to time transformation, nor can they accommodate time-dependent covariates, but they have some advantage in generality, being applicable to a much wider class of models than the proportional hazards one. Schemper (1990, 1994) introduced the concept of individual survival curves for each subject, with the model and without the model. Interpretation is difficult. As with the Harrell measure, the Schemper measures depend on censoring, even when the censoring mechanism is independent of the failure mechanism. Other measures have also been proposed in the literature (see, e.g., Schemper and Stare 1996), but it is not our intention here to give a complete review of them.

Schemper and Kaider (1997) suggested multiple imputation as a way to deal with censoring. This is a promising idea, although further study of the statistical properties of the approach is required. Intuitively, such an approach would come under the heading of providing estimators for the relevant population quantities of Section I: the unavailable empirical estimators are replaced by estimators deriving from an iterative algorithm. Currently this appears somewhat ad hoc, the population model not being referred to in the work of Schemper and Kaider (1997). However, it seems quite likely that the approach is consistent; further work is needed to demonstrate this and to establish under what conditions it holds. An alternative to the information gain measure of Kent and O'Quigley (1988), similar in spirit but leaning on Theorem 1 and the conditional distribution of Z given T rather than the other way around, leads to a coefficient with good properties (Xu and O'Quigley 1999). In practical examples this coefficient based on information gain appears to give close agreement with the R² measure discussed here. There are in fact theoretical reasons for this agreement, and provided the data do not strongly contradict the proportional hazards assumption, we anticipate the two coefficients to be close to one another.
VI. ILLUSTRATION

We illustrate the basic ideas on the well-known Freireich (1963) data, which record the remission times of 42 patients with acute leukemia treated with 6-mercaptopurine (6-MP) or placebo. These data were used in the original Cox (1972) paper and have been studied by many other authors under the proportional hazards model. Our estimate of the regression coefficient is β̂ = 1.53, with R²(β̂) = 0.386 and R²_ε(β̂) = 0.371. The 95% confidence interval for Ω²(β) obtained using the monotonicity of Ω²(β), i.e., plugging the two end points of the interval for β into R²_ε(·), is (0.106, 0.628). A bootstrap with 1000 resamples, using Efron's bias-corrected and accelerated (BCa) method, gives confidence intervals (0.111, 0.631) using R² and (0.103, 0.614) using R²_ε. These agree very well with the interval obtained through monotonicity. For practical calculation of a confidence interval for Ω²(β), we recommend the "plug-in" method, which is the most computationally efficient.

The above R²(β̂) can be compared with some of the suggestions of the previous section. For the same data, the measure proposed by Kent and O'Quigley (1988) takes the value 0.37. The explained variation proposals of Schemper (1990), based on empirical survival functions per subject, result in (his notation) V₁ = 0.20 and V₂ = 0.19, and Schemper's later correction (1994) gives V₂ = 0.29. There is no obvious link between the Schemper measures and those presented here, since the Schemper measures depend on an independent censoring mechanism. The measure of Korn and Simon (1990), based on quadratic loss, gives the value 0.32. Although there is some comfort to be gained from a value not dissimilar to that obtained above, again there appear to be no good grounds for investigating a potential association between the measures. This is because their measure, unlike the measure suggested here and the partial likelihood estimator itself, is not invariant to monotone increasing transformations of time. For these data the value 0.32 drops to 0.29 if the failure times are replaced by the square roots of the times. The Korn and Simon measure is most useful when the time variable provides more information than just an ordering, whereas rank ordering is the only assumption we need for the measure presented in this chapter. Finally, the measure of Schemper and Kaider (1997) is calculated to be 0.34, and the measure of Xu and O'Quigley (1999) turns out to be 0.40.
REFERENCES

Cox DR. Regression models and life tables (with discussion). J R Stat Soc B 1972; 34:187–220.
Cox DR. Partial likelihood. Biometrika 1975; 62:269–276.
Freireich EO. The effect of 6-mercaptopurine on the duration of steroid induced remission in acute leukemia. Blood 1963; 21:699–716.
Harrell FE. The PHGLM Procedure. SAS Supplement Library User's Guide, Version 5. Cary, NC: SAS Institute Inc., 1986.
Kent JT, O'Quigley J. Measures of dependence for censored survival data. Biometrika 1988; 75:525–534.
Korn EL, Simon R. Measures of explained variation for survival data. Stat Med 1990; 9:487–503.
O'Quigley J, Flandre P. Predictive capability of proportional hazards regression. Proc Natl Acad Sci USA 1994; 91:2310–2314.
Schemper M. The explained variation in proportional hazards regression. Biometrika 1990; 77:216–218.
Schemper M. Correction: the explained variation in proportional hazards regression. Biometrika 1994; 81:631.
Schemper M, Kaider A. A new approach to estimate correlation coefficients in the presence of censoring and proportional hazards. Comput Stat Data Anal 1997; 23:467–476.
Schemper M, Stare J. Explained variation in survival analysis. Stat Med 1996; 15:1999–2012.
Schoenfeld DA. Partial residuals for the proportional hazards regression model. Biometrika 1982; 69:239–241.
Xu R. Inference for the Proportional Hazards Model. Ph.D. thesis, University of California, San Diego, 1996.
Xu R, O'Quigley J. Estimating average log relative risk under nonproportional hazards. ASA 1998 Proceedings of the Biometrics Section, 216–221.
Xu R, O'Quigley J. A R² type measure of dependence for proportional hazards models. Nonparam Stat 1999; 12:83–107.
20
Graphical Methods for Evaluating Covariate Effects in the Cox Model

Peter F. Thall and Elihu H. Estey
University of Texas M.D. Anderson Cancer Center, Houston, Texas
I. INTRODUCTION
In medicine, patient characteristics often have profound effects on prognosis. For example, in oncology a patient's age, extent of disease, or the presence of a particular cytogenetic or molecular abnormality typically has a substantial effect on his or her survival. When comparing the effects of two or more treatments on patient outcome, a fundamental scientific problem is that apparent treatment differences may result not from the inherent superiority of one particular treatment over another but rather from differences between the patients in the treatment groups. This observation has led to the use of the randomized clinical trial to ensure that groups given different treatments are on average similar with regard to characteristics that may be related to response ("covariates"). Although randomization is an essential tool in comparative treatment evaluation, it cannot guarantee that treatment groups are perfectly balanced with regard to all variables that may be related to outcome. This is especially true in small (e.g., <200-patient) randomized trials. In the more common and problematic setting where treatment comparisons are based on data from separate trials, as when evaluating data from two or more single-arm phase II trials of different treatments, the potential for the effects of unbalanced covariates to confound actual treatment differences is much greater. Therefore, the use of statistical methods to account for variables that may influence patient outcome is critically important in evaluating both randomized and nonrandomized clinical trials. Although unobserved "latent" effects are also an important consideration when combining data from multiple clinical centers or separate trials, we do not address this issue here. For the interested reader, treatments of this problem are given by Li and Begg (2) and Stangl (3).

Accounting for individual patient characteristics when evaluating treatment effects entails some form of statistical regression analysis. The Cox regression model (1) is the most widely used tool for evaluating the relationship between covariates and time-to-event treatment outcomes, such as survival time or disease-free survival (DFS) time. Unfortunately, the assumptions underlying the Cox model are often violated in practice. In particular, many published results in the medical literature are based on fitted models for which no goodness-of-fit analysis has been performed. If such model criticism is not done and if the qualitative relationship between a covariate and patient outcome differs from that assumed by a particular model, then the statistical estimate of the covariate's effect under the fitted model may greatly misrepresent its actual effect. When this is the case, apparent covariate effects and treatment effects obtained from a fitted Cox model may be substantively misleading.

The purpose of this chapter is to illustrate some statistical methods for assessing goodness-of-fit under the Cox model and also for correcting poor model fit. Formal descriptions of the methods are given by Therneau et al. (4), Grambsch and Therneau (5), Grambsch (6), and in Chapter 4 of the important book by Fleming and Harrington (7). An excellent, albeit somewhat more mathematical, explanation of the type of methods discussed here is given in Chapter 4.6 of Fleming and Harrington. We do not attempt to discuss all existing methods for assessing goodness-of-fit of the Cox model, since the current literature is quite extensive. Some earlier references are Crowley and Hu (8), Crowley and Storer (9), Kay (10), Schoenfeld (11), Cain and Lange (12), and Harrell (13). Our goal is to discuss and illustrate by example some useful graphical displays and statistical tests in terms that can be understood by physicians and other nonmathematical readers. The graphical methods reveal qualitative relationships between covariates and outcome that are not otherwise apparent, and they also lead to use of the extended Cox model, which allows the possibility of covariate effects that vary with time, to obtain an improved model fit. Because these methods provide more accurate and reliable evaluation of covariate and treatment effects on patient outcome, their application often leads in turn to profound changes in the substantive inferences formed from a particular data set. These techniques are straightforward to implement using freely available computer programs in either Splus or SAS (13,14). Our goal is to bring these methods into more widespread use, especially in the analysis of medical data.

We illustrate the methods using three data sets arising from clinical trials in acute myelogenous leukemia (AML) and myelodysplastic syndromes (MDS) conducted at M.D. Anderson Cancer Center: a data set arising from several phase II trials of combination chemotherapies each involving fludarabine (16), where 415 of 530 patients had events (died or relapsed), with a 26-week median DFS time; a data set in which the effects of two proteins, caspase 2 (C₂) and caspase 3 (C₃), on survival were evaluated (17), where 116 of the 185 good-prognosis patients had events, with a median DFS time of 82 weeks; and a data set arising from a four-arm randomized trial designed to evaluate the effects of all-trans retinoic acid (ATRA) and the growth factor granulocyte colony-stimulating factor on survival (18), where 139 of 215 patients died, with a median survival time of 28 weeks. We refer to these as the "fludarabine," "caspase," and "ATRA" data sets. For one example we also use simulated data having specific properties. Most of our examples deal with the relationship between a single covariate and survival or DFS time. However, we also discuss how properly modeling covariate effects may affect treatment effect estimates in a multivariate model (Sect. VII) and also the use of conditional survival plots to assess interactions between two covariates (Sect. XII). The methods apply quite generally to any time-to-event outcome subject to right censoring, as is the case with patients who have not suffered the event in question when the trial is analyzed.
II. COX REGRESSION MODEL

Consider the common problem of assessing the relationships between each of a collection of covariates Z = (Z_1, . . . , Z_k) and the elapsed time T from a "baseline," usually defined as the time of diagnosis or initiation of treatment, to a particular event such as relapse or death. The covariates typically include one or more indicator variables denoting treatments given to patients. Each covariate may or may not be of value in predicting T, and those covariates that are predictive typically differ substantially in their qualitative relationships and strength of association with T. In addition, for some patients the value of T may be right censored, in that T is not observed but rather is known only to be no smaller than a censoring time, usually because the study ended without the patient experiencing the event. Many models and methods deal with this type of data (7,19). By far the most commonly used methods are based on the Cox regression model (1).

The Cox model assumes that the instantaneous hazard λ(t; Z) of the event occurring at time t from baseline for a patient with covariates Z takes the form λ(t; Z) = λ_0(t) exp(β_1 Z_1 + . . . + β_k Z_k), where λ_0(t) is an underlying baseline hazard function not depending on the covariates and β_1, . . . , β_k are unknown parameters quantifying the covariate effects. The expression β_1 Z_1 + . . . + β_k Z_k, known as the linear component of the model, is typically the main focus of a Cox model analysis, since the β_j's quantify the covariate effects. Because β_1 Z_1 + . . . + β_k Z_k = log_e{λ(t; Z)/λ_0(t)}, the covariates are said to have a log linear effect on the hazard of the event. If a particular β_j = 0, then the covariate Z_j has no effect on the hazard. If one defines the binary indicator variable Z_A = 1 for treatment group A and Z_A = 0 for treatment B and includes β_A Z_A in the linear component, then exp(β_A) is the relative risk, or hazard ratio, of the event for a patient given treatment A compared with B, regardless of the patient's other covariates. For this reason, the Cox model is also called the proportional hazards model. A relative risk of 1 corresponds to the case where the risk of the event is the same with the two treatments. Numerical values β_A < 0, β_A = 0, and β_A > 0 correspond to relative risks below 1, equal to 1, and above 1, respectively. Thus, a value of β_A significantly less (greater) than 0 is the basis for inferring that A is superior (inferior) to B. Two crucial assumptions underlying the Cox model are that the covariates have a log linear effect on the hazard of the event and that the value of each β_j does not vary with time. Thus, for example, the relative risk exp(β_A) associated with treatment A vis-à-vis treatment B is the same at any time t.
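The proportionality can be seen directly: the baseline hazard cancels from the ratio of hazards for any two patients who differ only in treatment, so the hazard ratio exp(β_A) depends on neither t nor the other covariates. A minimal sketch (the coefficients and baseline function are invented for illustration):

```python
# Cox model hazard and the cancellation of the baseline hazard in a
# hazard ratio. Coefficients and baseline are illustrative only.

import math

def hazard(t, z, betas, baseline):
    """lambda(t; Z) = lambda_0(t) * exp(sum_j beta_j * Z_j)."""
    return baseline(t) * math.exp(sum(b * zj for b, zj in zip(betas, z)))

betas = [-0.33, 0.10]                          # [treatment, other covariate]
baseline = lambda t: 0.05 * (1.0 + 0.2 * t)    # any positive function works

z_treated = [1.0, 2.5]   # Z_A = 1, other covariate = 2.5
z_control = [0.0, 2.5]   # Z_A = 0, same other covariate

hr_early = (hazard(1.0, z_treated, betas, baseline)
            / hazard(1.0, z_control, betas, baseline))
hr_late = (hazard(50.0, z_treated, betas, baseline)
           / hazard(50.0, z_control, betas, baseline))
```

Both ratios equal exp(β_A) = exp(−0.33), whatever baseline function or time is used.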
III. GOODNESS-OF-FIT

The Cox model has proved to be an extremely useful statistical tool for evaluating covariate effects on events that occur over time. When the underlying model assumptions are not met, however, in that the model does not fit the data well, the sort of analysis described above may be invalid. Since no model can be perfectly correct, the practical question is whether a given statistical model provides a reasonable fit to the data at hand. Consequently, practical application of any statistical model should include some form of data-driven model criticism, commonly known as goodness-of-fit analysis. Ideally, model criticism should also include consideration of fits obtained with other similar data sets. We do not pursue this issue here, however, since it involves notions of Bayesian inference, cross-validation, and meta-analysis that go far beyond the present discussion. The point is that use of a statistical model without some goodness-of-fit assessment is bad scientific practice, since it may easily produce flawed or substantively misleading inferences. It is this danger, given the widespread use of the Cox model to analyze medical data, that has motivated this chapter.
IV. MARTINGALE RESIDUAL PLOTS

In linear regression analysis the residuals are the differences, one for each patient, between the observed outcome variable and the value of that variable predicted by the fitted regression model. These "observed minus predicted" values may be used to assess how well the regression model fits the data. A wide variety of methods for residual analysis is discussed in the statistical literature, and each method applies to a particular type of regression model (e.g., linear, logistic, Cox). The martingale residuals associated with a fitted Cox model are the analog of the ordinary residuals associated with a linear regression model. Specifically, martingale residuals are numerical values, one for each patient in the data set, that quantify the excess risk of the event not explained by the model. A large positive (negative) martingale residual r_m for a patient corresponds to a fitted model that underestimates (overestimates) the risk of the event for that patient. This fact may be exploited to assess goodness-of-fit in terms of a martingale residual plot, obtained by plotting a point for each patient corresponding to the patient's value of r_m on the vertical axis and the patient's value of the particular covariate Z on the horizontal axis. The martingale residuals for this plot are computed from a Cox model that includes only a baseline hazard function but no covariates, so that r_m essentially adjusts the patients' observed times for censoring. This produces a scattergram of points, as shown in Figure 1. Applying a local regression smoother (14,20) to create a line through the scattergram then allows one to examine visually the nature of the relationship between r_m and the covariate Z. Subsequently, after a model with an appropriately transformed version of Z has been fit, the plot on Z of the residual r_m based on this new model should show no pattern other than random noise.

In most applications, the pattern revealed by the smoother is impossible to determine by visual inspection of the scattergram alone. For larger data sets, say with 1000 patients or more, a plot of the smoothed line alone may be visually clearer, since the scattergram of points tends to overwhelm the picture. Smoothed scattergrams may be constructed easily using several widely available statistical software packages, including Splus (13,14,21). If Z satisfies the proportional hazards assumption (i.e., if it has a log linear effect on the hazard), then aside from random variation the smoothed line will be straight. Nonlinear patterns correspond to violation of the proportional hazards assumption, and in this case the shape of the smoothed line indicates the relationship between outcome and Z that should produce a good fit. Thus, a simple, routine way to fit Cox models is to first examine the martingale residual plot for each covariate Z, if necessary fit a new model incorporating the indicated relationship, examine a residual plot for the new model, and repeat this process until one obtains a good fit.
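One standard construction of the null-model martingale residual is r_m,i = δ_i − Λ̂(X_i), where Λ̂ is the Nelson–Aalen estimate of the cumulative hazard; positive values mean the event arrived "earlier than expected" under no covariate effect. A self-contained sketch with invented data (in practice a lowess smoother would then be run on the pairs (Z_i, r_m,i)):

```python
# Null-model martingale residuals r_m = delta_i - Lambda_hat(X_i), using
# the Nelson-Aalen cumulative hazard estimate. Data are invented; in the
# text these residuals would next be plotted against a covariate and
# smoothed. No tied event times are assumed.

x = [2.0, 3.0, 4.0, 4.5, 6.0, 7.0, 9.0, 11.0]   # observed times
d = [1,   0,   1,   1,   0,   1,   1,   0   ]   # event indicators

def nelson_aalen(t):
    """Cumulative hazard estimate at time t: sum of 1/n_risk over events."""
    total = 0.0
    for xi, di in zip(x, d):
        if di and xi <= t:
            n_risk = sum(1 for xj in x if xj >= xi)
            total += 1.0 / n_risk
    return total

mart_resid = [di - nelson_aalen(xi) for xi, di in zip(x, d)]
```

Summing Λ̂(X_i) over subjects gives the total number of events, so the residuals sum to zero exactly, just as ordinary least-squares residuals do.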
Figure 1 Martingale residual scatterplot on white blood cell count, from the caspase data.
V. TIME-VARYING COVARIATE EFFECTS

Time-to-event data often deviate from the usual Cox model in that the effect of a given covariate may vary over time. The extended Cox model, allowing one or more of the β_j's to vary with time, has hazard function of the form λ(t; Z) = λ_0(t) exp[β_1(t)Z_1 + . . . + β_k(t)Z_k]. Under this extended model, the log linear effect and corresponding risk associated with Z_j at time t are given by β_j(t)Z_j and exp[β_j(t)Z_j], respectively. An application where this extension often is appropriate is in evaluating the effect of baseline performance status (PS) on survival in acute leukemia: the risk of regimen-related death for patients with poor PS decreases once they survive chemotherapy, and hence β_PS(t) may move closer to 0 as t increases beyond the early period that includes treatment. A related extension is that which allows covariates to be evaluated repeatedly over time rather than recorded only at baseline (t = 0), so that Z_j(t) denotes the value of the jth covariate at time t and the hazard function takes the extended form λ(t; Z) = λ_0(t) exp[β_1 Z_1(t) + . . . + β_k Z_k(t)]. These two extensions are computationally very similar, although we do not explore this point here.
When the effect of a covariate Z varies over time, it is essential to assess the form of β(t) to determine how Z actually affects patient survival. A graphical method for doing this, similar to the martingale residual plot, is the Grambsch–Therneau–Schoenfeld (GTS) residual plot (5,11), also known as a scaled Schoenfeld residual plot. A smoothed GTS plot provides a picture of β(t) as a function of t. This plot has an accompanying statistical test, due to Grambsch and Therneau (5), of whether β varies with time versus the null hypothesis that it is constant over time; hence, it tests whether the ordinary Cox model is appropriate. This test is very general in that, for each of several transformations of the time axis, it takes the form of a particular goodness-of-fit test for the Cox model. We illustrate these methods for assessing goodness-of-fit by example, with emphasis on the graphical displays described above. Most of our examples simplify things by focusing on the effects of one covariate for the sake of illustration. In practice, each application described here would be followed by fitting and evaluating multivariate models incorporating whatever forms are determined in the univariate fits.
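As a numerical illustration of the extended model (not drawn from the chapter's data): suppose the performance-status effect decays as β_PS(t) = β_0 e^{−t/τ}, so the relative risk exp{β_PS(t)Z_PS} for a poor-PS patient starts large and approaches 1 as t grows, mimicking regimen-related risk that fades after the treatment period. The values of β_0 and τ below are invented:

```python
# Extended Cox model with a time-varying coefficient:
# lambda(t; Z) = lambda_0(t) * exp(beta(t) * Z). Here beta(t) decays with
# time, as described for performance status (PS) in the text.
# beta0 and tau are invented for illustration (time in weeks, say).

import math

beta0, tau = 1.2, 4.0

def beta_ps(t):
    """Time-varying coefficient beta_PS(t) = beta0 * exp(-t / tau)."""
    return beta0 * math.exp(-t / tau)

def relative_risk(t, z_ps):
    """exp{beta_PS(t) * Z_PS}: risk for poor PS (z_ps = 1) vs. good PS (0)."""
    return math.exp(beta_ps(t) * z_ps)

rr_week1 = relative_risk(1.0, 1)    # large early excess risk
rr_week26 = relative_risk(26.0, 1)  # effect nearly gone later
```

A smoothed GTS plot computed from such data would show a curve decreasing toward zero, rather than the horizontal line implied by the ordinary Cox model.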
VI. A COVARIATE WITH NO PROGNOSTIC VALUE We begin with an illustration of what a martingale residual plot looks like for a covariate that is of no value for predicting outcome. Figure 1 is a martingale residual plot from the caspase data. We plotted r m on white blood count (WBC) for each patient, which produced the scattergram of points, and ran a local weighted regression (‘‘lowess’’) smoother (18) through the points to obtain the solid line. Note that very few patients have very large WBC values; consequently, most of the points in the scattergram are forced into a small area in the left portion of the figure, whereas the right portion of the smoothed line is very sensitive to the locations of a few points. A simple way to deal with this common problem is to truncate the WBC domain by excluding a small number of patients with large WBC values. This produces Figure 2, which gives a clearer picture of the true relationship between WBC and DFS for most of the data points. In each illustration given below, we similarly truncate the domain of the covariate as appropriate. Another method for dealing with a scattergram in which most of the points occupy a small portion of the plot is to transform the covariate, for example, by replacing Z with log(Z ). Aside from random fluctuations, the smoothed line in Figure 2 is straight, suggesting that the assumption of a log linear hazard is appropriate. A very important point in interpreting these graphs is that one may see patterns in any plot if one stares at it long enough; hence, a statistical test should accompany any graphical method to determine whether an apparent pattern is real. Here, the Cox model assumptions are reasonable since the p value of a Grambsch-Therneau goodness-of-fit test is p ⫽ 0.42. There is no relationship
418
Thall and Estey
Figure 2 Martingale residual scatterplot as in Figure 1, but with right truncation of the white blood count domain.
between WBC and DFS, since p = 0.83 for the test of β_WBC = 0 versus the alternative hypothesis β_WBC ≠ 0 under the usual Cox model with β_WBC WBC as its linear component. Thus, WBC is of no value for predicting DFS in this data set.
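For readers curious about what lies behind such a plot: under a null model, the martingale residual is simply the event indicator minus the Nelson-Aalen cumulative hazard evaluated at the subject's follow-up time; the residuals are then plotted against the covariate and smoothed. A minimal sketch with hypothetical toy data (not the caspase data):

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard estimate, keyed by observed time."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    cumhaz, H, at_risk, i = {}, 0.0, len(times), 0
    while i < len(order):
        t = times[order[i]]
        tied = [j for j in order[i:] if times[j] == t]  # handle tied times
        H += sum(events[j] for j in tied) / at_risk
        cumhaz[t] = H
        at_risk -= len(tied)
        i += len(tied)
    return cumhaz

def martingale_residuals(times, events):
    """r_i = delta_i - cumulative hazard at t_i, under the null model."""
    ch = nelson_aalen(times, events)
    return [events[i] - ch[times[i]] for i in range(len(times))]

# toy data: 5 subjects, 3 events
r = martingale_residuals([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print([round(x, 2) for x in r])
```

In practice one would then run a lowess smoother through the scatter of these residuals against the covariate, exactly as described above for WBC. Note that martingale residuals sum to zero by construction.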
VII. A QUADRATIC EFFECT

The following example illustrates both the importance of including relevant patient covariates when evaluating treatment effects and the importance of using goodness-of-fit analyses to model covariate effects properly. A fit of the usual Cox model with linear component β_ATRA ATRA to the ATRA data, summarized as Model 1 in Table 1, yields a test of the hypothesis β_ATRA = 0 versus β_ATRA ≠ 0 with p value 0.055. The estimated relative risk exp(−0.329) = 0.72 seems to imply that ATRA reduces the risk of death or relapse in this patient group and that this reduction is both statistically and medically significant. Since baseline platelet count (platelets) often has a significant effect on DFS in treatment of hematologic diseases, it also seems reasonable to include platelets in the model.
Graphical Methods for the Cox Model

Table 1  Platelets, Cytogenetics, and the ATRA Effect

Model  Covariate    Estimated coefficient    SE      p           Model LR (df)
1      ATRA         −0.329                   0.172   0.055       3.69 (1)
2      ATRA         −0.297                   0.172   0.085       13.1 (2)
       Platelets    −0.299                   0.105   0.005
3      ATRA         −0.231                   0.173   0.18        24.4 (3)
       Platelets    −0.594                   0.134   9.0 × 10⁻⁶
       Platelets²    0.170                   0.044   1.1 × 10⁻⁴
4      ATRA         −0.204                   0.174   0.24        29.4 (4)
       Platelets    −0.546                   0.136   5.7 × 10⁻⁵
       Platelets²    0.163                   0.045   2.8 × 10⁻⁴
       m5m7          0.415                   0.181   0.022
The resulting fitted model is summarized as Model 2 in Table 1. This fit indicates that a higher platelet count is a highly significant predictor of better DFS and that, after accounting for platelets, the ATRA effect is still marginally significant with p = 0.085. Moreover, this model provides a better overall fit, since its likelihood ratio (LR) statistic is 13.1 on 2 degrees of freedom (df), p = 0.0014, compared with only 3.69 on 1 df, p = 0.055, for the model including the ATRA effect alone. A closer analysis leads to rather different conclusions. The smoothed martingale residual plot given in Figure 3 indicates that the risk of relapse or death initially decreases as platelet count increases but that, as platelet count rises above roughly 150 × 10³, the risk of an event begins to increase. Thus, the plot suggests that either a very low or a very high platelet count is prognostically disadvantageous. In particular, the relationship between platelet count and DFS is not log linear, as assumed when the standard Cox model is fit, but rather appears to be log parabolic. A Cox model that includes both platelets and platelets² as covariates would account for the parabolic shape suggested by Figure 3. This is done by fitting a Cox model with linear component β_1 ATRA + β_2 platelets + β_3 platelets², summarized as Model 3 in Table 1. Since the hypothesis β_3 = 0 reduces the log parabolic model to the simpler log linear model, the test of this hypothesis addresses the question of whether the bend in the line in Figure 3 is significant or is merely an artifact of random variation in the data. The p value 1.1 × 10⁻⁴ corresponding to this test indicates that the log parabolic model is indeed appropriate. It is worth noting that the question of whether β_2 = 0 under this model is essentially irrelevant, and in general it is standard practice to include the lower
Figure 3 Martingale residual scatterplot on platelets, from the ATRA data.
order term Z whenever Z² has a significant coefficient. The martingale residual plot based on the fitted Model 3, given in Figure 4, shows a pattern consistent with random noise, indicating that the parabolic model has adequately described the relationship between platelets and DFS. This is underscored by the LR test for the overall fit of Model 3 (LR = 24.4 on 3 df, p = 2.0 × 10⁻⁵), which shows quantitatively that Model 3 provides a substantially better fit than Model 2. Perhaps most importantly, now that platelet count has been modeled properly, the ATRA effect is no longer even marginally significant (p = 0.18). We take this analysis one step further by adding to the linear component the indicator, m5m7, that the patient has the cytogenetic abnormality characterized by the loss of specific regions of the fifth and seventh chromosomes. This is summarized as Model 4 in Table 1, which shows that m5m7 is a significant predictor of worse DFS and that the inclusion of this covariate further reduces the prognostic significance of ATRA (p = 0.24). This example illustrates the common scientific phenomenon that an apparently significant treatment effect may disappear entirely once patient prognostic covariates are properly accounted for. It is also notable how easily the defect in the log linear model was revealed and corrected. A final point is that there are
Figure 4 Martingale residual scatterplot on platelets, based on fitted model including a parabolic model for platelets.
a number of functions other than a parabola that may describe a curved line; we chose a parabolic function because it is simple and achieves the goal of providing a reasonable fit to the data.
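The likelihood ratio comparisons in Table 1 reduce to chi-square tail probabilities. The sketch below (pure Python, standard library only) implements the upper-tail chi-square probability via the standard recurrence and reproduces the overall-fit p values quoted above for Models 1 through 3, along with the Model 1 relative risk exp(−0.329) ≈ 0.72:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2          # closed form for df = 2
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1  # closed form for df = 1
    while k + 2 <= df:                      # standard two-step recurrence
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

print(round(chi2_sf(3.69, 1), 3))   # Model 1 overall fit: ~0.055
print(round(chi2_sf(13.1, 2), 4))   # Model 2 overall fit: ~0.0014
print(chi2_sf(24.4, 3))             # Model 3 overall fit: on the order of 2e-05
print(round(math.exp(-0.329), 2))   # Model 1 relative risk for ATRA: ~0.72
```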
VIII. CUTPOINTS

A common practice in the medical literature is to dichotomize a numerical-valued variable Z, such as white count or platelet count, by replacing it with a binary indicator variable I_c = 0 if Z ≤ c and I_c = 1 if Z > c for some cutpoint c. The cutpoint may be determined in various ways. One common practice is to set c equal to the mean or median of Z. Another is to use the "optimal" cutpoint, which gives the smallest p value, among all possible cutpoints, for the test of β_c = 0 under the model with linear term β_c I_c. One consequence of this practice is that, in evaluating the statistical analyses of two published studies where different "optimal" cutpoints were used to define I_c and the studies concluded, respectively, that "Z had a significant effect on survival" and "Z did not have a significant effect on survival," it is impossible to determine whether the conflicting conclusions were due to a phenomenological difference between the two studies or were simply artifacts of random variation in the data manifested in application of the optimal cutpoint method. The use of a model containing I_c in place of Z is appropriate only when it describes the actual relationship between Z and the outcome. An illustration of this is provided by the effect of hemoglobin on DFS in the caspase data set, illustrated by the martingale residual plot given in Figure 5. This plot shows that for a cutpoint c located somewhere between 6 and 8, patients having baseline hemoglobin above c have a higher risk of relapse or death. That is, there is a "threshold effect" of hemoglobin on DFS. A search over values of c in this range yields the optimal cutpoint 7, and a fit of the Cox model with linear term β_HG I_HG, where I_HG = 1 if hemoglobin > 7 and 0 if hemoglobin ≤ 7, yields a p value of 0.004 for the test of β_HG = 0 under this model. A more appropriate test, which accounts for the fact that multiple preliminary tests were conducted to locate the optimal cutpoint (22), yields the corrected p value 0.032. Other methods for correcting p values to account for an optimal cutpoint search have been given by Altman et al. (23) and Faraggi and Simon (24).
Figure 5 Martingale residual scatterplot on hemoglobin, from the caspase data.
Unfortunately, in most cases where a cutpoint model is used, the relationship between Z and outcome simply is not of the form shown in Figure 5. In such cases, the practice of replacing a continuous variable Z with a binary indicator I_c is just plain wrong. Figure 6 is a martingale residual plot obtained by simulating an artificial covariate Z according to a standard normal distribution, so that Z is "white noise" and, as is apparent from Figure 6, has no relationship whatsoever to the actual patient outcome data. The optimal cutpoint c = 0.623, chosen because it gave the lowest p value among all possible values for c, produces a Cox model with linear component β_c I_c under which the uncorrected p value for the test of β_c = 0 is 0.018. This appears to show that the binary variable I_c = 1 if Z ≥ 0.623 and I_c = 0 if Z < 0.623 is a "significant" predictor. This anomaly, in which white noise produces an apparently significant predictor based on the optimal cutpoint, arises because many tests of hypotheses were conducted to determine the optimal cutpoint; hence, the final "significant" test is merely an artifact of this multiple testing procedure. The properly adjusted p value that accounts for this is 0.370, which correctly reflects the facts that the cutpoint is artificial and that Z is nothing more than noise. More fundamentally, because Figure 6 does not exhibit the sharp vertical rise that indicates an actual threshold effect, it is inappropriate to fit a cutpoint model for Z to these data.
Figure 6 Martingale residual scatterplot on white noise.
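The inflation caused by an optimal cutpoint search is easy to demonstrate by simulation. The sketch below repeats a simplified version of the white-noise experiment: the outcomes are held fixed, a pure-noise covariate is generated repeatedly, and the smallest two-sample p value over all candidate cutpoints is recorded each time. For simplicity, a normal-approximation test on uncensored outcomes stands in for the Cox-model test, and the sample size and number of replicates are arbitrary choices; the qualitative conclusion, that the minimum p value is "significant" far more often than 5% of the time under the null, is the point.

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p value for a difference in means (normal approximation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    zstat = abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(zstat / math.sqrt(2))

def min_p_over_cutpoints(z, y, min_group=10):
    """Smallest p value over all cutpoints leaving min_group per side."""
    best, zs = 1.0, sorted(z)
    for c in zs[min_group - 1:len(zs) - min_group]:
        lo = [y[i] for i in range(len(z)) if z[i] <= c]
        hi = [y[i] for i in range(len(z)) if z[i] > c]
        best = min(best, two_sample_p(lo, hi))
    return best

random.seed(1)
n, sims = 60, 200
y = [random.expovariate(1.0) for _ in range(n)]     # fixed "survival" outcomes
hits = 0
for _ in range(sims):
    z = [random.gauss(0.0, 1.0) for _ in range(n)]  # pure white-noise covariate
    if min_p_over_cutpoints(z, y) < 0.05:
        hits += 1
print("fraction of null runs with a 'significant' optimal cutpoint:", hits / sims)
```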
IX. PLATEAU EFFECT

It is well known that the presence of an antecedent hematologic disorder (AHD) is prognostically unfavorable in AML. We have previously considered an AHD "present" if a documented abnormality in blood exceeded 1 month in duration before diagnosis of AML (25). This reduces the duration Z_AHD of an AHD to the binary indicator variable I_AHD = 1 if Z_AHD ≥ 1 month and I_AHD = 0 if Z_AHD < 1 month. Others have used a cutpoint of 3 months rather than 1 month. A martingale residual plot on Z_AHD for the fludarabine data is given in Figure 7, with the lowess smooth given by the solid line and a parametric fit, described below, by the dashed line. Figure 7 indicates that the risk of relapse or death increases sharply for values of Z_AHD from 0 up to roughly 10 to 20 months and then stabilizes at a constant level for larger values. Due to the plateau in the smoothed curve, the relationship between Z_AHD and DFS is neither linear nor quadratic. There are many functions that describe this pattern. We used the parametric function min{β_1 log(Z_AHD + 0.5), β_2 arctan(Z_AHD)}, illustrated by the dashed line, and this provides a reasonable fit that agrees with the lowess smooth. The estimates 0.204 for β_1 and 0.442 for β_2 each have p values < 10⁻⁸. While I_AHD, the lowess
Figure 7 Martingale residual scatterplot on duration of antecedent hematological disorder, from the fludarabine data.
function, and the parametric function each are highly significant predictors of DFS, with p < 10⁻⁷ in each case, Figure 7 indicates that the cutpoint model, I_AHD, only approximates either the lowess or the parametric function.

X. PARABOLIC TIME-VARYING EFFECT
As noted earlier, the usual Cox model assumes that covariate effects are constant over time. For example, under the Cox model the hazard associated with an age of 60 years at diagnosis is the same at either 1 month or 5 years after treatment. This assumption frequently goes unverified, despite the fact that in some situations it might appear tenuous. For example, in chemotherapy of hematologic malignancies, the clinician might suspect that older patients are at greater risk of death during the first few weeks of therapy rather than later on. A fit of the ordinary Cox model with linear term β_AGE AGE to the fludarabine data yields an estimate of β_AGE equal to 0.0175 with p < 0.001. This indicates, for example, that at any time after the start of treatment a 60-year-old patient has twice the risk of death or relapse of a 20-year-old patient, since exp[(60 − 20) × 0.0175] = 2. Figure 8 is the smoothed GTS plot on age, including
Figure 8 Grambsch-Therneau-Schoenfeld residual scatterplot on age, from the fludarabine data.
a 95% confidence band for the graphical estimate of β_AGE(t). This plot and the associated test were produced with the "cox.zph" computer subroutine of Therneau (14) using the "identity" (untransformed) timescale. In Figure 8, as previously, the horizontal line at β_AGE = 0 corresponds to age having no effect. Dots above the horizontal line at β_AGE = 0 indicate an excess of deaths, whereas dots below the line indicate a deficit of deaths. Whereas the effect of age would be represented by a horizontal line at 0.0175 under the ordinary Cox model, the smoothed GTS plot indicates that β_AGE(t) may be a parabolic function of t. Thus, the GTS plot indicates that the proportional hazards assumption may not be appropriate. A Grambsch-Therneau test (5) of the hypothesis that the data are compatible with the proportional hazards assumption has p = 0.007, confirming the graphical results. In particular, the effect of age on DFS cannot be quantified adequately by the single estimate 0.0175 noted above. The GTS plot suggests that the effect of age on the risk of relapse or death may be described by the quadratic function (β_1 + β_2 t + β_3 t²) Z_AGE. The estimates of β_1, β_2, and β_3 under this extended Cox model are 0.0150, −3.62 × 10⁻⁴, and 2.45 × 10⁻⁶, with p < 10⁻⁶, p = 1.2 × 10⁻⁴, and p = 0.014, respectively, indicating that age really has a parabolic time-varying effect.
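As a quick check on what these estimates imply, the fitted quadratic can be evaluated directly. In the sketch below (the timescale is whatever units the original fit used, which this summary does not specify), the age effect is strongest at t = 0, shrinks to nearly zero at the vertex of the parabola, and grows again thereafter:

```python
# coefficient estimates quoted above for the extended Cox fit
b1, b2, b3 = 0.0150, -3.62e-4, 2.45e-6

def beta_age(t):
    """Quadratic time-varying coefficient: beta_AGE(t) = b1 + b2*t + b3*t^2."""
    return b1 + b2 * t + b3 * t * t

t_min = -b2 / (2 * b3)          # vertex: where the age effect is weakest
print(round(t_min, 1))          # roughly 74 time units after start of treatment
print(round(beta_age(0.0), 4))  # effect at t = 0: the quoted 0.015
print(round(beta_age(t_min), 4))  # nearly zero at the vertex
```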
XI. NONLINEAR TIME-VARYING EFFECT WITH PLATEAU

The GTS plot is especially valuable when a time-dependent covariate effect is not described easily by a parametric function. This is the case for pretreatment Zubrod performance status (PS) in the fludarabine data set. The ordinary Cox model fit with linear term β_PS Z_PS gives an estimate for β_PS of 0.417 with p < 0.001. The GTS plot based on PS, given in Figure 9, shows that the effect of poor PS decreases during the first 3 months and then reaches a small but nonzero plateau thereafter. The grouping of points into rows is characteristic of GTS plots for variables taking on a small number of values. Here, PS takes the possible values 0, 1, 2, 3, and 4, which correspond respectively to the five groups of points in the plot. The Grambsch-Therneau goodness-of-fit test has p < 10⁻⁷, indicating that the proportional hazards assumption is untenable.
XII. CONDITIONAL KAPLAN-MEIER PLOTS A useful method for assessing covariate effects on survival or DFS is to construct a set of conditional Kaplan-Meier (KM) survival plots. These conditional plots, or ‘‘coplots,’’ may be used to make inferences without resorting to any parametric model fit or conventional test of hypothesis, although in practice we have found it most useful to apply the graphical and model-based methods together. A general
Figure 9 Grambsch-Therneau-Schoenfeld residual scatterplot on performance status, from the fludarabine data.
discussion of conditional plots, in the context of analyzing trivariate data, is given by Cleveland (26). An application of coplots to the caspase data is given in Figure 10, which is reproduced from Estrov et al. (17). The purpose of this figure is to provide a visual representation of the joint effects of C_2 and C_3 on survival. The figure was constructed as follows. Each of the nine plots in Figure 10 is a usual KM plot, constructed from a particular subset of the data. Moving from left to right, the three columns correspond to the lowest third, middle third, and upper third of the C_2 values, which happen to be C_2 ≤ 0.69, 0.69 < C_2 ≤ 1.07, and 1.07 < C_2 for this data set. Similarly, going from bottom to top, the three rows correspond to the lowest, middle, and upper thirds of the C_3 sample values, which are C_3 ≤ 1.09, 1.09 < C_3 ≤ 1.57, and 1.57 < C_3 for these data. Thus, for example, the KM plot in the center of the top row is constructed from the data of the 21 patients having intermediate C_2 values (0.69 < C_2 ≤ 1.07) and high C_3 values (1.57 < C_3). Each plot is thus the usual KM estimate of the survival probability curve, but conditional on the patient having C_2 and C_3 values in the specified ranges. Thus, for example, moving from left to right along the bottom row shows how survival changes with increasing C_2 given that
Figure 10 Conditional KM plots for varying C_2 and C_3 values, based on three nonoverlapping intervals for each covariate.
C_3 ≤ 1.09. The most striking message conveyed by this matrix of KM coplots is that survival is very poor for patients having both high C_2 and high C_3. It is important to bear in mind that the particular numerical cutoffs of the subintervals here are specific to this data set, so that any patterns revealed by the plots may hold generally for a similar data set but the particular numerical values very likely will differ. This sort of interactive effect, manifested on a particular subdomain of the two-dimensional set of (C_2, C_3) values, would not be revealed by the conventional approach of fitting a Cox model with linear component including the terms β_2 C_2 + β_3 C_3 + β_23 C_2 C_3, since this parametric model assumes that the multiplicative interaction term β_23 C_2 C_3 is in effect over the entire domain of both covariates. The coplots clearly show that this is not the case. A slightly different way to construct this type of plot is to allow the adjacent subintervals of each covariate to overlap, in order to provide a smoother visual transition. Figure 11 is obtained by first defining subintervals in the domain of C_2, each of which contains half of the C_2 data, with the first interval running
Figure 11 Conditional KM plots for varying C_2 and C_3 values, using overlapping intervals for each covariate.
from the minimum to the 3/6th percentile, the second from the 1/6th to the 4/6th percentile, the third from the 2/6th to the 5/6th percentile, and the fourth from the 3/6th percentile to the maximum. Thus, adjacent intervals share one third of the C_2 data. Four subintervals of C_3 are defined similarly. The particular numerical values of the C_2 and C_3 interval end points are given along the bottom and left side of Figure 11. Another advantage of using intervals that overlap is that the sample size for each KM plot is larger than if the intervals were disjoint. Scanning each of the lower two rows from left to right shows that for lower values of C_3, survival improves with increasing C_2. This pattern changes slightly at the end of the third row, where survival seems to level off, and markedly in the top row, where survival drops as both C_2 and C_3 become large. The relatively poor survival shown in the upper right corner KM plot is notable in that this plot is based on 59 patients, comprising 27% of the sample values, whereas the
considerably more striking drop in the upper right corner plot of Figure 10 is based on a more extreme subsample of 31 patients.
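The construction of Figure 10 can be sketched in a few lines: compute tertile cutoffs for each covariate, assign each patient to one of the nine cells, and run an ordinary KM estimator within each cell. The following is a minimal illustration with synthetic data (hypothetical helper names; ties handled by grouping, no refinements for heavy censoring):

```python
import random

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (event time, S(t)) pairs."""
    data = sorted(zip(times, events))
    s, at_risk, out, i = 1.0, len(data), [], 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (tt, e) in data if tt == t]
        d = sum(tied)
        if d:                       # step down only at event times
            s *= 1.0 - d / at_risk
            out.append((t, s))
        at_risk -= len(tied)
        i += len(tied)
    return out

def tertile_cuts(values):
    v = sorted(values)
    return v[len(v) // 3], v[2 * len(v) // 3]

def km_coplot_cells(times, events, c2, c3):
    """One KM curve per (C2 tertile, C3 tertile) cell, as in Figure 10."""
    a2, b2 = tertile_cuts(c2)
    a3, b3 = tertile_cuts(c3)
    tert = lambda x, a, b: 0 if x <= a else (1 if x <= b else 2)
    cells = {}
    for i in range(len(times)):
        key = (tert(c2[i], a2, b2), tert(c3[i], a3, b3))
        cells.setdefault(key, ([], []))[0].append(times[i])
        cells[key][1].append(events[i])
    return {k: kaplan_meier(ts, es) for k, (ts, es) in cells.items()}

# synthetic illustration (not the caspase data)
rng = random.Random(0)
c2 = [rng.random() for _ in range(90)]
c3 = [rng.random() for _ in range(90)]
times = [rng.expovariate(0.5 + c2[i] * c3[i]) for i in range(90)]
curves = km_coplot_cells(times, [1] * 90, c2, c3)
print(len(curves), "cells")
```

The overlapping-interval version of Figure 11 would differ only in assigning each patient to every interval that contains its covariate value, rather than to a single cell.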
XIII. DISCUSSION

The importance of prognostic factor analyses in clinical research is widely acknowledged. In this chapter, we illustrate some general problems with these analyses as they are often conducted, and we describe some graphical methods and tests to address these problems. The aim of these methods is to determine the true relationship between one or more covariates and patient outcome in a particular data set. When a covariate is modeled incorrectly, which is the case if its effect on patient risk is not log linear as assumed under the usual Cox model, evaluation of the covariate's effect may be misleading. In particular, dichotomization of a numerical variable by use of a cutpoint, without first determining the actual form of the covariate's effect, typically leads to loss of information and in many cases is completely wrong. Use of the optimal cutpoint often leads to spurious inferences arising from nothing more than random variation in the data. Our examples illustrate that these problems may be addressed easily and effectively by the combined use of martingale residual plots, statistical tests, and transformation of covariates as appropriate. We also provide examples of covariates having effects that change over time, along with methods for revealing and formally evaluating such time-varying effects. The use of these methods helps prevent the flawed inferences that may be drawn with the typical approach of fitting a Cox or logistic regression model without performing any goodness-of-fit analyses. Finally, model-based regression analyses may be augmented, or even avoided entirely, by the use of conditional KM plots. An important caveat to keep in mind when interpreting the results of any regression analysis is that the model-fitting process, including graphical methods and tests based on intermediate models, is not accounted for by p values obtained using conventional methods based only on the final fitted model.
Formally, the p value of any such final test should be adjusted for the process that produced the model, because both the model-fitting process and the final tests are based on the same data set. For example, the adjusted p value computation required to test properly for an optimal cutpoint, noted in Section VIII, recognizes this problem; in fact, the problem applies more broadly to the entire model-fitting process. This consideration leads to the notions of cross-validation and bootstrapping, which we do not pursue here. A basic reference is Efron and Tibshirani (27). The practical point is that, due to random variation, a particular regression model fit to a given data set is not likely to provide as good a fit to another data set based on a similar experiment.
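As a pointer to what bootstrapping involves, the percentile bootstrap draws resamples of the data with replacement, recomputes the statistic of interest on each, and reads off empirical quantiles of the replicates. A minimal sketch with arbitrary illustrative numbers:

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return reps[int(n_boot * alpha / 2)], reps[int(n_boot * (1 - alpha / 2))]

data = [4.2, 1.1, 3.5, 2.8, 5.0, 0.9, 3.3, 2.2, 4.8, 1.7]  # arbitrary numbers
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean)
print("point estimate:", mean(data), " 95% CI:", (round(lo, 2), round(hi, 2)))
```

The same resampling scheme, applied to an entire model-fitting process rather than a single statistic, is what allows the optimism of a data-driven model to be estimated.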
Like medical research, statistical research is constantly evolving. Powerful new techniques for modeling and analyzing data are becoming available at an ever-increasing rate. Difficulties in implementing graphical methods, such as those described here, have decreased dramatically due to the widespread availability of high-speed computing platforms and flexible statistical software packages. These methods are of value to medical researchers for at least three reasons. First, they suggest new directions for medical research. Why, for example, is a platelet count above 200,000 associated with inferior DFS in patients with AML or MDS (Fig. 3)? Certainly, this finding may be illusory, but the p value of 0.0001 for the quadratic term in Model 3 of Table 1 suggests otherwise. Is the sharp drop in survival for high levels of both C_2 and C_3 (Figs. 10 and 11) due to a real biological phenomenon, or is it merely an artifact of random variation? Second, graphical methods provide a powerful means to determine whether and how the Cox model assumptions are violated, they lead quite easily to corrected models, and they are perhaps the best method available for communicating the results of a regression analysis to nonstatistical colleagues. Finally, because the methods often provide a greatly improved fit of the statistical model to the data, they in turn provide more reliable inferences regarding covariate and treatment effects on patient outcome.
ACKNOWLEDGMENT We are grateful to Terry Therneau for his thoughtful comments on an earlier draft of this manuscript.
REFERENCES

1. Cox DR. Regression models and life tables (with discussion). J Royal Stat Soc B 1972; 34:187–220.
2. Li Z, Begg CB. Random effects models for combining results from controlled and uncontrolled studies in meta-analysis. J Am Stat Assoc 1994; 89:1523–1527.
3. Stangl DK. Modelling and decision making using Bayesian hierarchical models. Stat Med 1995; 14:2173–2190.
4. Therneau TM, Grambsch PM, Fleming TR. Martingale-based residuals for survival models. Biometrika 1990; 77:147–160.
5. Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 1994; 81:515–526.
6. Grambsch PM. Goodness-of-fit diagnostics for proportional hazards regression models. In: Thall PF, ed. Recent Advances in Clinical Trial Design and Analysis. Boston: Kluwer, 1995, pp. 95–112.
7. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. New York: Wiley, 1991.
8. Crowley JJ, Hu M. Covariance analysis of heart transplant survival data. J Am Stat Assoc 1977; 72:27–36.
9. Crowley JJ, Storer BE. Comment on "A reanalysis of the Stanford heart transplant data" by M. Aitkin, N. Laird and B. Francis. J Am Stat Assoc 1983; 78:277–281.
10. Kay R. Proportional hazards regression models and the analysis of censored survival data. Applied Statistics 1977; 26:227–237.
11. Schoenfeld D. Chi-squared goodness-of-fit tests for the proportional hazards regression model. Biometrika 1980; 67:145–153.
12. Cain KC, Lange NT. Approximate case influence for the proportional hazards regression model with censored data. Biometrics 1984; 40:493–499.
13. Harrell FE. The PHGLM procedure. SAS Supplemental User's Guide, Version 5. Cary, NC: SAS Institute, Inc., 1986.
14. Therneau TM. A Package for Survival in S. Mayo Foundation, 1995.
15. Harrell FE. Predicting Outcomes: Applied Survival Analysis and Logistic Regression. Charlottesville: University of Virginia, 1997.
16. Estey EH, Thall PF, Beran M, Kantarjian H, Pierce S, Keating M. Effect of diagnosis (RAEB, RAEB-t, or AML) on outcome of AML-type chemotherapy. Blood 1997; 90:2969–2977.
17. Estrov Z, Thall PF, Talpaz M, Estey EH, Kantarjian HM, Andreeff M, Harris D, Van Q, Walterscheid M, Kornblau S. Caspase 2 and caspase 3 protein levels as predictors of survival in acute myelogenous leukemia. Blood 1998; 92:3090–3097.
18. Estey EH, Thall PF, Pierce S. Randomized phase II study of fludarabine + cytosine arabinoside + idarubicin ± all-trans retinoic acid ± granulocyte colony-stimulating factor in poor prognosis newly diagnosed acute myeloid leukemia and myelodysplastic syndrome. Blood 1999; 93:2478–2484.
19. Gentleman R, Crowley J. Local full likelihood estimation for the proportional hazards model. Biometrics 1991; 47:1283–1296.
20. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 1979; 74:829–836.
21. Becker RA, Chambers JM, Wilks AR. The New S Language. Pacific Grove, CA: Wadsworth, 1988.
22. Hilsenbeck SG. Practical p-value adjustments for optimally selected cutpoints. Stat Med 1996; 15:103–112.
23. Altman DG, Lausen B, Sauerbrei W, Schumacher M. The dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst 1994; 86:829–835.
24. Faraggi D, Simon R. A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis. Stat Med 1996; 15:2203–2214.
25. Estey EH, Thall PF, et al. Use of G-CSF before, during and after fludarabine + ara-C induction therapy of newly diagnosed AML or MDS: comparison with fludarabine + ara-C without G-CSF. J Clin Oncol 1994; 12:671–678.
26. Cleveland WS. Visualizing Data. Summit, NJ: Hobart Press, 1993.
27. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
21
Graphical Approaches to Exploring the Effects of Prognostic Factors on Survival

Peter D. Sasieni and Angela Winnett*
Imperial Cancer Research Fund, London, England
I. INTRODUCTION
In this chapter we are interested in exploratory data analysis rather than precise inference from a randomized controlled clinical trial. Although the quality of data collection and follow-up is important, there is no need to have a randomized trial to study prognostic factors; large series of patients receiving standard therapy are important sources of information. Recent interest in molecular and genetic markers creates additional problems for the data analyst, but these are mostly concerned with multiple testing and test reliability and will not be discussed in this chapter. We are concerned with methods appropriate for analyzing a small number of prognostic factors. The methods described in this chapter are illustrated using data from the Medical Research Council's fourth and fifth Myelomatosis Trials (1). A total of
*Current affiliation: Imperial College School of Medicine, London, England.
Table 1  Summary of Prognostic Variables

Continuous variables
Variable                  Units       Min.   1st quartile   Median   3rd quartile   Max.
Age                       10 years    3      5.7            6.3      6.9            8.1
log2(sβ2m)                log(mg/l)   −1.7   2.1            2.7      3.4            6.3
log2(serum creatinine)    log(mM)     5.1    6.5            6.9      7.5            11.0

Indicator variables
Variable                    Description                                               Freq. 1   Freq. 0
ABCM                        1 for trial 5 with ABCM, 0 otherwise                      277       736
Cuzick index int. or poor   1 for Cuzick index "intermediate" or "poor," 0 otherwise  758       255
Cuzick index poor           1 for Cuzick index "poor," 0 otherwise                    191       822
Sasieni and Winnett
Prognostic Factors and Survival
1013 patients are included, of whom 821 had died by the time the data set was compiled and 192 were censored. Survival times range from 1 day to over 7 years. The patients in the fourth trial received treatment of either intermittent courses of melphalan and prednisone (MP) or MP with vincristine given on the first day of each course. In the fifth trial patients received either intermittent oral melphalan (M7) or adriamycin, 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU), cyclophosphamide, and melphalan (ABCM). A number of prognostic variables were recorded: age, serum β2 microglobulin (sβ2m), and serum creatinine. Prognostic groups were also defined according to the Cuzick index, which is based on blood urea concentration, hemoglobin, and clinical performance status (2). No significant differences were found between survival of patients given the different treatments in trial 4 (3) or between survival of patients in trial 4 and patients given M7 in trial 5, so these three groups were pooled. Logarithms (to base 2) were used for sβ2m and serum creatinine, which are otherwise very skewed. Age in years was divided by 10 so that the interquartile range of each continuous variable was close to 1 (between 1 and 1.3). A summary of the variables is given in Table 1.
II. LOGISTIC REGRESSION

Clinically one may be interested in short-term (1 year), medium-term (5 year), and long-term (10 year) survival. In the absence of censoring, one could use three logistic models to examine the effect of various potentially prognostic factors on each end point. An advantage of this approach is the ease with which the results can be presented and interpreted. Not only can the importance of individual factors be evaluated in a multivariate model, but a prognostic score can be developed to quantify the probability that an individual with a given profile will survive to each of the three time points. A disadvantage of fitting three logistic models is that, although any patient who survives 5 years must have survived 1 year, the models are not linked, and estimation of the conditional probability of survival to 5 years given that the patient is alive at 1 year is not straightforward. Instead, one might consider a single model for the ordered multinomial end points: "early death" (<1 year), "medium death" (1–5 years), "late death" (5–10 years), or "long-term survivor" (>10 years). Although such models are attractive, it is rare that one has uncensored long-term follow-up, so they have their limitations, and survival analysis models are required. However, short-term survival is often uncensored, and logistic regression is a useful and underused technique for exploring the role of prognostic factors in such situations.

Example

None of the survival times in the myeloma data were censored at less than 2 years, whereas 454 of the deaths occurred in the first 2 years and 199 in
Table 2  Multivariate Logistic Regression Using Three End Points Compared with Longer Survival

                                               Death within
                               0–6 months          0–2 years           6 months–2 years
Covariate                      O.R.  95% CI        O.R.  95% CI        O.R.  95% CI
Age (per 10 years)             1.38  (1.11–1.71)   1.19  (1.02–1.39)   1.08  (0.90–1.29)
log2(sβ2m)                     1.62  (1.31–2.00)   1.68  (1.41–2.00)   1.56  (1.28–1.90)
log2(serum creatinine)         1.20  (0.94–1.54)   1.06  (0.85–1.33)   0.94  (0.72–1.23)
ABCM                           0.74  (0.50–1.09)   0.62  (0.46–0.84)   0.63  (0.44–0.89)
Cuzick index int. or poor
  (vs. good)                   2.27  (1.30–3.99)   1.62  (1.15–2.28)   1.37  (0.94–2.00)
Cuzick index poor
  (vs. int. or good)           1.46  (0.95–2.23)   1.11  (0.74–1.65)   0.90  (0.55–1.45)
Deviance                       860                 1255                965
Null deviance                  1004                1393                1012
Prognostic Factors and Survival
437
the first 6 months. Therefore, logistic models were used to study the effects of the prognostic variables on the probability of death in the first 6 months and on the probability of death in the first 2 years. The results of the logistic regressions are in Table 2. Confidence intervals were calculated based on ⫾1.96 ⫻ standard error of the coefficients. From the logistic regression results it can be seen that sβ2m is a strongly prognostic factor for survival both to 6 months and to 2 years. The treatment ABCM can be seen to improve survival to 2 years; although its effect on 6-month survival is still beneficial, the effect is smaller in magnitude and not statistically significant. The Cuzick prognostic index ‘‘good’’ does indeed indicate improved survival, at least as far as 6 months, whereas the difference between the groups ‘‘poor’’ and ‘‘intermediate’’ is not statistically significant. The odds ratios for age and the Cuzick prognostic index are closer to one for survival to 2 years than for survival to 6 months, but these models by themselves do not indicate whether there is any association with survival to 2 years within those patients who survived to 6 months. This was investigated in a further logistic model for survival up to 2 years, including only those patients who were still alive after 6 months. The results are also in Table 2. It is seen that neither age nor the Cuzick index have statistically significant association with survival up to 2 years conditional on survival up to 6 months, whereas sβ2m and ABCM treatment are associated with differences in survival from 6 months to 2 years and survival up to 6 months. The effect of serum creatinine is not statistically significant; this is in contrast to what is seen if only serum creatinine is included in the model (Table 3). 
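As a minimal illustration of the computations behind Table 2 (a sketch, not the authors' code), a single-covariate logistic model can be fitted by Newton-Raphson and the odds ratio reported with a ±1.96 × SE interval formed on the log-odds scale. The toy data below are invented.

```python
import math

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of P(death) = 1/(1 + exp(-(b0 + b1*x))).
    Returns (b0, b1, se_b1), with se_b1 from the inverse Fisher information."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0    # Fisher information entries
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: I^{-1} * score
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, se_b1

# invented covariate values (think log2 sbeta2m) and death-by-2-years indicators
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
b0, b1, se = fit_logistic(x, y)
odds_ratio = math.exp(b1)
ci = (math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))
```

Exponentiating the interval endpoints gives the (O.R., 95% CI) pairs reported in Table 2; in practice one would use multivariate software rather than this hand-rolled two-parameter fit.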
There is strong correlation between serum creatinine and sβ2m, so that the strong prognostic value of serum creatinine by itself can be largely accounted for by the confounding effect of sβ2m, whereas the prognostic value of sβ2m is highly statistically significant even when serum creatinine has been included in the model.

Table 3  Logistic Regression Models for Serum Creatinine and sβ2m Only

                               Death within 6 mo,            Death within 6 mo,
                               serum creatinine only         serum creatinine and sβ2m
Covariate                      O.R.   95% CI                 O.R.   95% CI
log2(serum creatinine)         2.14   (1.82–2.52)            1.33   (1.05–1.68)
log2(sβ2m)                     —                             1.75   (1.42–2.16)
Deviance                        916                           886

This can also be seen from Table 4, where the two variables have been divided into five categories each with roughly equal numbers of individuals. The correlation between the two variables can be seen from the high frequencies around the diagonal of the table. The prognostic value of each variable by itself can be seen from the proportions dead by 6 months in the row and column total cells, which increase as the value of each variable increases. The prognostic value of sβ2m after adjusting for serum creatinine can be seen from the increasing proportions dead in each row (i.e., as sβ2m increases within each category of serum creatinine). The lack of association between serum creatinine and survival after adjusting for sβ2m can be seen from the proportions dead, which do not increase steadily in each (internal) column of Table 4. This contrasts with the strong increasing trend seen in the column of marginal totals.

Table 4  Number of Individuals in Categories Defined by Serum Creatinine and sβ2m, with Percentage Dying in the First 6 Months in Each Category

                                             log2(sβ2m)
log2(serum creatinine)   ≤1.89     1.89–2.44   2.44–2.93   2.93–3.73   >3.73       Total
≤6.46                     97  5%    59  7%      33 18%      16 25%       7 29%     212 10%
6.46–6.75                 56  9%    62 11%      41 12%      33 21%       5 20%     197 13%
6.75–7.04                 33  3%    57  7%      51 22%      42 26%      17 18%     200 15%
7.04–7.67                 20 20%    25  4%      56 14%      70 26%      31 35%     202 21%
>7.67                      1  0%     1  0%      16 19%      42 29%     142 46%     202 40%
Total                    207  7%   204  8%     197 17%     203 26%     202 41%    1013 20%
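A cross-tabulation like Table 4 is straightforward to build. The helper below is a sketch: the function name and the six-patient data set are invented, but the cutpoints are those used in Table 4.

```python
def crosstab_mortality(u, v, dead, u_cuts, v_cuts):
    """Cross-classify two covariates at given cutpoints; return per-cell
    counts and death counts, as in Table 4 (percentage = deaths/count)."""
    def bin_of(val, cuts):
        for i, c in enumerate(cuts):
            if val <= c:
                return i
        return len(cuts)

    nu, nv = len(u_cuts) + 1, len(v_cuts) + 1
    n = [[0] * nv for _ in range(nu)]   # cell counts
    d = [[0] * nv for _ in range(nu)]   # cell deaths
    for ui, vi, di in zip(u, v, dead):
        r, c = bin_of(ui, u_cuts), bin_of(vi, v_cuts)
        n[r][c] += 1
        d[r][c] += di
    return n, d

# hypothetical mini data set: (log2 creatinine, log2 sbeta2m, death by 6 mo)
creat = [6.2, 6.6, 7.0, 7.9, 8.1, 6.4]
sb2m = [1.5, 2.1, 2.7, 4.0, 3.9, 1.2]
dead6 = [0, 0, 1, 1, 1, 0]
n, d = crosstab_mortality(creat, sb2m, dead6,
                          u_cuts=[6.46, 6.75, 7.04, 7.67],
                          v_cuts=[1.89, 2.44, 2.93, 3.73])
```

Row and column sums of `n` and `d` give the marginal totals and marginal death proportions shown in Table 4.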
III. PROPORTIONAL HAZARDS

Hazard-based models are naturally adapted for use with (right) censored data and have therefore become the standard approach for survival analysis. In particular, the proportional hazards regression model introduced by Cox (4) has become ubiquitous in medical journals. The (conditional) hazard (of death) at time t for an individual with covariates Z is defined by

λ(t | Z) = lim_{h↓0} P(T ∈ [t, t + h] | T ≥ t, Z)/h    [1]
It is the death rate at time t among those who are alive (and uncensored) just prior to time t. Constant hazards correspond to exponential random variables and are conveniently described in terms of one death every so many person-years. Constant hazards are, however, rarely observed in clinical studies. The usual form of the proportional hazards model is

λ(t | Z) = λ_0(t) exp(βᵀZ)    [2]
in which λ_0(t) is an unspecified baseline hazard function (corresponding to individuals with Z = 0) and β is a vector of parameters. This model forms a good starting point for analysis of prognostic factors for censored survival data. However, it makes quite strong assumptions about the form of the effects, and it is always important to check the appropriateness of the model. Even without questioning the form of the model, one has to apply a sensible model-building strategy for selecting important prognostic factors from a pool of potentially relevant covariates. We assume here that the goal is not simply prediction (in which case one may prefer a ridge or shrinkage approach over covariate selection (5)) but that the chosen model should be biologically plausible. In many situations several of the covariates will be correlated, and it is a good idea to include certain basic covariates (factors known to be of prognostic value from previous studies) in any model. After that one may wish to consider a forward or backward stepwise procedure to select a model. It is certainly useful to document the model selection procedures employed and, where possible, to validate the final model on a separate data set.

Example

Cox regression was used to estimate the effects of prognostic variables on survival in the myeloma study; the results are in Table 5.

Table 5  Cox Proportional Hazards Model

Covariate                               Hazard ratio    95% CI
Age (per 10 years)                      1.12            (1.03–1.21)
log2(sβ2m)                              1.26            (1.16–1.37)
log2(serum creatinine)                  1.06            (0.95–1.19)
ABCM                                    0.76            (0.65–0.89)
Cuzick index int. or poor (vs. good)    1.38            (1.15–1.65)
Cuzick index poor (vs. int. or good)    1.11            (0.91–1.36)

−2 × log partial likelihood (fitted model): 10,093
−2 × log partial likelihood (null model): 10,236

Here, as in the logistic models of Table 2, higher values of age and sβ2m are associated with worse prognosis. As in the logistic models, ABCM treatment and a "good" Cuzick prognostic index are associated with a reduced hazard, but the difference between Cuzick prognostic index intermediate and poor is not statistically significant. Notice that the hazard ratios are generally closer to one than the corresponding odds ratios from the logistic models in Table 2, but this is to be expected from the relationship between odds ratios and hazard ratios.

Prognostic value of the prognostic index

After fitting the Cox model one has a function β̂ᵀZ that defines the effect of covariates on the baseline hazard. This is not itself particularly useful clinically but should be combined with the estimated baseline hazard to obtain estimates of the effects of the covariates on survival. These are most conveniently described in terms of the effect on the median (or some other quantile of) survival or on survival to some fixed time. Alternatively, one can simply use the prognostic index to divide the study population into subgroups and estimate the survival of each group using standard Kaplan-Meier techniques. An advantage of this latter approach is that the proportional hazards model is only used to divide the population into subgroups with different prognoses: the actual survival of each subgroup is then estimated nonparametrically. The disadvantages are the potential bias and loss of information associated with discretizing a continuous prognostic factor and the loss of power resulting from abandonment of the model.

Example

From Table 5, higher sβ2m is strongly associated with an increased hazard, but it is not clear what the clinical significance of this effect would be. Figure 1 shows estimates from the fitted Cox model of the survival function for patients with log2(sβ2m) equal to 1.54 and 4.52 (the 10% and 90% quantiles). The first part of the figure is based on a patient with values of the other covariates corresponding to a relatively poor prognosis, whereas the second part of the figure is based on a patient with values of the other covariates corresponding to a relatively good prognosis.

Figure 1  Survival functions estimated from the Cox model, with 95% confidence intervals, for log2(sβ2m) equal to 1.54 and 4.52 and (a) age 69, log2(serum creatinine) = 7.5, no ABCM treatment, Cuzick index "poor"; (b) age 57, log2(serum creatinine) = 6.5, ABCM treatment, Cuzick index "good."

The prognostic index β̂ᵀZ from the fitted Cox model ranges from 0.45 to 3.22. By partitioning the prognostic index, the sample was divided into 10 groups with equal numbers of individuals in each group, and Kaplan-Meier estimates of survival in each group were calculated. The covariate values of Figure 1a correspond to β̂ᵀZ = 1.98 and β̂ᵀZ = 2.66, which fall in the 6th and 10th of the 10 groups, and the covariate values of Figure 1b correspond to β̂ᵀZ = 1.08 and β̂ᵀZ = 1.77, which fall in the 1st and 4th of the 10 groups. Figure 2 shows the Kaplan-Meier estimates for these groups.

Figure 2  Kaplan-Meier estimates with 95% confidence intervals, for individuals with prognostic index from the Cox model (a) β̂ᵀZ in (1.91–2.03) and (2.52–3.22) and (b) β̂ᵀZ in (0.45–1.36) and (1.68–1.80).

The two methods produce survival estimates with quite different shapes, since Figure 1 is based on the assumption of proportional hazards and the specific form of the Cox model, whereas in Figure 2 the survival functions are estimated nonparametrically but based on grouping large numbers of patients together. For comparison, 1-year survival probabilities for the four groups in Figure 1 are 51%, 71%, 76%, and 87%, and in Figure 2 they are 33%, 71%, 82%, and 94%. The corresponding 5-year survival rates are 2%, 16%, 22%, and 54% compared with 11%, 9%, 20%, and 42%. Thus, deviations from the model fit are greatest at the extremes of the prognostic index range.
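The divide-and-estimate approach just illustrated can be sketched in a few lines of pure Python. This is an illustration with invented data, not the study analysis, and confidence intervals are omitted.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate; events[i] = 1 for a death, 0 for censoring.
    Returns the survival curve as (event time, S(t)) pairs."""
    s, curve = 1.0, []
    for t in sorted(set(ti for ti, ei in zip(times, events) if ei == 1)):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        s *= 1.0 - deaths / at_risk
        curve.append((t, s))
    return curve

def km_by_index(index, times, events, n_groups):
    """Order patients by prognostic index, split into equal-sized groups,
    and estimate survival nonparametrically within each group."""
    order = sorted(range(len(index)), key=lambda i: index[i])
    bounds = [len(order) * g // n_groups for g in range(n_groups + 1)]
    return [kaplan_meier([times[i] for i in order[a:b]],
                         [events[i] for i in order[a:b]])
            for a, b in zip(bounds, bounds[1:])]

curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Because the model only supplies the ordering of the index, any monotone transformation of β̂ᵀZ produces the same groups and hence the same Kaplan-Meier curves.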
IV. TRANSFORMATION OF COVARIATES

Both the logistic model and the Cox model (2) impose a particular form on each continuous covariate. As with all regression models one should consider the possibility of transforming covariates before entering them in the model. Many serum markers have positively skewed distributions and are traditionally log-transformed. In other situations one may need to consider whether an extreme value has undue influence on the parameter estimates or whether there is a (biologically plausible) nonmonotone covariate effect. A simple exploratory analysis of the effect of a single continuous covariate on survival can be done using smooth estimates of quantiles of the conditional survival function (6,7). The conditional survival function for covariate value z is estimated using the Kaplan-Meier estimator with individual i weighted according to the distance between z and Z_i. Usually there will be more than one covariate, and the following sections describe methods for exploring the relationship between continuous covariates and survival within a logistic model or a Cox model.
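The covariate-weighted Kaplan-Meier estimate mentioned above can be sketched as follows. The Gaussian kernel and all names here are illustrative choices, not the specific method of references (6,7).

```python
import math

def weighted_km(times, events, weights):
    """Kaplan-Meier with per-subject weights: weighted deaths over the
    weighted number at risk at each death time."""
    s, curve = 1.0, []
    for t in sorted(set(ti for ti, ei in zip(times, events) if ei == 1)):
        at_risk = sum(w for ti, w in zip(times, weights) if ti >= t)
        deaths = sum(w for ti, ei, w in zip(times, events, weights)
                     if ti == t and ei == 1)
        if at_risk > 0:
            s *= 1.0 - deaths / at_risk
            curve.append((t, s))
    return curve

def conditional_survival(times, events, covariate, z, bandwidth):
    """Estimate S(t | Z near z) by downweighting subjects whose covariate
    is far from z (Gaussian kernel; an assumption for this sketch)."""
    w = [math.exp(-0.5 * ((zi - z) / bandwidth) ** 2) for zi in covariate]
    return weighted_km(times, events, w)
```

Plotting a quantile (say the median) of these conditional survival estimates against z gives a model-free picture of the covariate effect before any transformation is chosen.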
A. Transformation of Covariates in the Logistic Model

For the logistic model as in Section II, the form of the covariate effects can be investigated graphically simply using scatter plot smoothers of the response (i.e., death) against each covariate. Since the logistic model assumes that the logit transformation of the probability of death given covariate vector Z is linear in each component of Z, it may be more useful to plot the logit transformation of a smooth of the response against a covariate.

Example

Figure 3 shows the logit transformation of a smooth of the indicator of death up to 2 years against the covariate values for sβ2m and serum creatinine. The smoother used is a cubic smoothing spline with 7 degrees of freedom, and the shaded histograms on the plots indicate the distribution of the covariates.
Figure 3 Logit transformation of smoothed indicator of death up to 2 years against prognostic variables.
Notice that the probability of death within 2 years of diagnosis is very high (90%) for those with very high levels of log2(sβ2m) (>5) and very low (20%) for those with extremely low values (<0). Despite this increasing trend, the relationship may not be monotone; there is certainly no evidence of increasing risk associated with log2(sβ2m) values of between 1.0 and 2.0. By contrast, the probability of death by 2 years does seem to be a monotone function of serum creatinine concentration (except possibly at the lowest few percent of concentrations). The spread of risk is, however, less than for sβ2m. The strong relationship between dying and serum creatinine is interesting in that it largely disappears after adjusting for sβ2m.

B. Transformation of Covariates in the Cox Model

The Cox model is hazard based and allows for censoring, so there is no simple response that can be plotted to investigate the form of the covariate effects in it, but other graphical methods have been developed. The simplest approach to investigating covariate transformations is to partition the covariate of interest to create about five "dummy variables." Cutpoints should be chosen so that there are roughly equal numbers of observations in each group wherever possible, but standard cutpoints may be preferred. A plot of the estimated parameters against the mean value of the observations in the interval is used to examine the appropriateness of the linear fit associated with the basic model. Discretizing a covariate and estimating a separate parameter for each interval is equivalent to fitting a piecewise constant function. With a continuous covariate, this is a very crude approximation to the logarithm of the hazard ratio, which may vary as a smooth function of the basic covariate. Consider the additive Cox model

λ(t | Z) = λ_0(t) exp{ Σ_{j=1}^p s_j(Z_j) }    [3]
in which the hazard ratio associated with the jth covariate is equal to the function exp{s_j(Z_j)} instead of simply exp(β_j Z_j). Techniques exist for estimating the s_j directly using local estimation (8–10), regression splines (11–13), or penalized partial likelihood (14,15). In this chapter we are more interested in diagnostic plots to investigate whether the chosen form exp(β_j Z_j) is reasonable. Methods based on residuals yield one-step approximations toward the underlying s_j and have the advantage of being easy to use and easy to apply with any software that can do Cox regression and smoothing. A very crude estimate of the s_j can be obtained by applying a scatterplot smoother to a plot of the so-called martingale residuals against Z_j (16). These are defined as
M̂_i = δ_i − exp(β̂ᵀZ_i) Λ̂_0(T_i)

for individual i with covariate vector Z_i and survival time T_i, where exp(β̂ᵀZ_i)Λ̂_0(T_i) is an estimate of the cumulative hazard for individual i at T_i,

Λ̂_0(T_i) = Σ_{j: T_j ≤ T_i} δ_j / Σ_{k: T_k ≥ T_j} exp(β̂ᵀZ_k)

and δ_i = 1 if the observation on individual i is a death, δ_i = 0 if it is censored. These residuals can then be smoothed against each component of the covariate vector; M̂_i is smoothed against Z_ij to estimate the form of s_j. Earlier approaches (17,18) included plotting the terms Ê_i = exp(β̂ᵀZ_i)Λ̂_0(T_i) = δ_i − M̂_i against Z_ij; martingale residuals M̂_i are an improvement as the terms δ_i provide an adjustment for censoring.

The resulting estimates of the s_j's are not the best available diagnostics; a better diagnostic plot can be obtained by adjusting each martingale residual M̂_i by Ê_i and plotting

smooth(M̂_i/Ê_i) against Z_ij, with smoothing weights Ê_i    [4]

We call M̂_i/Ê_i the adjusted martingale residual. Alternatively, a diagnostic plot can be obtained by smoothing both δ_i and Ê_i against the covariate values and plotting the logarithm of the ratio of the two smooths,

log{ smooth(δ_i against Z_ij) / smooth(Ê_i against Z_ij) }    [5]

(19). Motivation for these plots comes from the fact that under the additive Cox model (3), E(δ_i | Z_i) = E[Λ_0(T_i) exp{Σ_j s_j(Z_ij)} | Z_i], whereas E(Ê_i | Z_i) = E{Λ̂_0(T_i) exp(Σ_j β̂_j Z_ij) | Z_i}. Thus, an estimate of E(δ_i | Z_ij)/E(Ê_i | Z_ij) approximately estimates the factor exp{s_j(Z_ij) − β̂_j Z_ij}, which leads to Eq. [5] as an estimate of s_j(Z_ij) − β̂_j Z_ij. The two methods [4] and [5] are similar, as log{E(δ_i | Z_ij)/E(Ê_i | Z_ij)} is approximately equal to E(M̂_i | Z_ij)/E(Ê_i | Z_ij) by the approximation log(1 + x) ≈ x for small x.

Smoothing is needed in any plot of martingale residuals since they are generally very skewed and nearly uncorrelated; plots of martingale residuals or adjusted martingale residuals themselves are not usually helpful. Note that smooths should be mean based since they are estimating expected values; a robust smoother is likely to be biased because of the skewness of the residuals.

Some statisticians use martingale residuals before entering a new covariate into the Cox model. Since residual methods yield one-step approximations toward the underlying s_j, it is always advisable to start with at least a linear approximation. If martingale residuals are based on a model including the covariate Z_j, with coefficient β̂_j, the function s_j(Z_j) is estimated by the residual estimate [4] or [5] added to the linear term β̂_j Z_j. On the other hand, to determine whether the function s_j deviates from linear, it may be more useful to simply plot the residual estimate.

Confidence intervals

Confidence intervals are always important when looking at any estimate; they are particularly important in the case of smoothed residual plots, since smoothers can make a plot appear to have some nonlinear structure even from random data with no underlying structure. Estimating confidence intervals for smoothed estimates presents various problems, such as bias correction, multiple testing, and determining the shape of an estimate as opposed to its value (see, for example, Hastie and Tibshirani, Sect. 3.8 (20)). Pointwise confidence intervals are relatively simple to estimate, at least if a linear smoother is used, and are certainly useful, although care should be taken in interpreting them.

The variance of the adjusted residual M̂_i/Ê_i can be estimated by 1/Ê_i, but there is an additional problem due to correlation between adjusted residuals for different individuals and the variance due to adding the linear estimate from the Cox model. However, the variance of the linear estimate, and the covariance of M̂_i/Ê_i and M̂_i′/Ê_i′ for i ≠ i′, are usually small compared with the variance of M̂_i/Ê_i for each individual, so approximate confidence intervals can be found by estimating the variance of the vector of adjusted residuals M̂_i/Ê_i by the diagonal matrix with diagonal elements equal to 1/Ê_i.
If the weighted smooth against the kth covariate is represented by the linear smoothing matrix L_k, the variance of the smooth estimate [4] can be estimated by the diagonal matrix with diagonal elements equal to 1/Ê_i, premultiplied by L_k and postmultiplied by L_kᵀ.

Example

Figure 4 shows the weighted smooth of the adjusted martingale residuals [4] with the linear term added, for the continuous covariates sβ2m, age, and serum creatinine in the Cox model of Table 5. The smoother is a smoothing spline with 7 degrees of freedom, and the shaded histograms on the plots indicate the distribution of the covariates. The additive Cox model [3] only makes sense if there is some constraint on the functions s_j; for example, if λ_0 is taken to be the hazard function for an individual with Z = 0, then s_j(0) = 0 for each covariate j. In Figure 4, λ_0 corresponds to the minimum observed value of each covariate.

Figure 4  Estimate of covariate transformation using smoothed adjusted martingale residuals, with approximate 95% confidence intervals.

As a result of Figure 4, the linear terms for log2(sβ2m), log2(serum creatinine), and age in the Cox model were replaced by continuous piecewise linear functions. The variable log2(sβ2m) was split into four variables, for values up to 2, between 2 and 3, between 3 and 5, and greater than 5. Similarly, log2(serum creatinine) was split into two variables for values less than and greater than 10, and age was split into two at 55 years. These cutpoints are shown as vertical dashed lines on Figure 4. A new variable was defined equal to age when age is less than 55 and equal to 55 otherwise, and a second variable was defined equal to age when age is greater than 55 and equal to 55 otherwise; similarly for the sβ2m and serum creatinine variables. The resulting estimates are shown in Table 6.

The model in Table 6 does not give a particularly good estimate of the shape of the functions s_j(z_j); Figure 4 itself is more appropriate for that. It does, however, give a better idea of the strength of the effects and the standard errors of the estimates. Thus, according to this model, both very young and older patients have an increased hazard. An increase in sβ2m is associated with an increased hazard both for very high values and for more typical values, but extremely high values of sβ2m seem to be associated with an additional increase in the hazard, whereas very low values may not indicate a correspondingly lower hazard. In general, the hazard seems to depend on the level of sβ2m in a fairly complicated way. After adjusting for sβ2m, serum creatinine has no statistically significant association with increased hazard, either in general or for extreme values. The value of minus twice the log partial likelihood for this model is 48 less than that for the model in Table 5, for the addition of five extra variables, so by a partial likelihood ratio test the new model is certainly a better fit, even allowing for the data-driven choice of cutpoints.
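The variable splitting just described, equal to the covariate on one side of the cutpoint and pinned to the cutpoint on the other, is simply the pair (min(x, c), max(x, c)). A minimal sketch, with invented names:

```python
def split_at(x, cut):
    # below-cut piece varies up to `cut`, then stays at `cut`;
    # above-cut piece stays at `cut` until x exceeds it
    return (min(x, cut), max(x, cut))
```

Giving each piece its own coefficient in the Cox model yields an effect that is piecewise linear in x and continuous at the cutpoint; nesting such splits reproduces the four-piece log2(sβ2m) effect of Table 6.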
Table 6  Cox Proportional Hazards Model Using Continuous Piecewise Linear Covariate Effects

Covariate                                 Hazard ratio    95% CI
Age up to 55 years (per 10 years)         0.80            (0.65–1.00)
Age after 55 years (per 10 years)         1.28            (1.13–1.44)
log2(sβ2m) up to 2                        1.00            (0.83–1.20)
log2(sβ2m) between 2 and 3                2.02            (1.58–2.58)
log2(sβ2m) between 3 and 5                0.93            (0.78–1.11)
log2(sβ2m) above 5                        5.87            (3.38–10.18)
log2(serum creatinine) up to 10           1.03            (0.92–1.17)
log2(serum creatinine) above 10           3.43            (0.97–12.08)
ABCM                                      0.79            (0.67–0.92)
Cuzick index int. or poor (vs. good)      1.34            (1.12–1.62)
Cuzick index poor (vs. int. or good)      1.14            (0.93–1.39)

−2 × log partial likelihood: 10,045
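The adjusted-residual diagnostics behind Figures 4 and 5 can be sketched for a single covariate. This is an illustrative reimplementation with invented data, not the authors' software; a weighted running-mean smoother stands in for the smoothing spline.

```python
import math

def martingale_residuals(times, events, z, beta):
    """M_i = delta_i - exp(beta*z_i) * Lambda0_hat(T_i), with the Breslow
    estimate of the cumulative baseline hazard; single covariate."""
    risk = [math.exp(beta * zi) for zi in z]

    def cum_base_haz(t):
        total = 0.0
        for j in range(len(times)):
            if events[j] == 1 and times[j] <= t:
                total += 1.0 / sum(r for tk, r in zip(times, risk) if tk >= times[j])
        return total

    E = [r * cum_base_haz(t) for t, r in zip(times, risk)]   # E_i terms
    M = [ei - Ei for ei, Ei in zip(events, E)]               # M_i = delta_i - E_i
    return M, E

def weighted_running_mean(x, y, w, bandwidth):
    """Mean-based weighted smoother, as Eq. [4] requires (not robust)."""
    out = []
    for xi in x:
        num = sum(wj * yj for xj, yj, wj in zip(x, y, w) if abs(xj - xi) <= bandwidth)
        den = sum(wj for xj, wj in zip(x, w) if abs(xj - xi) <= bandwidth)
        out.append(num / den)
    return out

# invented toy data set
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 1, 0]
z = [0.2, 1.1, 0.7, 1.9, 0.4]
M, E = martingale_residuals(times, events, z, beta=0.5)
adjusted = [m / e for m, e in zip(M, E)]                 # M_i / E_i
smooth = weighted_running_mean(z, adjusted, E, bandwidth=1.0)
```

A useful check on any implementation is that the martingale residuals computed with the Breslow estimator sum to zero for any value of beta.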
Figure 5 shows adjusted martingale residual plots for sβ2m and serum creatinine based on martingale residuals calculated from a model without these covariates; that is, only age, hemoglobin, ABCM, and the Cuzick index indicator variables are in the model. In contrast to Figure 4, the plot for serum creatinine indicates a strong effect (note that the scales on the y-axes are not the same as in Fig. 4). This is due to the correlation between serum creatinine and sβ2m discussed in Section II and illustrates the advantage of entering a covariate in a model at least as a linear term before calculating residuals. Recall from Table 4 that of the 202 individuals with log2(serum creatinine) greater than 7.67, 184 (91%) had log2(sβ2m) greater than 2.93.

Figure 5  Estimate of covariate transformation using smoothed adjusted martingale residuals, based on the model without sβ2m or serum creatinine.
V. NONPROPORTIONAL HAZARDS: TIME-VARYING COEFFICIENTS
In the Cox model, the hazard ratio of two individuals with covariates Z = z_1 and Z = z_2, respectively, is given by exp{βᵀ(z_2 − z_1)}, which does not vary with time. Covariate effects that are thought to change (on the hazard ratio scale) over time can be modeled by including user-defined time-dependent covariates, but that is not a particularly flexible approach. Rather, one may wish to consider the more general model

λ(t | Z) = λ_0(t) exp{β(t)ᵀZ}

in which the hazard ratio exp{β(t)ᵀ(z_2 − z_1)} is allowed to vary over time through the vector of functions β(t). Standard software for fitting a Cox model to time-dependent covariates can be used to estimate β(t) if one is willing to use a parametric regression spline, so that a single covariate X is replaced by a vector Xb(t), where b is a vector basis for the spline (21,22). However, regression splines are not the most flexible approach, and here we are more interested in diagnostic plots that can be used to examine the form of the functions β(t) rather than direct estimation.

One simple approach is to estimate the parameters of the Cox model locally in time. Although this is computationally intensive, it is conceptually simple and easily implemented in any package capable of doing Cox regression. To estimate the parameters at some point t*, one considers a window in time (t_0, t_1] containing t* and estimates β in a standard Cox model, left truncating the data at t_0 and right censoring at t_1. A disadvantage is that because the estimate is locally constant, there is bias toward the ends of the range of event times, in the same way that smoothing with a running-mean smoother in general results in bias toward the ends. This approach is discussed more fully by Valsecchi et al. (23).

In a similar way to the methods for transformations of covariates, one-step estimates of time-varying coefficients can be found by smoothing residuals against time. The appropriate residuals here are Schoenfeld residuals (24–26). Let t_(1), . . . , t_(d) be the unique event times; then the Schoenfeld residual at event time t_(i) is defined as

r̂_(i) = Σ_{j: T_j = t_(i)} { Z_j − S_1(β̂, t_(i)) / S_0(β̂, t_(i)) }

where S_k(β̂, t) = Σ_{j: T_j ≥ t} Z_j^{⊗k} exp(β̂ᵀZ_j). Note that the residuals are only defined at death times (there is not a residual that is identically zero at each censoring time). Note also that this is a vector-valued residual. Smoothing each component of the residuals against the event times leads to estimates of each component of the vector of functions β.
The residual r̂_(i) has variance V_(i), estimated by

V̂_(i) = Σ_{j: T_j = t_(i)} { S_2(β̂, t_(i))/S_0(β̂, t_(i)) − S_1(β̂, t_(i))^{⊗2}/S_0(β̂, t_(i))² }

where S_2(β̂, t) = Σ_{j: T_j ≥ t} Z_j^{⊗2} exp(β̂ᵀZ_j). Theory suggests that the expected value of V̂_(i)^{−1} r̂_(i) is approximately equal to {β(t_(i)) − β̂} (24). If one only had a single covariate, the shape (but not the magnitude) of β(t) could be estimated without standardizing, but because the different components are not in general independent, it is necessary to standardize the residuals by premultiplying by the inverse of their variance before smoothing against time. The adjusted residual V̂_(i)^{−1} r̂_(i) then has variance approximately V_(i)^{−1}. This variance can vary greatly, particularly for the later event times; therefore, in smoothing, for the kth covariate, the inverses of the kth diagonal elements of the matrices V̂_(i)^{−1} should be used as weights.

Often V̂_(i) will be nearly the same for each event time, in which case the Schoenfeld residuals can be adjusted using V̄, the mean of the V̂_(i)'s. This saves computational effort, particularly since V̄ is equal to the inverse of the Cox model variance matrix divided by the number of events, and the adjusted Schoenfeld residuals are therefore available without any extra computation in statistics packages (such as S-Plus and Stata). On the other hand, if the risk set becomes very small at later event times or, rather, if the range of covariate values in the risk set becomes small at later event times, then V̂_(i) is likely to be smaller as well, and using the mean V̄ can lead to bias (27). Therefore, if V̄ is used instead of V̂_(i), care must be taken in interpreting the smooths once the risk set is small.

A number of points from Section IV also apply to the plots in this section. Smoothing is needed for looking at plots of Schoenfeld residuals since the adjusted residuals themselves are often highly skewed and nearly uncorrelated. Additionally, trends in Schoenfeld residuals may be obscured by the residuals being in "bands" for different values of categorical covariates.
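For a single covariate the Schoenfeld residuals and their variances reduce to scalars. The sketch below (invented data, not the authors' software) computes r_(i), V_(i), and the adjusted residuals V_(i)^{-1} r_(i) + β̂ at each unique death time.

```python
import math

def schoenfeld(times, events, z, beta):
    """At each unique death time t(i), return (t(i), r(i), V(i)) where
    r(i) = sum over deaths at t(i) of (Z_j - S1/S0) and
    V(i) = d(i) * (S2/S0 - (S1/S0)^2); single covariate."""
    out = []
    for t in sorted(set(ti for ti, ei in zip(times, events) if ei == 1)):
        at_risk = [(zj, math.exp(beta * zj))
                   for tj, zj in zip(times, z) if tj >= t]
        s0 = sum(w for _, w in at_risk)
        s1 = sum(zj * w for zj, w in at_risk)
        s2 = sum(zj * zj * w for zj, w in at_risk)
        zbar = s1 / s0
        dead = [zj for tj, ej, zj in zip(times, events, z) if tj == t and ej == 1]
        r = sum(zj - zbar for zj in dead)
        v = len(dead) * (s2 / s0 - zbar * zbar)
        out.append((t, r, v))
    return out

beta_hat = 0.0   # would come from a fitted Cox model in practice
res = schoenfeld([1, 2, 3], [1, 1, 1], [1.0, 2.0, 3.0], beta_hat)
adjusted = [r / v + beta_hat for _, r, v in res if v > 0]
```

Smoothing `adjusted` against the death times, with the inverse variances as weights, gives the one-step estimate of β(t) described above; note that V_(i) is zero once the risk set contains a single covariate value, which is exactly the small-risk-set caveat raised in the text.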
The motivation for the plots is based on using smoothing to estimate expected values; therefore, smooths should be mean based and not robust. The constant estimates from the Cox model can be added to the smoothed adjusted residuals to estimate the functions β, or the smoothed adjusted residuals can be plotted by themselves to estimate the deviation of β from constant. The residual plots are only one-step estimates, so the covariate should be entered in the Cox model initially at least as constant, since a one-step estimate starting from a constant estimate should be better than a one-step estimate starting from zero.

Confidence intervals

In the same way as described in Section IV.B, confidence intervals are important for interpreting smooths of adjusted Schoenfeld residuals. Approximate pointwise confidence intervals can be estimated fairly easily if a linear smoother is used (26). Let β̂¹_(i) be the adjusted Schoenfeld residual plus the constant estimate, β̂¹_(i) = V̂_(i)^{−1} r̂_(i) + β̂; then the variance of β̂¹_(i) can be estimated by V̂_(i)^{−1}, and for i ≠ i′, β̂¹_(i) and β̂¹_(i′) are approximately uncorrelated. Thus, for covariate k, the variance of the vector of estimates β̂¹_(i)k can be estimated by the diagonal matrix with (i)th diagonal element equal to the kth diagonal element of V̂_(i)^{−1}. If the weighted smoothing for covariate k is represented by a linear smoothing matrix L_k, then the variance of the smoothed estimate can be estimated by premultiplying the diagonal variance matrix by L_k and postmultiplying by L_kᵀ.
Example Figure 6 shows the weighted smooth of the adjusted Schoenfeld residuals with the constant estimate added for sβ2 m, from the fitted Cox model in Table 6. The smoother is a smoothing spline with 7 degrees of freedom. The shaded histograms indicate the distribution of observed deaths, for patients with log 2 (sβ2 m) between 2 and 3 in the first part and for those with log 2 (sβ2 m) above 5 in the second part. From these plots it seems that high values of sβ2 m are associated with an increased hazard initially, but the effect decreases over time, and possibly sβ2 m has no prognostic value beyond 2 or 3 years. A third Cox model was fitted in which the constant coefficients of log2 (sβ2 m) were replaced by a piecewise constant coefficient, which was allowed to have different values for up to and beyond 2 years. The four variables for log2 (sβ2 m) were set to zero after 2 years and a further variable was defined to have value zero up to 2 years and the value of log2 (sβ2 m) after 2 years. The results are in Table 7; thus, there does not seem to be any statistically significant association between sβ2 m and increased hazard beyond 2 years. Again, the Cox model gives an idea of the prognostic effect of sβ2 m with error estimates in each of the two time intervals but does not give any idea of how quickly or in what manner the effect decreases over time; an idea
Figure 6 Estimates of time-varying coefficients with 95% confidence intervals using smoothed Schoenfeld residuals.
Prognostic Factors and Survival
Table 7  Cox Model Using Continuous Piecewise Linear Covariate Effects with Coefficients Piecewise Constant in Time

Covariate                                              Hazard ratio   95% CI
Age up to 55 years (per 10 years)                      0.81           (0.65–1.00)
Age after 55 years (per 10 years)                      1.28           (1.13–1.44)
log2(sβ2m) up to 2 in the first two years              0.98           (0.73–1.33)
log2(sβ2m) between 2 and 3 in the first two years      2.83           (2.02–3.96)
log2(sβ2m) between 3 and 5 in the first two years      1.11           (0.91–1.35)
log2(sβ2m) above 5 in the first two years              4.61           (2.65–8.02)
log2(sβ2m) after the first 2 years                     1.02           (0.92–1.14)
log2(serum creatinine) up to 10                        1.01           (0.90–1.14)
log2(serum creatinine) above 10                        2.92           (0.83–10.27)
ABCM                                                   0.79           (0.67–0.92)
Cuzick index int. or poor (vs. good)                   1.42           (1.18–1.71)
Cuzick index poor (vs. int. or good)                   1.15           (0.94–1.40)

−2 × log partial likelihood: 10,013
of this can be seen from the figure instead. The value of minus twice the log partial likelihood for this model is 32 less than that for the model in Table 6, with one extra variable.
VI. STRATIFICATION

A covariate in the Cox model may require both a transformation, as in Section IV, and a time-dependent covariate, as in Section V. It might therefore be necessary to study the effect of a continuous prognostic variable without assuming either proportional hazards or a particular functional form with respect to the covariate. A simple way of doing this is to discretize the covariate as in Section IV and use a stratified Cox model. Denote by X the discretized covariate of interest, and model the conditional hazard as

λ(t | Z, X = k) = λ_k(t) exp(βᵀZ)    [6]
where there are now several ‘‘baseline hazards,’’ one for each level of the discretized covariate. The functions λ_k(t) can be estimated using smoothing (28). However, estimating the set of functions λ_k now means estimating a function of both time and the covariate X; therefore, without strong constraints on the functions or a very large data set, it can be estimated only with very limited accuracy.
Estimation of the corresponding survival functions is much easier than estimation of the hazard functions; this can be done using product-limit type estimators

Ŝ_k(t) = ∏_{T_i ≤ t} {1 − ΔΛ̂_k(T_i)}

where the jumps are defined by

ΔΛ̂_k(T_i) = Σ_{j: X_j = k, T_j = T_i} δ_j / Σ_{j: X_j = k, T_j ≥ T_i} exp(β̂ᵀZ_j)

Note that it is important to center the covariates so that the baseline survival functions correspond to a realistic combination of covariates (Z = 0). Extrapolation from a group of individuals aged 40–65 years to age 0, for instance, is likely to lead to nonsensical results.

Such survival curves can be interpreted directly, reading off x-year (e.g., 3-year) survival or the y-percentile (e.g., median) survival in each group. Other plots based on stratified survival estimates can also be used; for example, box plots can be used with stratified censored data and may provide a useful visual summary of the survival functions (6,7).

As in Section IV.B, discretizing a covariate and estimating a separate baseline hazard for each resulting stratum is equivalent to estimating a hazard that is piecewise constant in the covariate. A continuously stratified Cox model may be used instead to provide a hazard estimate that is smooth in the covariate. Such estimators are discussed in detail by Sasieni (29) and Dabrowska (30).

The survival estimates Ŝ_k are based on a stratification determined by a specific covariate of interest, X. Alternatively, interest might be simply in estimating survival in prognostic groups, within which individuals have similar survival functions. Tree-based methods can be used to divide the data into strata (prognostic groups) based on the values of several covariates so that survival within each stratum is relatively homogeneous (31,32).

The baseline survival functions can also be used in exploratory analysis for deciding on an appropriate model. For example, the proportional hazards assumption has traditionally been examined by plotting log{−log Ŝ_k(t)} against t for all strata on the same axes; under proportional hazards, the resulting curves should be parallel (that is, the vertical distance between two curves should be constant for all values of t).
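For concreteness, the stratified product-limit estimator above can be sketched in code. This is a minimal illustration, not the authors' software: the function name and data layout are ours, and the Cox coefficients β̂ are assumed to have been fitted already (covariates should already be centered).

```python
from math import exp

def stratified_baseline_survival(times, status, strata, z, beta):
    """Product-limit type estimate of the baseline survival S_k(t) in each
    stratum k, with risk sets weighted by exp(beta' Z_j) as in the text.
    times: observed times T_i; status: failure indicators (0/1);
    strata: stratum label X_i; z: covariate vectors Z_i; beta: fitted coefficients."""
    lp = [exp(sum(b * zj for b, zj in zip(beta, zi))) for zi in z]
    curves = {}
    for k in set(strata):
        idx = [i for i, s in enumerate(strata) if s == k]
        surv, s_hat = [], 1.0
        for t in sorted({times[i] for i in idx if status[i] == 1}):
            deaths = sum(status[i] for i in idx if times[i] == t)
            risk = sum(lp[i] for i in idx if times[i] >= t)
            s_hat *= 1.0 - deaths / risk   # factor 1 - dLambda_k(t)
            surv.append((t, s_hat))
        curves[k] = surv
    return curves
```

With β̂ = 0 this reduces to the ordinary Kaplan–Meier estimator within each stratum.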
Unfortunately, it is surprisingly difficult to judge whether two nonlinear curves are parallel, and the methods of Sections IV and V are more useful for exploring possible models.

Example

From Figures 4 and 6 it has been seen that the association between sβ2m and the hazard ratio is both nonlinear in the value of log2(sβ2m) and nonconstant in time. Therefore, to estimate the effect of sβ2m on survival without making either of these assumptions, a stratified Cox model was used with strata
defined by log2(sβ2m) with cutpoints 2.6 and 5. The lower cutpoint was chosen so that the two strata with log2(sβ2m) ≤ 5 had approximately equal numbers of individuals. Figure 7 shows estimates of the survival functions corresponding to the baseline hazard functions for the three strata; each of the other covariates was centered to have mean zero, so that the estimates correspond to the mean values of the other covariates. Individuals in the same stratum are not expected to have the same survival functions, so the estimates are for a randomly selected individual from the group. Summary statistics from Figure 7 are in Table 8.

Figure 7  Estimated baseline survival functions and 95% confidence intervals from the stratified Cox model. Strata are defined by log2(sβ2m) ≤ 2.6, 2.6 < log2(sβ2m) ≤ 5, and 5 < log2(sβ2m).
Table 8  Estimated Survival from Stratified Cox Model

log2(sβ2m)   Median survival   Probability of surviving 2 years
<2.6         36 mo             0.70
2.6–5        21 mo             0.45
>5           4 mo              0.22

VII. DISCUSSION

The standard approach to regression analysis of censored data is to use the semiparametric proportional hazards model. The model is extremely useful for adjusting for possible confounders when the main interest is in a binary covariate. Formal inference can be based on the score test, which is an adjusted log-rank test, and survival in the two groups can be examined without assuming proportional hazards by use of a stratified Cox model. If, however, the primary interest is in one or more continuous covariates, one may wish to investigate more flexible models. There are now a variety of graphical techniques that can be extremely useful in pointing the data analyst toward a more appropriate model. It is often tempting to overinterpret nonlinear effects detected by such plots, and one should err on the side of caution unless a validation sample is available to check the significance of trends found during exploratory analyses.

We have considered two main types of departure from the log-linear proportional hazards model. First, we looked at covariates whose effect on the prognostic index may not be linear. This may arise from a U- or J-shaped relationship, or when there is a large amount of information available so that more subtle departures from linearity can be detected. In the example used throughout this chapter, we observed a nonmonotone relationship with age (patients aged about 55 years had a better prognosis than both older and younger patients) and a monotone, but distinctly nonlinear, relationship with the logarithm of sβ2m concentration. We also saw how the apparent significance of serum creatinine depends critically on whether or not sβ2m is adjusted for. The second form of departure considered was covariate effects that change over time on the proportional hazards scale. In particular, we saw how sβ2m, which is so informative for short-term survival, contains little or no information regarding the subsequent survival of those who survive at least 2 years from diagnosis.
REFERENCES

1. MacLennan ICM, Chapman C, Dunn J, Kelly K. Combined chemotherapy with ABCM versus melphalan for treatment of myelomatosis. Lancet 1992; 339:200–205.
2. Cuzick J, Galton DAG, Peto R, for the MRC's Working Party on Leukaemia in Adults. Prognostic features in the third MRC myelomatosis trial. Br J Cancer 1980; 43:831–840.
3. MRC Working Party on Leukaemia in Adults. Objective evaluation of the role of vincristine in induction and maintenance therapy for myelomatosis. Br J Cancer 1985; 52:153–158.
4. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc B 1972; 34:187–220.
5. Le Cessie S, van Houwelingen JC. Ridge estimators in logistic regression. Appl Stat 1992; 41:191–201.
6. Gentleman R, Crowley J. A graphical approach to the analysis of censored data. Breast Cancer Res Treat 1992; 22:229–240.
7. Gentleman R, Crowley J. Graphical methods for censored data. J Am Stat Assoc 1991; 86:678–683.
Prognostic Factors and Survival
455
8. Tibshirani R, Hastie T. Local likelihood estimation. J Am Stat Assoc 1987; 82:559–567.
9. Gentleman R, Crowley J. Local full likelihood estimation for the proportional hazards model. Biometrics 1991; 47:1283–1296.
10. Hastie T, Tibshirani R. Generalized additive models. Stat Sci 1986; 1:297–318.
11. Sleeper LA, Harrington DP. Regression splines in the Cox model with application to covariate effects in liver disease. J Am Stat Assoc 1990; 85:941–949.
12. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med 1989; 8:551–561.
13. Gray RJ. Flexible methods for analyzing survival data, using splines, with applications to breast cancer prognosis. J Am Stat Assoc 1992; 87:942–951.
14. O'Sullivan F. Nonparametric estimation of relative risk using splines and cross-validation. SIAM J Sci Stat Comput 1988; 9:531–542.
15. Hastie T, Tibshirani R. Exploring the nature of covariate effects in the proportional hazards model. Biometrics 1990; 46:1005–1016.
16. Therneau TM, Grambsch PM, Fleming TR. Martingale-based residuals for survival models. Biometrika 1990; 77:147–160.
17. Lagakos SW. The graphical evaluation of explanatory variables in proportional hazard regression models. Biometrika 1981; 68:93–98.
18. Crowley J, Storer BE. Comment on Aitkin M, Laird N, Francis B, A reanalysis of the Stanford heart transplant data. J Am Stat Assoc 1983; 78:278–281.
19. Grambsch PM, Therneau TM, Fleming TR. Diagnostic plots to reveal functional form for covariates in multiplicative intensity models. Biometrics 1995; 51:1469–1482.
20. Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall, 1990.
21. Hess KR. Assessing time-by-covariate interactions in proportional hazards regression models using cubic spline functions. Stat Med 1994; 13:1045–1062.
22. Abrahamowicz M, MacKenzie T, Esdaile JM. Time-dependent hazard ratio: modelling and hypothesis testing with application in lupus nephritis. J Am Stat Assoc 1996; 91:1432–1439.
23. Valsecchi MG, Silvestri D, Sasieni P. Evaluation of long-term survival: use of diagnostics and robust estimators with Cox's proportional hazards model. Stat Med 1996; 15:2763–2780.
24. Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika 1982; 69:239–241.
25. Pettitt AN, Bin Daud I. Investigating time dependence in Cox's proportional hazards model. Appl Stat 1990; 39:313–329.
26. Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 1994; 81:515–526. [Correction in Biometrika 1995; 82:668.]
27. Winnett AS, Sasieni P. A note on scaled Schoenfeld residuals for the proportional hazards model. Biometrika 2001. (In press.)
28. Wells MT. Nonparametric kernel estimation in counting processes with explanatory variables. Biometrika 1994; 81:795–801.
29. Sasieni P. Information bounds for the conditional hazard ratio in a nested family of regression models. J R Stat Soc B 1992; 54:617–635.
456
Sasieni and Winnett
30. Dabrowska DM. Smoothed Cox regression. Ann Stat 1997; 25:1510–1540.
31. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993; 88:457–467.
32. Crowley J, LeBlanc M, Jacobson J, Salmon SE. Some exploratory tools for survival analysis. In: Lin DY, Fleming TR, eds. Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis. New York: Springer, 1997:199–229.
22
Tree-Based Methods for Prognostic Stratification

Michael LeBlanc
Fred Hutchinson Cancer Research Center, Seattle, Washington
I. INTRODUCTION
Identification of groups of patients with differing prognosis is often desired for understanding the association of patient characteristics with survival times and for aiding in the design of clinical trials. Applications include the development of stratification schemes for future clinical trials and the identification of patients suitable for studies involving therapy targeted at a specific prognostic group. For instance, identification of patients with poor prognosis may be useful in determining eligibility for studies involving high-dose therapy for that disease.

Cox's (1) proportional hazards model is a flexible tool for the study of covariate associations with survival time. It has been used to identify prognostic groups of patients by using the linear component of the model (prognostic index) or, informally, by counting up the number of poor prognostic factors corresponding to terms in the fitted model. However, the model does not directly lead to an easily interpretable description of patient prognostic groups. An alternative to using scores constructed from the Cox model is a rule that can be expressed as simple logical combinations of covariate values. For example, an individual with some hypothetical disease may be said to have poor prognosis if (age > 60) and (serum creatinine > 2) or (serum calcium < 5) and (sex = male). Tree-based methods, also called recursive partitioning methods, are techniques for adaptively deriving such logical rules from patient data.

Tree-based methods were formalized and extensively studied by Breiman et al. (2). Trees have also recently been of interest in machine learning; one example is the C4.5 algorithm due to Quinlan (3). Tree-based methods recursively split the data into groups, leading to a fitted model that is piecewise constant over regions of the covariate space. Each region is represented by a terminal node in a binary decision tree. Tree-based methods have been extended to censored survival data with the goal of finding groups of patients with differing prognosis (4–8). Some examples of tree-based methods for survival data in clinical studies are given in Albain et al. (9), Ciampi et al. (10), and Kwak et al. (11).

In this chapter, we discuss some general methodological aspects of tree-based modeling for survival data. We illustrate the methods using data from a clinical trial for patients with myeloma conducted by the Southwest Oncology Group (SWOG). SWOG study 8624 entered patients between 1987 and 1990 and showed a significant treatment effect (12). However, we combined the data from the treatment arms for the prognostic analyses presented here.
II. NOTATION AND PIECEWISE CONSTANT MODEL

We assume that X is the true survival time, C is the random censoring time, and Z is a p-vector of covariates. The observed variables are the triple (T = X ∧ C, Δ = I{X ≤ C}, Z), where T is the time under observation and Δ is an indicator of failure. Given Z, we assume that X and C are independent. The data consist of a sample of independent observations {(t_i, δ_i, z_i): i = 1, 2, . . . , N} distributed as the vector (T, Δ, Z). The survival probability at time t is denoted by

S(t | z) = P(X > t | z)

Trees represent approximating models that are homogeneous over regions of the prediction space. The model for survival can be represented as the piecewise model

S(t | z) = Σ_{h∈T̃} S_h(t) I{z ∈ B_h}
where B_h is a ‘‘box’’-shaped region in the predictor space, represented by a terminal node h; S_h(t) is the survival function corresponding to region B_h; and T̃ is the set of terminal nodes. Each terminal node can be described by a logical rule, for instance, (z_1 < 3) ∩ (z_2 ≥ 3) ∩ (z_5 < 3). With the sample sizes typically available in clinical applications, a piecewise constant model can yield quite poor approximations to the true conditional survival function, which is likely a smooth function of the underlying covariates. Smooth methods such as linear Cox regression likely yield better function approximations than tree-based methods. However, the primary motivation for using tree-based models is the easy interpretation of the resulting regions or groups of patients, and such rules are not directly obtained by methods that assume S(t | z) is a smooth function of the covariates.

The myeloma data from SWOG study 8624 presented in this chapter are from 478 patients with complete data for sex, age, performance status, calcium, creatinine, albumin, and serum β2 microglobulin.
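As a toy illustration of the piecewise constant representation, a fitted tree can be stored as a list of (region rule, node survival function) pairs and evaluated directly. The two regions and their survival functions below are hypothetical (loosely echoing the creatinine split discussed later in the chapter), not the fitted model.

```python
# Each terminal node pairs a logical rule (a predicate on the covariate
# vector) with a node-specific survival function S_h(t). The regions
# must partition the covariate space.
nodes = [
    (lambda z: z["creatinine"] >= 1.7, lambda t: 0.5 ** t),  # hypothetical S_h
    (lambda z: z["creatinine"] < 1.7,  lambda t: 0.9 ** t),  # hypothetical S_h
]

def survival(t, z):
    """Piecewise constant tree model: S(t|z) = sum_h S_h(t) I{z in B_h}."""
    for in_region, s_h in nodes:
        if in_region(z):
            return s_h(t)
    raise ValueError("regions should partition the covariate space")
```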
III. CONSTRUCTING A TREE RULE

A tree-based model is developed by recursively partitioning the data. At the first step the covariate space is partitioned into two regions and the data are split into two groups. The splitting rule is applied recursively to each of the resulting regions until a large tree has been grown. Splits along a single covariate are used because they are easy to interpret. For an ordered covariate, splits are of the form ‘‘Z_j < c’’ or ‘‘Z_j ≥ c,’’ and for a nominal covariate splits are of the form ‘‘Z_j ∈ S’’ or ‘‘Z_j ∉ S,’’ where S is a nonempty subset of the set of labels for the nominal predictor Z_j. Potential splits are evaluated for each of the covariates, and the covariate and split value resulting in the greatest reduction in impurity are chosen. The improvement for a split at node h into left and right daughter nodes l(h) and r(h) is

G(h) = R(h) − [R(l(h)) + R(r(h))]    (1)

where R(h) is the residual error at a node. For uncensored continuous response problems, R(h) is typically the mean residual sum of squares or mean absolute error. For survival data, it would be reasonable to use the deviance corresponding to an assumed survival model. For instance, the exponential model deviance for node h is

R(h) = Σ_i 2[δ_i log(δ_i / (λ̂_h t_i)) − (δ_i − λ̂_h t_i)]

where λ̂_h is the maximum likelihood estimate of the hazard rate in node h. Often an exponential assumption for survival times is not valid. However, a nonlinear transformation of the survival times may make the distribution of survival times closer to an exponential distribution. LeBlanc and Crowley (7) investigate a ‘‘full-likelihood’’ method that is equivalent to transforming time by the marginal cumulative hazard function and using the exponential deviance.
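The exponential node deviance above is simple to compute: λ̂_h is the number of failures divided by the total follow-up time in the node, with the convention that δ log(δ/·) = 0 when δ = 0. A minimal sketch (our function name, not the authors' software):

```python
from math import log

def exp_node_deviance(times, status):
    """Exponential-model deviance R(h) for the observations in a node.
    times: follow-up times t_i; status: failure indicators delta_i (0/1)."""
    lam = sum(status) / sum(times)  # MLE of the hazard rate in the node
    dev = 0.0
    for t, d in zip(times, status):
        # Convention: d * log(d / (lam * t)) is 0 when d == 0.
        term = d * log(d / (lam * t)) if d > 0 else 0.0
        dev += 2.0 * (term - (d - lam * t))
    return dev
```

The deviance is zero when the exponential model fits the node data perfectly and grows as the observed times depart from that fit.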
Figure 1 Logrank test statistics for splits at observed values of creatinine.
However, most recursive partitioning schemes for censored survival data use the logrank test statistic of Mantel (13) for G(h) to measure the separation in survival times between two groups. Simulation studies of the performance of splitting with the logrank test statistic and some other between-node statistics are given in LeBlanc and Crowley (8) and Crowley et al. (14). Figure 1 shows the value of the logrank test statistic for groups defined by (creatinine < c) and (creatinine ≥ c) at observed values of the covariate for the entire data set. The largest logrank test statistic corresponds to a split at c = 1.7 and would lead to the first split in a tree-based model being (creatinine < 1.7) versus (creatinine ≥ 1.7).

A. Updating Splitting Statistics

It is also important that the splitting statistic can be calculated efficiently at all possible split points for continuous covariates. While the logrank test is relatively inexpensive to calculate, one way to improve computational efficiency is to use a simple approximation to the logrank statistic that allows simple updating algorithms to consider all possible splits (15). Updating algorithms can also be
constructed for exponential deviance; for instance, see Davis and Anderson (16) and LeBlanc and Crowley (15).
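A direct (unoptimized) version of this split search can be sketched as follows. An efficient implementation would update the risk sets incrementally as the cutpoint moves, as in reference (15); the brute-force scan below only shows the logic, and the function names are ours.

```python
def logrank_stat(times, status, group):
    """Two-sample logrank chi-square statistic (1 df).
    group[i] is True for observations on the left side of the split."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, d in zip(times, status) if d == 1}):
        at_risk = [i for i in range(len(times)) if times[i] >= t]
        n = len(at_risk)
        n1 = sum(1 for i in at_risk if group[i])
        d = sum(1 for i in at_risk if times[i] == t and status[i] == 1)
        d1 = sum(1 for i in at_risk if times[i] == t and status[i] == 1 and group[i])
        o_minus_e += d1 - d * n1 / n               # observed minus expected
        if n > 1:                                   # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0

def best_split(times, status, x):
    """Scan candidate cutpoints c of covariate x; split is x < c vs x >= c."""
    best = (None, 0.0)
    for c in sorted(set(x))[1:]:                    # both sides nonempty
        stat = logrank_stat(times, status, [xi < c for xi in x])
        if stat > best[1]:
            best = (c, stat)
    return best
```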
B. Constraints on Splitting

If there are weak associations between the survival times and covariates, splitting on a continuous covariate tends to select splits that send almost all the observations to one side of the split. This is called ‘‘end-cut’’ preference by Breiman et al. (2). When growing survival trees, we restrict both the minimum total number of observations and the minimum number of uncensored observations within any potential node. This restriction is also important for prognostic stratification, since very small groups of patients are usually not of clinical interest.

Figure 2 shows the tree grown using the logrank test statistic for splitting, with a constraint of a minimum node size of 40 observations and 5 uncensored cases. The tree has nine terminal nodes. Below each split in the tree, the logrank test statistic and a permutation p value are presented. The p value is calculated at each node by permuting responses over the covariates and recalculating the best split at the node 1000 times. At each terminal node, the logarithm of the hazard ratio relative to the left-most node and the number of cases falling into the node are presented. The logarithm of the hazard ratio is obtained by fitting a Cox (1) model with dummy variables defined by the terminal nodes in the tree. The worst prognostic group is patients with (creatinine ≥ 1.7) and (albumin < 3.3), corresponding to an estimated logarithm of the hazard ratio, relative to the best prognostic group, of 1.7. While the minimum node size was set to be quite large (40 observations), the logrank test statistics near the bottom of the tree (and the permutation p values) indicate that there may be several nodes that should be combined to simplify the model.

Figure 2  Unpruned survival tree. Below each split is the logrank test statistic and a permutation p value. Below each terminal node is the logarithm of the hazard ratio relative to the left-most node in the tree and the number of cases in the node.
IV. PRUNING AND SELECTING A TREE

Two general methods have been proposed for pruning trees for survival data. The methods that use within-node error or deviance usually adopt the Classification and Regression Tree (CART) pruning algorithm directly.

A. Methods Based on Within-Node Deviance
In the CART algorithm, the performance of a tree is based on the cost-complexity measure

R_α(T) = Σ_{h∈T̃} R(h) + α|T̃|

of the binary tree T, where T̃ is the set of terminal nodes, |T̃| is the number of terminal nodes, α is a nonnegative parameter, and R(h) is the cost (often deviance) of node h. A subtree (a tree obtained by removing branches) T_o is an optimally pruned subtree of T for penalty α if

R_α(T_o) = min_{T′ ⊴ T} R_α(T′)

where ‘‘⊴’’ means ‘‘is a subtree of,’’ and T_o is the smallest optimally pruned subtree if T_o ⊴ T″ for every optimally pruned subtree T″. The cost-complexity pruning algorithm obtains the optimally pruned subtree for any α. The algorithm finds the sequence of optimally pruned subtrees by repeatedly deleting branches of the tree for which the average reduction in impurity per split in the branch is small. The cost-complexity pruning algorithm is necessary for finding optimal subtrees because the number of possible subtrees grows very rapidly as a function of tree size.

The deviance will always decrease for larger trees in the nested sequence based on the data used to construct the tree. Therefore, some way of honestly estimating deviance for a new sample is required to select a tree that would have small expected deviance. If a test sample is available, the deviance for the test sample can be calculated for each of the pruned trees in the sequence using the node estimates from the training sample. For instance, the deviance at a node would be
R(h) = Σ_i 2[δᵢᵀ log(δᵢᵀ / (λ̂_h tᵢᵀ)) − (δᵢᵀ − λ̂_h tᵢᵀ)]

where (tᵢᵀ, δᵢᵀ) are the survival times and status indicators for test sample observations falling into node h (zᵢᵀ ∈ B_h), and the node estimate λ̂_h is calculated from the learning sample.

However, usually a test sample is not available. Therefore, the selection of the best tree can be based on resampling-based estimates of prediction error (or expected deviance). The most popular method for tree-based models is the K-fold cross-validation estimate of deviance. The training data ᏸ are divided into K test samples ᏸ_k and training samples ᏸ_(k) = ᏸ − ᏸ_k, k = 1, . . . , K, of about equal size. Trees are grown with each of the training samples ᏸ_(k); each test sample ᏸ_k is used to estimate the deviance using the parameter estimates from the training sample ᏸ_(k). The K-fold cross-validation estimate of deviance is the sum of the test sample estimates. The tree that minimizes the cross-validation estimate of deviance (or a slightly smaller tree) is selected. While K-fold cross-validation is a standard method for selecting tree size, it is subject to considerable variability; this is noted for survival data in simulations given in LeBlanc and Crowley (8). Therefore, other methods, such as those based on bootstrap resampling, may be useful alternatives (17). One bootstrap method for trees based on logrank splitting is given in the next section.
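The weakest-link step of cost-complexity pruning described above can be sketched on a toy tree structure. The nested-dict representation and the assumption that node costs R(h) (e.g., deviances) are precomputed are our own illustrative conventions, not the CART implementation.

```python
def leaves_cost(node):
    """Return (sum of leaf costs, number of leaves) below a node."""
    if "left" not in node:                      # terminal node
        return node["R"], 1
    lc, ln = leaves_cost(node["left"])
    rc, rn = leaves_cost(node["right"])
    return lc + rc, ln + rn

def weakest_link(node, best=None):
    """Find the internal node with smallest g(h) = (R(h) - R(T_h)) / (|T_h| - 1),
    the average improvement per split in the branch rooted at h."""
    if "left" not in node:
        return best
    cost, nleaves = leaves_cost(node)
    g = (node["R"] - cost) / (nleaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

def prune_once(root):
    """One step of cost-complexity pruning: collapse the weakest link to a leaf.
    Assumes the root is an internal node."""
    g, node = weakest_link(root)
    node.pop("left"), node.pop("right")
    return g, root
```

Repeating `prune_once` until only the root remains yields the nested sequence of optimally pruned subtrees, with the returned g values giving the critical α thresholds.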
B. Methods Based on Between-Node Separation

LeBlanc and Crowley (8) developed an optimal pruning algorithm, analogous to the cost-complexity pruning algorithm of CART, for tree performance based on between-node separation. They define the split-complexity of a tree as

G_α(T) = G(T) − α|S|

where G(T) is the sum of the standardized splitting statistics G(h) in the tree T:

G(T) = Σ_{h∈S} G(h)

where S represents the internal nodes of T. A tree T_o is an optimally pruned subtree of T for complexity parameter α if

G_α(T_o) = max_{T′ ⊴ T} G_α(T′)
and it is the smallest optimally pruned subtree if T_o ⊴ T′ for every optimally pruned subtree T′. The algorithm repeatedly prunes off branches with the smallest average logrank test statistic in the branch. An alternative pruning method, based on the maximum value of the test statistic within any branch, was proposed by Segal (6).

Since the same data are used to select the split point and variable as are used to calculate the test statistic, we use a bias-corrected version of the split-complexity described above,

G_α(T) = Σ_{h∈S} G*(h) − α|S|

where the corrected split statistic is G*(h) = G(h) − Δ*(h) and the bias is denoted by

Δ*(h) = E_{Y*} G(h; ᏸ*, ᏸ) − E_{Y*} G(h; ᏸ*, ᏸ*)

The function G(h; ᏸ*, ᏸ) denotes the test statistic when the data ᏸ* were used to determine the split variable and value and the data ᏸ were used to evaluate the statistic. The function G(h; ᏸ*, ᏸ*) denotes the statistic when the same data were used both to pick the split variable and value and to calculate the test statistic. The difference Δ*(h) is the optimism due to adaptive splitting of the data. We use the bootstrap to obtain an estimate Δ̂*(h), and then select trees that minimize the corrected goodness of split

G̃_α(T) = Σ_{h∈S} (G(h) − Δ̂*(h)) − α|S|

Note that G̃_α(T) is similar to the bias-corrected version of split-complexity used in LeBlanc and Crowley (8), except that here the correction is done locally for each split, conditional on the splits higher in the tree. We typically chose a complexity parameter α = 4. Note that if splits were not selected adaptively, an α = 4
would correspond approximately to the 0.05 significance level for a split, and α = 2 is in the spirit of AIC (18). Permutation sampling methods can also be used to attach an approximate p value to each split, conditional on the tree structure above the split, to help the interpretation of individual splits. Figure 3 shows a pruned tree based on the corrected goodness of split using 25 bootstrap samples with α = 4. There are five terminal nodes. For example, the patients represented by (creatinine < 1.7) and (serum β2 microglobulin < 5.1) and (age < 62) have the best prognosis, with an estimated median survival of 5.1 years.

Figure 3  Pruned survival tree. Below each split is the logrank test statistic and permutation p value. Below each terminal node is the median survival and the number of patients represented by that node.
V. FURTHER RECOMBINATION OF NODES

Usually, only a small number of prognostic groups are of interest. Therefore, further recombination of nodes with similar prognosis from the pruned tree may
be required. We select a measure of prognosis (for instance, hazard ratios relative to some node in the tree, or median survival for each node) and rank the terminal nodes in the pruned tree by that measure. After ranking the nodes, there are several options for combining them. One method is to grow another tree on the ranked nodes and allow the second tree to select only three or four nodes; another is to divide the nodes based on quantiles of the data; and a third is to evaluate all possible recombinations of the nodes into V groups and choose the partition that yields the largest partial likelihood or largest V-sample logrank test statistic. The result of recombining to yield the largest partial likelihood for a three-group combination of the pruned myeloma tree in Figure 3 is presented in Figure 4.

Figure 4  Survival for the three prognostic groups derived from the pruned survival tree. The best prognostic group corresponds to nodes 1 and 2, the middle prognostic group corresponds to nodes 3 and 5, and the worst prognostic group corresponds to node 4 (numbered from the left on Fig. 3).

The results of the prognostic staging scheme can be compared with the currently used Durie-Salmon staging system (19), which is based on the number of tumor cells categorized into three groups (I, II, III) and a classification of kidney function. Figure 5 shows the system collapsed into three groups (I–II, IIIA, IIIB). However, the tree-based prognostic stratification is adaptively chosen based on the response, so we would expect some shrinkage of the differences on a test data set.

Figure 5  Survival by Durie-Salmon stages (I–II, IIIA, IIIB).
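When the terminal nodes are first ranked by prognosis, the third recombination option in Section V, evaluating all groupings into V groups, reduces to enumerating cuts of the ranked list into contiguous blocks (grouping nodes with similar prognosis naturally keeps groups contiguous in the ranking). A sketch of the enumeration, with the partial-likelihood or logrank scoring step left abstract; the helper name is ours:

```python
from itertools import combinations

def contiguous_partitions(ranked_nodes, v):
    """Enumerate all ways of cutting a prognosis-ranked list of terminal
    nodes into v contiguous groups. Each candidate grouping would then be
    scored, e.g., by a partial likelihood or a v-sample logrank statistic."""
    n = len(ranked_nodes)
    for cuts in combinations(range(1, n), v - 1):
        bounds = (0,) + cuts + (n,)
        yield [ranked_nodes[bounds[i]:bounds[i + 1]] for i in range(v)]
```

For the five ranked nodes of the pruned tree and V = 3 groups, only C(4, 2) = 6 candidate partitions need to be scored.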
VI. OTHER ISSUES

A. Competing Splits

The tree model shows only the split point and variable (and test statistic) for the best split at each node. However, there may be alternative partitions that yield only slightly smaller test statistics. Therefore, it can be desirable at a given node to automatically investigate the performance of splits on other variables. Figure 6 shows the logrank test statistics for potential splits at the first split in the tree. For instance, while the creatinine < 1.7 split yielded the largest logrank test statistic, the plots show that many values of serum β2 microglobulin between 5 and 15 would also yield large test statistics. Given that the tree (Fig. 3) subsequently splits on serum β2 microglobulin, it may be reasonable to investigate alternative tree models that use only serum β2 microglobulin. This could also be supported by the biology, since serum β2 microglobulin measures the impact of the disease on kidney function and tumor volume, whereas creatinine focuses on kidney function.

Figure 6  Logrank test statistics for splits on continuous variables at the root node of the tree.

B. Stability of a Split

It has been recognized that the mechanism of selecting a best split, and the recursive partitioning of data into smaller and smaller data sets, can lead to instability in the tree structure in some data sets. Note that this does not necessarily mean that the predictions are unstable. One easily implemented way to understand the stability of the structure of the tree is to draw bootstrap samples of the observed data and recalculate the splitting statistics. Figure 7 shows the frequency of the variables and split values chosen as the first split in the tree over 100 bootstrap samples. All the bootstrap splits were on either creatinine or serum β2 microglobulin. This reflects both the prognostic importance of these two variables and the significant correlation between them. The best splits from the bootstrap show a strong peak for creatinine at 1.7; for serum β2 microglobulin, however, the best splits ranged widely between 5 and 15. Note that overall 63% of the bootstrap samples split on serum β2 microglobulin and 37% split on creatinine. Thus, while split point selection on serum β2 microglobulin may be unstable, it still leads to important prognostic groupings over a wide range of split points. This may lead an investigator to define the groups not at the optimal split on serum β2 microglobulin but at one chosen so that a sufficient number of patients fall into the good or poor prognosis groups to meet the goals of the prognostic analysis.
Figure 7 Frequency of first splits for 100 bootstrap samples.
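A minimal sketch of this bootstrap check of split stability is given below. It assumes survival times, event indicators, and a covariate matrix held as NumPy arrays, and uses a bare-bones logrank split search in place of the full tree software; the function names and the synthetic data are illustrative, not the authors' code:

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-sample logrank chi-square statistic; group is a boolean mask."""
    order = np.argsort(time)
    time, event, group = time[order], event[order], group[order]
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & group).sum()
        died = (time == t) & (event == 1)
        d, d1 = died.sum(), (died & group).sum()
        if n > 1:
            o_minus_e += d1 - d * n1 / n
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0

def best_first_split(time, event, X, names):
    """Return (variable, cutpoint, statistic) maximizing the logrank split."""
    best = (None, None, -np.inf)
    for j, name in enumerate(names):
        for c in np.unique(X[:, j])[:-1]:      # candidate cutpoints x <= c
            stat = logrank_stat(time, event, X[:, j] <= c)
            if stat > best[2]:
                best = (name, c, stat)
    return best

def bootstrap_first_splits(time, event, X, names, B=100, seed=1):
    """Tabulate which variable is chosen first over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    counts = {name: 0 for name in names}
    n = len(time)
    for _ in range(B):
        idx = rng.integers(0, n, n)
        var, _, _ = best_first_split(time[idx], event[idx], X[idx], names)
        counts[var] += 1
    return counts
```

On data with one strongly prognostic variable, the tabulated first splits concentrate on that variable, mirroring the creatinine/serum β2 microglobulin pattern of Figure 7.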
In addition, we investigated whether instability leads to variable survival estimates for the prognostic groups. We grouped the nodes of trees grown with a minimum node size of 40 observations into three prognostic groups for each of 100 bootstrap samples. We did not prune the trees, so the three prognostic groups included approximately the same numbers of patients across bootstrap samples. Figure 8 shows histograms of the median survival estimates for the prognostic groups obtained from the 100 bootstrap sample data sets. The bootstrap distributions of median survival estimates for the three groups are widely separated.
Figure 8 Median survival estimates for three-group prognostic stratification based on 100 bootstrap samples.
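The same idea can be sketched for the survival estimates themselves, using a bare-bones Kaplan–Meier median inside a bootstrap loop. As a simplifying assumption, the prognostic grouping is taken as given rather than re-grown from a tree on each bootstrap sample:

```python
import numpy as np

def km_median(time, event):
    """Kaplan-Meier median survival: smallest event time with S(t) <= 0.5."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s = 1.0
    for t in np.unique(time[event == 1]):
        n = (time >= t).sum()                  # number still at risk
        d = ((time == t) & (event == 1)).sum() # deaths at this time
        s *= 1 - d / n
        if s <= 0.5:
            return t
    return np.inf                              # median not reached

def bootstrap_medians(time, event, groups, B=100, seed=2):
    """Median survival per prognostic group over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    out = {g: [] for g in np.unique(groups)}
    n = len(time)
    for _ in range(B):
        idx = rng.integers(0, n, n)
        for g in out:
            m = groups[idx] == g
            out[g].append(km_median(time[idx][m], event[idx][m]))
    return out
```

Histograms of the per-group bootstrap medians correspond to the displays in Figure 8; well-separated distributions indicate a stable prognostic stratification.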
VII. DISCUSSION

The primary objective of tree-based methods with censored survival data is to provide an easy-to-understand description of groups of patients. The highly adaptive tree-based procedures require resampling methods for model selection, and resampling methods are also useful in understanding the stability of the tree structures. Software implementing tree-based methods with logrank splitting in S-PLUS is available from the first author. Other software has been written for tree-based modeling of survival data: the RPART program implements recursive partitioning based on within-node error, which includes exponential model-based survival trees (20), and another implementation based on the logrank test statistics of Segal (6), called TSSA, is available from that author.
ACKNOWLEDGMENTS

We thank Drs. Salmon, Coltman, and Barlogie for encouragement to use the myeloma data. Supported by the U.S. NIH through NCI 2 P01 CA 53996.
REFERENCES

1. Cox DR. Regression models and life-tables [with discussion]. J R Stat Soc B 1972; 34:187–220.
2. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
3. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
4. Gordon L, Olshen RA. Tree-structured survival analysis. Cancer Treat Rep 1985; 69:1065–1069.
5. Ciampi A, Hogg S, McKinney S, Thiffault J. RECPAM: a computer program for recursive partition and amalgamation for censored survival data. Comput Methods Progr Biomed 1988; 26:239–256.
6. Segal MR. Regression trees for censored data. Biometrics 1988; 44:35–48.
7. LeBlanc M, Crowley J. Relative risk regression trees for censored survival data. Biometrics 1992; 48:411–425.
8. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993; 88:457–467.
9. Albain K, Crowley J, LeBlanc M, Livingston R. Determinants of improved outcome in small cell lung cancer: an analysis of the 2580 patient Southwest Oncology Group data base. J Clin Oncol 1990; 8:1563–1574.
10. Ciampi A, Thiffault J, Nakache JP, Asselain B. Stratification by stepwise regression, correspondence analysis and recursive partition. Comput Stat Data Anal 1986; 4:185–204.
11. Kwak LW, Halpern J, Olshen RA, Horning SJ. Prognostic significance of actual dose intensity in diffuse large-cell lymphoma: results of a tree-structured survival analysis. J Clin Oncol 1990; 8:963–977.
12. Salmon SE, Crowley J, Grogan TM, Finley P, Pugh RP, Barlogie B. Combination chemotherapy, glucocorticoids, and interferon alpha in the treatment of multiple myeloma: a Southwest Oncology Group study. J Clin Oncol 1994; 12:2405–2414.
13. Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 1966; 50:163–170.
14. Crowley J, LeBlanc M, Gentleman R, Salmon S. Exploratory methods in survival analysis. In: Koul HL, Deshpande JV, eds. Analysis of Censored Data. IMS Lecture Notes–Monograph Series 27. Hayward, CA: Institute of Mathematical Statistics, 1995, pp. 55–77.
15. LeBlanc M, Crowley J. Step-function covariate effects in the proportional-hazards model. Can J Stat 1995; 23:109–129.
16. Davis R, Anderson J. Exponential survival trees. Stat Med 1989; 8:947–962.
17. Efron B, Tibshirani R. Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 1997; 92:548–560.
18. Akaike H. A new look at the statistical model identification. IEEE Trans Automatic Control 1974; 19:716–723.
19. Durie BGM, Salmon SE. A clinical staging system for multiple myeloma. Correlation of measured myeloma cell mass with presenting clinical features, response to treatment, and survival. Cancer 1975; 36:842–854.
20. Therneau TM, Atkinson EJ. An introduction to recursive partitioning using the RPART routines. Technical report, Mayo Foundation (distributed in PostScript with the RPART package), 1997.
23
Problems in Interpreting Clinical Trials

Lillian L. Siu and Ian F. Tannock
Princess Margaret Hospital, Toronto, Ontario, Canada
I. INTRODUCTION
The oncological literature is flooded with published clinical trials, yet unfortunately many of these trials are performed with little or no chance of improving clinical practice. Some trials may be well conducted but ask irrelevant questions, whereas others attempt to address important issues with poor methodology. This chapter focuses on key elements that need to be considered when planning or evaluating a clinical trial (Table 1).
II. IMPORTANCE AND RELEVANCE OF THE QUESTION

Table 1 Key Questions when Evaluating a Clinical Trial
• Does the trial address an important question?
• Is the design of the study appropriate? Single arm versus randomized? Nonblinded versus blinded? Adequate sample size?
• Are the end points appropriate? Are they well defined?
• Does the report of the study reflect its results? What is the probability of false-positive or false-negative results? Was the analysis performed in a rigorous way? Are all the patients accounted for? Do the results address the primary end points and hypotheses?
• Do the results fit with clinical experience (external validity)?
• Are the results generalizable such that they should influence oncologic practice?

Therapeutic advances in oncology that alter patient outcome and clinical practice occur as a consequence of a multistep process that includes formulation of a pertinent question, generation of an appropriate hypothesis, testing of the hypothesis in properly conducted clinical trial(s), and careful interpretation and accurate presentation of trial results. Priorities in cancer research should be directed toward strategies that lead to an improved therapeutic index, either through the modification of existing methods or through the development of novel approaches. Far too often, time and resources are expended on trivial issues that do not ultimately benefit patients. Thoughtful scrutiny of the research question
to determine its clinical relevance is an essential first step toward a successful investigation.

How might one determine whether the question being raised in a clinical trial is important? In a survey examining treatment preferences of physicians for different presentations of non-small cell lung cancer (NSCLC), Mackillop et al. (1) suggested that expert physicians might act as surrogates for their patients and that their input might be helpful to ethics committees in evaluating the appropriateness of clinical trials before they are conducted. A subsequent survey showed that lay surrogates were unable to discern differences in the acceptability of clinical experiments that were clear to experts, and most would appreciate access to expert opinions before consenting to trial participation (2). For example, lay surrogates initially found trials comparing lobectomy versus segmentectomy for operable NSCLC and five different regimens of chemotherapy for treatment of metastatic NSCLC to be equally acceptable. When the preferences of expert physicians, who accepted the first trial but rejected the second, were made known to the lay surrogates, their acceptance of the second trial declined dramatically. Expert physicians are familiar with the benefits and risks of diagnostic or therapeutic interventions and can therefore make an informed decision about enrollment into a clinical study. Ideally, clinical trials into which a majority of physician surrogates would enter themselves are the ones addressing scientifically sound, clinically relevant, and ethically acceptable questions.

Although expert physicians are the logical candidates to evaluate the validity of a trial question, several factors may influence their views, such as their specialty training and geographic location. Utilizing the physician surrogate method in an attempt to define various controversies in the management of genitourinary malignancies, Moore et al. (3) distributed questionnaires to urologists, radiation oncologists, and medical oncologists who treated genitourinary cancer in Canada, Great Britain, and the United States. There was a tendency for respondents to choose their own treatment modality, and British urologists were collectively more conservative than their North American colleagues with respect to recommending radical surgery or chemotherapy. A follow-up survey was mailed to the same physicians summarizing the treatment selections from the initial questionnaire (4). Some of the scenarios in the questionnaire illustrated clinical equipoise, a state of uncertainty in which there is no consensus within the expert community about the comparative merits of the alternatives to be tested (5). Recognition of the existence of controversy among experts did not substantially alter physicians' personal biases about entering themselves on trials, but a greater proportion were willing to offer such options to their patients. Although this disparity might reflect a double standard, it seems justifiable in that at least some physicians are not imposing their individual beliefs onto their patients, especially when these beliefs are not founded on objective information or scientific proof. Another notable observation that arose from the same survey is the readiness of investigators to accrue to studies that are relatively easy to perform but do not settle important clinical dilemmas (4). For example, a study in metastatic renal cell carcinoma that compares interferon alone with interferon plus vinblastine is feasible but uninspiring, since both treatment arms include agents of low activity and are unlikely to yield a breakthrough in the management of this disease.
In contrast, a study that strives to answer a fundamental question, such as one comparing radical prostatectomy with radical radiotherapy for localized prostate cancer, will likely suffer from poor patient recruitment. Paradoxically, expert physicians agree almost unanimously that the comparison of these two drastically different strategies represents a high priority for clinical research, yet many are reluctant to address it through recruitment of patients to a randomized controlled trial. Non-evidence-based personal bias on the part of physicians can be a deterrent to the execution of clinical trials that answer relevant questions, and conscious efforts to eliminate this factor are warranted.
III. DESIGN OF THE STUDY

As in any rigorous scientific experiment, a clinical trial that is being undertaken to address a pertinent issue in cancer research must be goal directed and hypothesis driven. New treatments that appear promising after phase I studies of toxicity and feasibility and phase II studies of biological activity need to be tested in comparison with the currently accepted standard. These phase III studies generally require randomization of patients to receive either the new or the conventional therapy, thus helping to ensure that patient characteristics and other known prognostic factors are well balanced between the treatment arms.

The main objectives of early noncomparative clinical studies of a new anticancer therapy are to establish feasibility and to estimate biological activity. Randomized control patients are not needed, and their presence would only obscure the real purposes of these studies and delay their completion (6,7). Results obtained from nonrandomized trials may help to generate new hypotheses but should not be compared with those from another institution, or with historical experience at the same institution, to draw definitive conclusions. Patients in nonrandomized trials may differ substantially in several factors other than their treatment, such that indirect comparisons of trial outcome often lead to false claims of therapeutic superiority that are not supported by subsequent randomized trials (8). Selection of patients with favorable characteristics, such as good performance status and normal organ function, and the provision of close medical attention while on trial can contribute to a better outcome in trial subjects than in their historical or nonstudy counterparts. Furthermore, it is a general phenomenon that participants in clinical trials tend to have better survival than nonparticipants (9,10).

Attempts to match nonrandomized groups of patients by stage do not ensure their comparability. Stage migration refers to the influence on disease detection of newer and more sensitive imaging techniques such as computed tomography and magnetic resonance scanning. Patients with disease that would have been missed by previously used methods are upstaged, and the inclusion of patients with lower tumor bulk in a higher stage category yields a better prognosis for that stage. This effect can produce an apparent improvement in survival for each stage but no change in overall survival (Fig. 1).
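The arithmetic of stage migration can be illustrated with the six hypothetical cohorts described in the legend to Figure 1; the particular assignment of cohorts to old and new stages below is an assumed reconstruction consistent with that description, not read from the diagram itself:

```python
# Six equal cohorts with survival 100, 80, 60, 40, 20, 0 (%), as in Figure 1.
# The stage assignments are an assumed reconstruction of the diagram.
old_staging = {"I": [100, 80], "II": [60, 40], "III": [20, 0]}
new_staging = {"I": [100], "II": [80, 60], "III": [40, 20, 0]}

def stage_survival(staging):
    """Mean survival within each stage (cohorts are equal sized)."""
    return {s: sum(c) / len(c) for s, c in staging.items()}

def overall_survival(staging):
    """Overall survival pooled over all cohorts, ignoring stage labels."""
    cohorts = [v for c in staging.values() for v in c]
    return sum(cohorts) / len(cohorts)

old_by_stage = stage_survival(old_staging)   # stage-by-stage, old staging
new_by_stage = stage_survival(new_staging)   # stage-by-stage, new staging
# Every stage "improves" under the new staging, yet overall survival
# is 50% under both systems.
```

Under this assignment each stage improves (e.g., stage I from 90% to 100%) while the pooled survival stays at 50%, which is exactly the Will Rogers phenomenon described below.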
Figure 1 Stage migration. The diagram illustrates that a change in staging investigations may lead to an apparent improvement of results within each stage without changing the overall results. In the hypothetical example, patients with a given type of malignancy have a spectrum of disease and are represented by equal cohorts whose survival is 100% (least disease), 80%, 60%, 40%, 20%, and 0% (most disease). Older staging investigations classify these cohorts into stages as shown at the top. Newer and more sensitive staging investigations classify the cohorts into stages as shown at the bottom. There is an apparent improvement in survival when compared stage by stage, but overall survival of 50% remains unchanged. (Adapted from Ref. 67.)

Stage migration impedes the comparison of currently staged patients with those staged with prior techniques. Feinstein et al. (11) named this effect the Will Rogers phenomenon, after the quotation that "when the Okies left Oklahoma and moved to California they raised the average intelligence level in both states."

A further source of bias, in both randomized and nonrandomized trials, is the propensity of investigators to continue, and to publish, trials with a positive outcome. For nonrandomized trials, comparison of such selected results with historical data precludes any valid conclusions. Suppose a new treatment that produces no net benefit is being tested in several pilot studies; by chance alone, the first few patients in some studies will do well, whereas the first few patients in others will do poorly. The former studies are likely to be continued, whereas the latter may be quickly abandoned. The adoption of a "two-stage" clinical trial design may reduce this bias and allow a treatment to be evaluated in an adequate number of subjects before the decision is made to proceed or to stop; the error probabilities of treatment acceptance or rejection are predetermined by the investigators (12). Publication bias is another cause of inappropriate optimism concerning the value of a new treatment: it is well recognized that positive findings are much more frequently submitted, presented, and published than negative results (13,14), creating a preponderance of false-positive studies in the literature.

All clinical trials should address a hypothesis. A hypothesis is a presumption made a priori, suggested usually by previous preclinical and/or clinical experience. A clear statement of the hypothesis of interest requires that both the dependent (outcome) variables and the independent (treatment, prognostic factor) variables be explicitly described (15). Ideally, a null hypothesis and an alternative hypothesis should be specified before the implementation of a research study (16). In phase III studies, the null hypothesis assumes that there is no treatment effect between the control and the experimental arms and that any observed difference is due to chance, whereas the alternative hypothesis states the opposite. Predetermination of a decision rule is necessary, such that at the completion of the study one is able to reject the null hypothesis and accept the alternative, or vice versa, based on the trial results obtained, and so draw meaningful conclusions. For example, in the first randomized trials of adjuvant chemotherapy in breast cancer patients with positive axillary nodes (17,18), the null hypotheses proposed that no difference existed between the control and the chemotherapy arms. Significant reductions in treatment failures observed in the chemotherapy arms in these studies refuted the null hypotheses and supported the alternative hypotheses, confirming the benefit of adjuvant chemotherapy in this patient population.
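As a sketch of such a prespecified decision rule, a two-sided test comparing failure rates between a control and a chemotherapy arm might look as follows; the counts are hypothetical and not taken from the cited trials:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference in failure rates between two arms."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled rate under the null
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # standard error under the null
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60/150 failures on control vs. 40/150 on chemotherapy.
z, p = two_proportion_z(60, 150, 40, 150)
# Decision rule fixed in advance: reject the null hypothesis if p < 0.05.
```

Here the decision rule (two-sided, 5% level) is fixed before the data are examined, which is the essential point of the preceding paragraph.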
IV. APPROPRIATENESS OF THE END POINT

The end point of a clinical trial should reflect the outcome that the research question sets out to measure. Typically, phase II clinical trials that evaluate the biological activity of a chemotherapeutic agent or combination of agents against specific malignancies use tumor shrinkage or response as their main end point. It should be emphasized that tumor response is a measure of biological activity and not of benefit to patients; for example, a very toxic drug might cause tumor shrinkage but might also decrease the patients' quality of life without increasing survival. Phase III clinical trials ask questions about benefit to patients; appropriate end points for these trials are therefore duration and quality of survival (Table 2).

Table 2 End Points in Clinical Trials
1. Explanatory or biological end points (used in phase II clinical trials)
   a. Tumor response
   b. Marker reduction
2. Pragmatic end points (used in phase III clinical trials)
   a. Survival
   b. Quality of life
3. Surrogate end points (that predict pragmatic end points)

Anderson et al. (19) pointed out the fallacy of analyzing survival as a function of tumor response category. The association of better prognosis with response to treatment does not imply cause and effect. Pretreatment characteristics of patients that lead to a favorable prognosis (e.g., high performance status) are often the same as those that yield a good response to therapy. Thus, response may simply be a marker for the patient subgroup who were destined to fare well regardless of therapy. Likewise, the assumption that response to treatment automatically translates into an improvement in quality of life is erroneous. Symptomatic relief obtained from anticancer therapy is likely to require tumor shrinkage, but this may be offset by treatment-induced toxicity; consequently, the overall effect on patients' quality of life can be variable.

There is wide disparity in reported rates of tumor response when patients with the same type of cancer are treated with similar chemotherapy. It is meaningless to speak of a given drug regimen (e.g., a combination of drugs A, B, and C) as having, for example, a 28% response rate for NSCLC. The reported response rates using 5-fluorouracil for the treatment of metastatic colorectal cancer ranged from 8% to 85% (20), whereas those of methotrexate for head and neck cancer ranged from 8% to 60% (21). These large differences may be accounted for partly by factors that exert a true impact on tumor response, such as differences in the treatment protocol (e.g., dosage and schedule), differences in patient characteristics (e.g., performance status and sites of metastatic disease), and differences in quality of care (Table 3). Comparison of response rates between nonrandomized series is invalid, even if the same drug doses and schedules are used, because of the patient-based factors that influence response rates. Other causes of diversity in reported response rates are artifactual and arise from differences in the way investigators gather and report their data (22). Considerable heterogeneity exists in the criteria that have been used to determine response to therapy. Such heterogeneity can create confusion in the interpretation of trial results and render intercomparison of clinical trials difficult or impossible (23,24).
The World Health Organization and cooperative groups established more uniform response criteria that have helped to standardize outcome assessment in clinical trials of efficacy (25–27). Despite such efforts, the traditional response criteria used to assess the biological activity of chemotherapeutic agents have limitations. For example, the categorizations of minimal response (25–50% reduction in the sum of cross-sectional areas of index lesions) and stable disease (<25% change in area), unless of meaningful duration, do not correlate with patient benefit. Reports of transient or borderline reductions in tumor size are susceptible to large measurement error and should not be used as criteria of response: they are merely a guide to continue therapy, provided that treatment is tolerated with acceptable levels of toxicity.

Table 3 Factors that Influence Response Rate in Clinical Trials
1. Treatment-based factors: e.g., drug dosage, drug schedule, quality of care
2. Patient-based factors: e.g., performance status, tumor stage, number of metastatic sites
3. Measurement biases: e.g., intraobserver and interobserver errors
4. Variations in response criteria

Even in the face of stringent response definitions, measurements of malignant lesions by physical examination or radiological evaluation are subject to intraobserver and interobserver errors. Warr et al. (28) demonstrated considerable inaccuracies in tumor measurements performed by physicians on real and simulated malignant lesions, particularly in the case of small nodules (Table 4). Clinical practice commonly involves the comparison of serial measurements with baseline lesions, and errors in the initial and sequential measurements can lead to false categorization of tumor response. In addition, overestimation of tumor response can occur because physicians may be subconsciously biased by their desire to see a response to therapy. Whenever feasible, objective external review of tumor measurements should be instituted to minimize this potential source of bias.

Surrogate or intermediate end points refer to events or observations occurring in the course of a disease that are believed to be precursors of the ultimate outcome of primary interest (29). There are clinical scenarios in which the application of surrogate end points appears logical and practical. The major goal in trials of adjuvant chemotherapy used with effective local treatment is to improve survival. However, determination of survival (e.g., for breast cancer) in these trials will take a long time, whereas the duration required for follow-up with a
Table 4 False Categorization of Response from Comparison of All Pairs of Measurements on the Same Lesion

                                     Percent false categorization
Measurement                          PR                  PR + MR    Progression
Simulated nodules (1.0–2.6 cm)       12.6                31.0       34.3
Simulated nodules (3.2–6.5 cm)       1.3                 19.7       24.0
Neck nodes                           13.1                32.1       33.4
Lung metastases (CXR)                0.8                 11.2       15.9
Liver size                           (A) 8.5 (B) 18.4    —          28.7

PR = partial response; MR = minor response; CXR = chest radiograph. PR was a >50% decrease in area, PR + MR was a >25% decrease in area, and progression was a >25% increase in area, except for liver size. For liver size, PR was defined as a >50% decrease (A) or a >30% decrease (B) in the sum of the linear measurements of the liver edge below the costal margin in the midline and midclavicular line; progression was defined as a >25% increase in the same measurement.
Source: Ref. 28.
surrogate end point such as relapse-free survival is shorter. As well, a low event rate of the primary end point may preclude feasibility in a clinical trial, or feasibility may be limited to a multi-institutional setting. For example, in a primary prevention trial to determine the effect of aspirin on the incidence of colorectal cancer among male physicians, over 22,000 subjects were accrued, with 118 new cases of invasive colorectal cancer identified (30). Adenomatous polyps are neoplasms that appear to be the precursors of most invasive cancers in the large bowel and may therefore represent a useful surrogate end point. In the Polyp Prevention Study, aspirin use was assessed in 793 subjects, of whom 259 developed at least one colonic or rectal adenoma detected by colonoscopy 1 year after study entry (31). The practical advantage of using a valid surrogate end point is its need for a much smaller sample size (32). Surrogate end points can be very useful, but caution must be exercised against using nonvalidated surrogate end points with unknown predictive ability for the true end point. Another problem arises if therapy has specific effects on a surrogate end point. For example, suramin is known to inhibit release of prostate-specific antigen (PSA) (33), and reduction in PSA is a surrogate end point commonly used in prostatic cancer trials; this end point might therefore be misleading when evaluating the anticancer properties of suramin. The duration and quality of survival of patients with this disease remain the appropriate end points for trials with this drug (34,35).

End points of phase III studies should assess benefit to patients and should therefore include the duration of survival and its quality. Unfortunately, achievements in oncology that prolong survival in patients with metastatic cancers do not occur frequently.
Thus, it is not realistic to base the design of trials that compare treatments for metastatic cancer in adults on the expectation of a substantial difference in the duration of survival. For patients with advanced malignancies who have a poor outlook, the aim of treatment should be directed toward palliation of their symptoms. Whereas duration of survival is easy to measure and is therefore often considered a "hard" end point, evaluation of quality of life in patients has traditionally been viewed as vague and difficult. The development of validated quality-of-life instruments has allowed accurate assessment of this important end point. In fact, improvement in pain control alone has led to the approval of the chemotherapeutic regimen of mitoxantrone plus prednisone for the treatment of symptomatic hormone-refractory prostate cancer (36). Similarly, gemcitabine has been accepted as therapy for the palliation of patients with advanced pancreatic cancer, based on a randomized trial demonstrating its superiority over 5-fluorouracil in providing clinical benefit to such patients (37). Quality of life should no longer be regarded as a "soft" end point of patient benefit. Increasing familiarity with its measurement will provide clinicians with an appropriate measure to evaluate the palliative effects of treatment. When using quality of life as a primary or major end point in clinical trials, it is important to establish a hypothesis about the expected degree of change in a predefined important aspect of quality of life, as summarized in Table 5.

Table 5 Key Elements in the Evaluation of Quality of Life in Clinical Trials
1. Patient based
2. Define primary end point using one quality of life measure that is relevant to patients in the study (others exploratory)
3. Define hypothesis about change that is clinically important
4. Blind assessment where possible

The primary measure of quality of life might be an overall summary scale or a measure of the dominant symptom, such as pain in patients with metastatic prostate cancer. When used in a rigorous way, quality of life end points can give important information about the palliative value of treatment.
V. FALSE-POSITIVE AND FALSE-NEGATIVE TRIALS
Besides publication bias, two additional factors increase the probability of falsely declaring clinical trial results to be positive: the performance of multiple significance tests and the low prevalence of true-positive studies leading to therapeutic advances. In clinical trials, multiple comparisons are commonly undertaken for various end points, for subgroups, and for serial interim evaluations during patient accrual and follow-up. The type I error, also known as the α error, is the error made by reporting a significant difference between two treatments when none exists. Typically, the α error is set at the 5% level, and 1 in 20 p values for comparison of equivalent outcomes will be less than 5% by chance alone. The number of implicit and explicit statistical comparisons executed in reports of clinical trials is often large (38–40). There is disagreement about the value of correcting for multiple tests (41,42), and adjustments are rarely done in practice. Many of these comparisons are done in a post hoc fashion, and therefore any detected differences are to be regarded as hypothesis generating rather than hypothesis testing. Final conclusions from clinical trials should derive only from comparisons of the major predefined end point(s). Excessive manipulation of study data results in "data torturing," which can lead to the dissemination of incorrect information to the research community and to patients (39).

Subgroup analyses of data from randomized trials seek to identify "effect modifiers," characteristics of the patients or treatment that modify the effect of the intervention under study (43). Judicious application of subgroup analyses can provide ideas for new trials, but they must not be confused with the primary analysis, which can give definitive information. Caution should be exercised against the tendency to accept exploratory subset analyses that seem reasonable. For example, in the trials of adjuvant chemotherapy for colorectal cancer, reductions in risk of recurrence were associated with younger and female patients in one study but with older and male patients in another (44,45). These divergent results of unplanned subgroup analysis are almost certainly due to statistical artifact.

Periodic monitoring of the accumulating data in a trial can give interim information that might indicate early stopping: because of a substantial difference between the arms, because one might predict that the trial cannot show a difference if it continues to its preplanned accrual goals, or because of unacceptable toxicity. Review of outcome information should not be undertaken by the investigators, since this is likely to introduce bias; rather, interim data should be reviewed by an independent Data Monitoring Committee. A recommendation for trial closure must be based on sufficient evidence: early interim analyses of limited available data require very small p values for stopping, whereas later analyses can have stopping p values closer to conventional levels of significance (46). After completion of accrual, multiple reanalyses of the data can also generate multiple significance tests, with the possibility of showing apparent effects that are not sustained. For example, in a European Organization for Research and Treatment of Cancer trial of adjuvant therapy for breast cancer, early results suggested a survival benefit for patients receiving chemotherapy but not for those receiving tamoxifen, whereas the mature 8-year results showed the opposite (47,48). A limited number of analyses should be undertaken at predefined times.

The low expectation of therapeutic advances may also contribute to false-positive reporting (40,49–51). The probability that a new treatment will prove superior to the reference approach is very low (49).
For example, in a review of all breast cancer abstracts published in the Program/Proceedings of the American Society of Clinical Oncology from 1984 to 1993, only 16% of the adjuvant trials and 2% of the advanced disease trials reported a survival benefit for the experimental treatment (52). Because true differences in outcome are uncommon, a large proportion of the trials that declare positive results will be false positives (whereas few of the trials declared negative will be false negatives). This concept is not intuitively obvious and may be clarified by the following example (Parmar MKB, personal communication). Suppose 400 trials are undertaken with a significance level of 5% (α error = 0.05, two-sided) and with 90% power (the probability of detecting a treatment difference if in truth there is one). If the prevalence of trials with a real difference in favor of experimental treatment is 10% (40 trials), then 36 (40 × 0.9) trials will be reported correctly as true positives and 9 (360 × 0.025) trials will be reported erroneously as false positives (0.025 because only half of the two-sided 5% α corresponds to differences favoring the experimental arm). Altogether, among the 45 trials that declare positive results, 1 in 5 (9 in 45) are false-positive trials. The remaining 355 of the 400 trials will be reported as negative, and of these, 342 (360 − 18) are true negatives and 13 (or about 1 in 27) are false negatives: the 4 real differences that were missed plus the 9 null trials whose results were ‘‘significant’’ in favor of the control arm. In Bayesian terminology, in the setting of a low pretest or prior probability of true positive studies, a single positive study even when well designed and
484
Siu and Tannock
performed has a substantial chance of being false positive, especially if the p value is in the range of 0.01 to 0.05, with only a relatively small increment in the post-test probability of true results. The type II error, or β error, represents the failure of a clinical trial to recognize a difference between treatment groups when one truly exists. False negativity occurs most often because the sample size is too small to detect plausible differences in outcome (53). For example, to detect or rule out a 20% absolute difference in expected outcome events with 90% power, more than 200 patients are required (the exact number depends on the expected outcome in the control or standard therapy group). Clinical trials that seek differences in outcome events in the range of 10% will require accrual of about 1000 patients to obtain reasonable power (54).
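Both calculations above can be checked with a short script. This is a sketch only: the sample-size function uses the standard normal-approximation formula for comparing two proportions, and the 50% control-arm event rate assumed below is illustrative (as the text notes, the exact number depends on the expected outcome in the control group).

```python
from math import ceil
from statistics import NormalDist

# Parmar's example: 400 trials, alpha = 0.05 (two-sided), power = 0.90,
# and a true treatment advance in only 10% of trials.
trials, prevalence, power = 400, 0.10, 0.90
real = trials * prevalence            # 40 trials with a genuine difference
null = trials - real                  # 360 trials with no real difference
true_pos = real * power               # 36 correctly declared positive
false_pos = null * 0.025              # 9 null trials "positive" for the new arm
positives = true_pos + false_pos      # 45 positive reports in all
print(f"false-positive fraction among positives: {false_pos / positives:.2f}")

# Sample size per arm to compare two proportions (normal approximation):
# n = (z_{alpha/2} + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
def n_per_arm(p1, p2, alpha=0.05, power=0.90):
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return ceil((za + zb) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

print(2 * n_per_arm(0.5, 0.7))   # ~240 patients for a 20% absolute difference
print(2 * n_per_arm(0.5, 0.6))   # ~1030 patients for a 10% difference
```

The totals reproduce the figures quoted in the text: somewhat more than 200 patients for a 20% absolute difference, and roughly 1000 for a 10% difference.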
VI. RIGOROUS VERSUS NONRIGOROUS ANALYSIS

At the completion of a clinical trial, the collected data must be assimilated using appropriate methodologies and presented in an objective and thoughtful manner. Many guidelines and recommendations are available to promote quality in the reporting of clinical trials (55–59). Essential elements that should be clearly specified in every report include study hypothesis and design; patient population and entry criteria that were used in its selection; actual therapy delivered, especially if different from the intended therapy; treatment complications and toxicity; methods of outcome assessment; statistical evaluation; and the accounting of all study subjects. Failure to adhere to basic standards in the reporting of clinical trials may lead to dissemination of misleading information and eventually to inappropriate treatment of patients. Baar and Tannock (60) illustrated the impact of a rigorous, in contrast to a nonrigorous, approach in the analysis and reporting of
Table 6 Guidelines for the Reporting of Clinical Trials

• Are the criteria for study entry and patient selection described?
• Does the report provide details of the control and the experimental therapies?
• Are protocol violations reported and the reasons for such violations explained?
• Are stringent response criteria used for disease evaluation?
• Are all patients accounted for and included in the analysis of results?
• Does the survival curve include all patients, and does it avoid comparing survival by treatment response?
• Does the report provide a comprehensive analysis of toxicity?
• Are measures of quality of life or costs of treatment included?
clinical trials. Using a demonstration model involving fictional patients and a hypothetical chemotherapy regimen ‘‘CABOOM’’ for metastatic carcinoma of the great toe, a single set of data conveyed dramatically opposite conclusions depending on the way they were interpreted and presented. The inappropriate methods of analysis that led to the spurious results were not fictional; each of them had been used in one or more reports of clinical trials published in a single year in the Journal of Clinical Oncology. The differences between rigorous and nonrigorous analysis used to generate these reports can be used as a guide to reporting (and how not to report) clinical trials and are summarized in Table 6.
VII. EXTERNAL VALIDITY AND GENERALIZABILITY

Trials that have rigid selection criteria may be limited in their applicability to patients with similar characteristics, and extrapolation beyond such patients is unsound. For example, in a Cancer and Leukemia Group B trial of radiotherapy with or without induction chemotherapy for stage III NSCLC (61), only patients with high performance status were allowed on study. As a result of this strict inclusion criterion, multiple participating centers took 3 years to accrue 155 eligible patients, despite the high prevalence of this malignancy. Furthermore, the survival benefit noted in favor of combined chemoradiation may not apply to other stage III NSCLC patients who have poorer performance status. Simple entry criteria allow the accrual of larger numbers of patients and increase the heterogeneity of the study population, rendering the final results more generalizable. Internally and externally valid clinical trials are those that are performed using stringent methodology, are reported in a prudent fashion, and produce meaningful results that are consistent with clinical experience (Table 7). For example, in a randomized study of only 60 patients which compared surgery with or without preoperative chemotherapy for stage IIIA NSCLC, a substantial and significant survival advantage was reported for the addition of chemotherapy (62). This study had problems of design and analysis (i.e., poor internal validity) and generated a result that is inconsistent with data from other trials, many of which were much
Table 7 Internal and External Validity
• Internal validity: Were rigorous and appropriate methods used in the design, analysis, and reporting of the trial? • External validity: How do the results of the trial compare with other data and with past clinical experience?
larger (i.e., poor external validity) (63–66). Despite these defects, the trial had a substantial influence by virtue of being a lead article in the New England Journal of Medicine, a flagrant example of publication bias.
VIII. CONCLUSION

Many subtle problems can influence the design, analysis, and reporting of clinical trials and hence the validity of their results. Quality control imposed by cooperative groups has improved but not eliminated these problems. When reviewing the results of a clinical trial, readers should consider two relatively simple questions: Were the design and analysis of the trial performed according to high standards (internal validity), and do the results fit with clinical experience and those of other trials (external validity)? If the answers to these questions are ‘‘yes,’’ and the trial addresses an important question, it is appropriate to include its results as part of the basis for making clinical decisions.
REFERENCES

1. Mackillop WJ, Ward GK, O'Sullivan B. The use of expert surrogates to evaluate clinical trials in non-small cell lung cancer. Br J Cancer 1986; 54:661–667.
2. Mackillop WJ, Palmer MJ, O'Sullivan B, Ward GK, Steele R, Dotsikas G. Clinical trials in cancer: the role of surrogate patients in defining what constitutes an ethically acceptable clinical experiment. Br J Cancer 1989; 59:388–395.
3. Moore MJ, O'Sullivan B, Tannock IF. How expert physicians would wish to be treated if they had genitourinary cancer. J Clin Oncol 1988; 6:1736–1745.
4. Moore MJ, O'Sullivan B, Tannock IF. Are treatment strategies of urologic oncologists influenced by the opinions of their colleagues? Br J Cancer 1990; 62:988–991.
5. Freedman B. Equipoise and the ethics of clinical research. N Engl J Med 1987; 317:141–145.
6. Gehan EA. The determination of the number of patients required in a preliminary and a follow-up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13:346–353.
7. Gehan EA, Freireich EJ. Non-randomized controls in cancer clinical trials. N Engl J Med 1974; 290:198–203.
8. Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clinical trials. Am J Med 1982; 72:233–240.
9. Antman K, Amato D, Wood W, Carson J, Suit H, Proppek K, Carey R, Greenberger J, Wilson R, Frei E III. Selection bias in clinical trials. J Clin Oncol 1985; 3:1142–1147.
10. Davis S, Wright PW, Schulman SF, Hill LD, Pinkham RD, Johnson LP, Jones TW, Kellog HB Jr, Radke HM, Sikkema WW, Jolly PC, Hammar SP. Participants in prospective, randomized clinical trials for resected non-small cell lung cancer have improved survival compared with nonparticipants in such trials. Cancer 1985; 56:1710–1718.
11. Feinstein AR, Sosin DM, Wells CK. The Will Rogers phenomenon. Stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. N Engl J Med 1985; 312:1604–1608.
12. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clin Trials 1989; 10:1–10.
13. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl Cancer Inst 1989; 81:107–115.
14. De Bellefeuille C, Morrison CA, Tannock IF. The fate of abstracts submitted to a cancer meeting: factors which influence presentation and subsequent publication. Ann Oncol 1992; 3:187–191.
15. Lyman GH, Kuderer NM. A primer for evaluating clinical trials. Cancer Control 1997; 4:413–418.
16. Neyman J, Pearson ES. On the use and interpretation of certain test criteria. Biometrika 1928; 20A:175–240.
17. Fisher B, Carbone P, Economou SG, Frelick R, Glass A, Lerner H, Redmond C, Zelen M, Band P, Katrych DL, Wolmark N, Fisher ER. L-Phenylalanine mustard (L-PAM) in the management of primary breast cancer. N Engl J Med 1975; 292:117–122.
18. Bonadonna G, Brusamolino E, Valagussa P, Rossi A, Brugnatelli L, Brambilla C, De Lena M, Tancini G, Bajetta E, Musumeci R, Veronesi U. Combination chemotherapy as an adjuvant treatment in operable breast cancer. N Engl J Med 1976; 294:405–410.
19. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. J Clin Oncol 1983; 1:710–719.
20. Moertel CG, Thynne GS. Cancer Medicine. 2nd ed. Philadelphia: Lea & Febiger, 1982, pp. 1830–1859.
21. Tannock IF. Chemotherapy for head and neck cancer. J Otolaryngol 1984; 13:99–104.
22. Rudnick SA, Feinstein AR. An analysis of the reporting of results in lung cancer drug trials. J Natl Cancer Inst 1980; 64:1337–1343.
23. Miller AB, Hoogstraten B, Staquet M, Winkler A. Reporting results of cancer treatment. Cancer 1981; 47:207–214.
24. Tonkin K, Tritchler D, Tannock IF. Criteria of tumor response used in clinical trials of chemotherapy. J Clin Oncol 1985; 3:870–875.
25. World Health Organization. WHO Handbook for Reporting Results of Cancer Treatment. WHO Offset Publication No. 48. Geneva, 1979.
26. Oken MM, Creech RH, Tormey DC, Horton J, Davis TE, McFadden ET, Carbone PP. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol 1982; 5:649–655.
27. Green S, Weiss GR. Southwest Oncology Group standard response criteria, endpoint definitions and toxicity criteria. Invest New Drugs 1992; 10:239–253.
28. Warr D, McKinney S, Tannock I. Influence of measurement error on assessment of response to anticancer chemotherapy: proposal for new criteria of tumor response. J Clin Oncol 1984; 2:1040–1046.
29. Ellenberg SS. Surrogate endpoints. Br J Cancer 1993; 68:457–459.
30. Gann PH, Manson JE, Glynn RJ, Buring JE, Hennekens CH. Low-dose aspirin and incidence of colorectal tumors in a randomized trial. J Natl Cancer Inst 1993; 85:1220–1224.
31. Greenberg ER, Baron JA, Freeman DH, Mandel JS, Haile R. Reduced risk of large-bowel adenomas among aspirin users. J Natl Cancer Inst 1993; 85:912–916.
32. Wittes J, Lakatos E, Probstfield J. Surrogate endpoints in clinical trials: cardiovascular diseases. Stat Med 1989; 8:415–425.
33. La Rocca RV, Danesi R, Cooper MR, Jamis-Dow CA, Ewing MW, Linehan WM, Myers CE. Effect of suramin on human prostate cancer cells in vitro. J Urol 1991; 145:393–398.
34. Clark JW, Chabner BA. Suramin and prostate cancer: where do we go from here? J Clin Oncol 1995; 13:2155–2157.
35. Small EJ, Marshall ME, Reyno L, Meyers F, Natale R, Meyer M, Lenehan P, Chen L, Eisenberger M. Superiority of suramin + hydrocortisone (S + H) over placebo + hydrocortisone (P + H): results of a multi-center double-blind phase III study in patients with hormone refractory prostate cancer (HRPC). Proc Am Soc Clin Oncol 1997; 17:308a (abstract no. 1187).
36. Tannock IF, Osoba D, Stockler MR, Ernst S, Neville AJ, Moore MJ, Armitage GR, Wilson JJ, Venner PM, Coppin CML, Murphy KC. Chemotherapy with mitoxantrone plus prednisone or prednisone alone for symptomatic hormone-refractory prostate cancer: a Canadian randomized trial with palliative endpoints. J Clin Oncol 1996; 14:1756–1764.
37. Burris HA III, Moore MJ, Andersen J, Green MR, Rothenberg ML, Modiano MR, Cripps MC, Portenoy RK, Storniolo AM, Tarassoff P, Nelson R, Dorr A, Stephens CD, Von Hoff DD. Improvements in survival and clinical benefit with gemcitabine as first-line therapy for patients with advanced pancreatic cancer: a randomized trial. J Clin Oncol 1997; 15:2403–2414.
38. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med 1987; 317:426–432.
39. Mills JL. Data torturing. N Engl J Med 1993; 329:1196–1199.
40. Tannock IF. False-positive results in clinical trials: multiple significance tests and the problem of unreported comparisons. J Natl Cancer Inst 1996; 88:206–207.
41. Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics 1987; 43:487–498.
42. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990; 1:43–46.
43. Oxman AD, Guyatt GH. A consumer's guide to subgroup analyses. Ann Intern Med 1992; 116:78–82.
44. Laurie JA, Moertel CG, Fleming TR, Wieand HS, Leigh JE, Rubin J, McCormack GW, Gerstner JB, Krook JE, Malliard J, Twito DI, Morton RF, Tschetter LK, Barlow JF. Surgical adjuvant therapy for large-bowel carcinoma: an evaluation of levamisole and the combination of levamisole and fluorouracil. J Clin Oncol 1989; 7:1447–1456.
45. Moertel CG, Fleming TR, MacDonald JS, Haller DG, Laurie JA, Goodman PJ, Ungerleider JS, Emerson WA, Tormey DC, Glick JH, Veeder MH, Mailliard JA. Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma. N Engl J Med 1990; 322:352–358.
46. Pocock SJ. When to stop a clinical trial. BMJ 1992; 305:335–340.
47. Rubens RD, Bartelink H, Engelsman E, Hayward JL, Rotmensz N, Sylvester R, van der Schueren E, Papadiamantis J, Vassilaros SD, Wildiers J, Winter PJ. Locally advanced breast cancer: the contribution of cytotoxic and endocrine treatment to radiotherapy. An EORTC Breast Cancer Co-operative Group Trial (10792). Eur J Cancer Clin Oncol 1989; 25:667–678.
48. Bartelink H, Rubens RD, van der Schueren E, Sylvester R. Hormonal therapy prolongs survival in irradiated locally advanced breast cancer: a European Organization for Research and Treatment of Cancer randomized phase III trial. J Clin Oncol 1997; 15:207–215.
49. Staquet MJ, Rozencweig M, Von Hoff DD, Muggia FM. The delta and epsilon errors in the assessment of cancer clinical trials. Cancer Treat Rep 1979; 63:1917–1921.
50. Ciampi A, Till JE. Null results in clinical trials: the need for a decision-theory approach. Br J Cancer 1980; 41:618–629.
51. Parmar MKB, Ungerleider RS, Simon R. Assessing whether to perform a confirmatory randomized clinical trial. J Natl Cancer Inst 1996; 88:1645–1651.
52. Chlebowski RT, Lillington LM. A decade of breast cancer clinical investigations: results as reported in the Program/Proceedings of the American Society of Clinical Oncology. J Clin Oncol 1994; 12:1789–1795.
53. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 ‘‘negative’’ results. N Engl J Med 1978; 299:690–694.
54. Boag JW, Haybittle JL, Fowler JF, Emery EW. The number of patients required in a clinical trial. Br J Radiol 1971; 44:122–125.
55. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med 1982; 306:1332–1337.
56. Zelen M. Guidelines for publishing papers on cancer clinical trials: responsibilities of editors and authors. J Clin Oncol 1983; 1:164–169.
57. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep 1985; 69:1–3.
58. Green SJ, Fleming TR. Guidelines for the reporting of clinical trials. Semin Oncol 1988; 15:455–461.
59. Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D, Stroup DF. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA 1996; 276:637–639.
60. Baar J, Tannock I. Analyzing the same data in two ways: a demonstration model to illustrate the reporting and misreporting of clinical trials. J Clin Oncol 1989; 7:969–978.
61. Dillman RO, Seagren SL, Propert KJ, Guerra J, Eaton WL, Perry M, Carey RW, Frei EF III, Green MR. A randomized trial of induction chemotherapy plus high-dose radiation versus radiation alone in stage III non-small-cell lung cancer. N Engl J Med 1990; 323:940–945.
62. Rosell R, Gomez-Codina J, Camps C, Maestre J, Padille J, Canto A, Mate JL, Li S, Roig J, Olazabal A, Canela M, Ariza A, Skagel Z, Morera-Prat J, Abad A. A randomized trial comparing preoperative chemotherapy plus surgery with surgery alone in patients with non-small-cell lung cancer. N Engl J Med 1994; 330:153–158.
63. Cocquyt V, De Neve W, Van Belle SJ-P. Chemotherapy and surgery versus surgery alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1756.
64. Chanarin N. Chemotherapy and surgery versus surgery alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1756.
65. Mills NE, Fishman CL, Jacobson DR. Chemotherapy and surgery versus surgery alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1756.
66. McLachlan SA, Stockler M. Chemotherapy and surgery versus surgery alone in non-small-cell lung cancer. N Engl J Med 1994; 330:1757.
67. Bush RS. Cancer of the ovary: natural history. In: Peckham MJ, Carter RL, eds. Malignancies of the Ovary, Uterus and Cervix: The Management of Malignant Disease, Series #2. London: Edward Arnold, 1979:26–37.
24
Commonly Misused Approaches in the Analysis of Cancer Clinical Trials

James R. Anderson
University of Nebraska Medical Center, Omaha, Nebraska
I. INTRODUCTION
Analysis of cancer clinical trial data is sometimes not straightforward. This chapter discusses some commonly used incorrect or inappropriate methods of statistical analysis and, where they exist, the appropriate techniques.
II. COMPARISON OF SURVIVAL OR OTHER ‘‘TIME-TO-EVENT’’ VARIABLES BY OUTCOME

Many comparisons of survival, failure-free survival, or other ‘‘time-to-event’’ variables are made by grouping patients on some outcome of treatment. Examples of these analyses include analysis of survival by best response to treatment, analysis of survival by delivered chemotherapy dose intensity, survival by level of toxicity experienced, and survival by compliance to some protocol-specified treatment (these compliance data often resulting from quality control review of radiotherapy or surgical procedures). All these analyses are problematic, first and foremost because patients are classified into groups based on an outcome of treatment that unfolds over time (response, toxicity, dose intensity, treatment compliance). Valid analyses of time-to-event data measure the time from some ‘‘time of entry’’ to the occurrence of some ‘‘event of interest’’ and compare groups based on characteristics or factors known at the time of entry. Since none of these ‘‘outcomes’’ is known at the time treatment starts, the standard survival analysis approaches are invalid and other statistical techniques must be used (1). Moreover, even when appropriate analyses are conducted, the interpretation of their results can be difficult.

A. Analysis of Survival by Tumor Response
The effectiveness of treatments for patients with advanced cancers is often assessed by computing complete and partial response rates and overall survival from the start of treatment. For many years, survival by tumor response category was commonly included as part of the reports of these treatments. Patients were characterized as ‘‘responders’’ or ‘‘nonresponders,’’ Kaplan-Meier life-table estimates of survival from the start of treatment were calculated for each group, and survival by tumor response was compared, with the statistical significance of observed differences assessed using the log-rank or another significance test appropriate for time-to-event data. A review of articles published in Cancer or Cancer Treatment Reports found 228 articles presenting data on responders and nonresponders, with 61% containing formal statistical comparisons of survival by tumor response category (2). The Journal of Clinical Oncology, in its last six issues of 1984 and its first six issues of 1985, published 18 papers that included analyses of survival by tumor response, of which 10 provided formal statistical comparisons of the survival of responders and nonresponders (3). These analyses appear to have been used in two ways. First, responders surviving longer than nonresponders was used as a justification for more aggressive therapy, since therapy that increased the response rate would then necessarily result in increased survival. Second, some investigators considered analysis of survival by response category a surrogate for a placebo-controlled trial: responders benefited from the therapy and thus constituted the ‘‘treatment’’ group; nonresponders did not benefit, and thus their survival could be considered similar to that of untreated patients (placebo controls). However, the usual method of comparing responders to nonresponders is wrong and should never be used (4).
The estimates of the survival distributions are biased, the statistical tests are invalid, and the conclusions drawn are misleading. The bias results from the requirement that responders must live long enough for the response to be observed; there is no such requirement for nonresponders. This bias also affects the validity of the statistical tests. Responders will be counted as ‘‘at risk’’ of death from the start of treatment to the time of response, biasing the analysis in favor of responders.
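This guarantee-time bias can be demonstrated with a small simulation (a sketch with invented numbers: every patient has identical exponential survival, and ‘‘response’’ is assigned at random among patients still alive at an assumed 3-month evaluation time, so any apparent responder advantage is pure artifact):

```python
import random
from statistics import mean

random.seed(1)
LANDMARK = 3.0  # months at which response is assessed (illustrative assumption)

# Every patient has the same prognosis: exponential survival, mean 12 months.
surv = [random.expovariate(1 / 12) for _ in range(100_000)]

# "Response" requires surviving to the evaluation time, then occurs at random
# (40% of survivors), so it carries no prognostic information whatsoever.
resp = [t > LANDMARK and random.random() < 0.4 for t in surv]

responders = [t for t, r in zip(surv, resp) if r]
nonresponders = [t for t, r in zip(surv, resp) if not r]

# Naive comparison from the start of treatment: responders appear to live
# several months longer, purely because they had to survive to be classified.
print(f"responders:    {mean(responders):.1f} months")
print(f"nonresponders: {mean(nonresponders):.1f} months")
```

Despite identical prognoses by construction, the naive comparison shows responders surviving roughly four months longer on average than nonresponders.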
There are a number of valid approaches to the statistical comparison of survival by response category (4). The ‘‘landmark’’ method evaluates response for all patients at some fixed time following the onset of treatment. For phase II studies, this landmark might be a time at which most patients who are going to respond have already done so (say, after three cycles of therapy). Patients who progress or who are ‘‘off-study’’ prior to the landmark are excluded from the analysis. Patients are analyzed according to their response at the landmark, regardless of subsequent changes in response status. Survival estimates and statistical tests are conditional on the patients’ landmark response. A second approach involves the application of the method of Mantel and Byar (5), first applied to the analysis of heart transplant data. Patients accrue follow-up time in various ‘‘states’’ that they occupy during treatment. All patients begin in the nonresponse state and move to the response state at the time of their response. This time-dependent covariate analysis removes the bias inherent in the usual analysis method. Simon and Makuch (6) proposed a method of obtaining estimates of survival probabilities for responders and nonresponders, using ideas from the landmark and Mantel-Byar approaches. Even if appropriate statistical methods are used, longer survival for responders, as compared with nonresponders, cannot be used to conclude that response ‘‘caused’’ longer survival. Response may act as a surrogate marker for prognostically favorable patients. Thus, responders may survive longer than nonresponders, not because of an effect of response on survival but because response identifies patients with pretreatment characteristics that favor longer survival. It will generally be impossible to distinguish cases where response prolongs survival from cases where it simply acts as a marker for favorable-prognosis patients.
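A minimal sketch of the landmark method, using the same kind of hypothetical data (exponential survival, response assigned at random among patients alive at an assumed 3-month landmark): because patients who die before the landmark are excluded and survival is measured from the landmark itself, the artifactual responder advantage disappears.

```python
import random
from statistics import mean

random.seed(2)
LANDMARK = 3.0  # e.g., after three cycles of therapy (illustrative assumption)

surv = [random.expovariate(1 / 12) for _ in range(100_000)]  # mean 12 months

# Landmark analysis: exclude patients who die (or go off study) before the
# landmark; classify the rest by response status AT the landmark; measure
# survival FROM the landmark onward.
at_risk = [t for t in surv if t > LANDMARK]
is_resp = [random.random() < 0.4 for _ in at_risk]  # response is noninformative here

resp_surv = [t - LANDMARK for t, r in zip(at_risk, is_resp) if r]
nonresp_surv = [t - LANDMARK for t, r in zip(at_risk, is_resp) if not r]

# Both groups now show essentially the same post-landmark survival, as they
# should when response carries no prognostic information.
print(f"responders:    {mean(resp_surv):.1f} months past landmark")
print(f"nonresponders: {mean(nonresp_surv):.1f} months past landmark")
```

A Mantel-Byar analysis would instead treat response as a time-dependent covariate, counting each responder's pre-response follow-up in the nonresponse state; the landmark version is shown here only because it is the simpler of the two to sketch.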
Analyses of survival by tumor response category are therefore rarely helpful and should be avoided. The methodological guidelines of Cancer Treatment Reports indicated that the journal would not publish comparisons of survival by tumor response (7). Few such analyses were published in 1998, although a few inappropriate analyses still make their way past peer review (8–10). Some investigators interested in assessing the effect of response on survival are applying appropriate statistical techniques (11).

B. Analysis of Survival by Chemotherapy Dose Intensity

In experimental in vivo systems, there is substantial evidence for a steep dose–response curve for many antineoplastic agents (12). However, clinical evidence for a steep dose–response curve beyond those doses used in standard chemotherapy regimens is limited (13). In a few studies where patients have been assigned to doses of chemotherapeutic agents less than those considered standard (as opposed to receiving lower doses because of toxicity experienced), there is evidence
of disease control and survival less than that expected with standard doses (14–16). Clinical cancer investigators understandably have attempted to demonstrate the importance of dose or dose intensity in the delivery of cancer treatment. However, much of the clinical evidence for the importance of dose has come either from the retrospective analysis of dose or dose intensity of therapy delivered on clinical trial or from the nonrandomized comparison of results of different treatments classified by a dose-intensity score. In the retrospective analyses of dose delivered, each patient’s actual dose of the chemotherapeutic agents received is calculated as a fraction of the protocol-specified dose (see, for instance, 17–19). Patients are then classified into groups based on the percent of protocol-specified dose delivered (e.g., ≥85%, 65–84%, <65%) and outcome is compared among dose groups. Percent of protocol-specified dose delivered is sometimes replaced by relative dose intensity, calculated as mg/m2/wk of the agents delivered as a percentage of the protocol-specified dose intensity. A statistically significant association between high dose or dose intensity and outcome is claimed to be evidence of a dose–response effect. However, these analyses cannot demonstrate that drug dose is directly related to therapeutic effect. It is possible that toxicity leading to dose reduction acts as a marker of patients with a poor prognosis generally and additional chemotherapy over and above that which can be delivered under normal circumstances would be of no benefit (13,20,21). Evidence that this may be the case comes from a randomized trial in pediatric nonlymphoblastic lymphoma comparing COMP therapy to COMP plus daunorubicin (D-COMP) (22). The addition of daunorubicin had no effect on failure-free survival overall.
However, patients randomized to D-COMP who received less than the protocol-specified dose of daunorubicin had a significantly poorer failure-free survival than both those patients on D-COMP receiving full-dose daunorubicin and those patients randomized to COMP alone who received no daunorubicin by design (23). Other dose–response analyses have resulted from the retrospective analysis of published data, in which regimens are allocated a dose-intensity score compared with some standard regimen (like a hypothetical ‘‘gold standard’’ for intermediate-grade non-Hodgkin’s lymphoma [24,25] or CMF for adjuvant therapy of breast cancer [26]). However, these analyses are particularly prone to problems that are inherent to nonrandomized comparisons. These comparisons may be biased because of differences between series with respect to factors like patient characteristics and/or supportive care measures that may impact substantially on outcome (13). Second- and third-generation regimens for intermediate-grade lymphoma (m-BACOD, ProMACE-CytaBOM, MACOP-B), which would have been expected to produce improved disease control based on these ‘‘dose-intensity’’ analyses (24,25), were shown to be essentially equivalent to CHOP
in a randomized trial comparing all four regimens (27). Differences in patient characteristics known to have a substantial impact on treatment outcome in non-Hodgkin’s lymphoma (28) between the pilot experience with the second- and third-generation regimens at single institutions and the CHOP experience in the cooperative groups are the most likely explanation for the observed differences. The best and most convincing evidence for a dose–response effect in cancer treatment comes from studies that randomly assign patients to differing doses or dose intensities of chemotherapeutic agents. However, only a few examples of randomized clinical trials demonstrate that dose intensity greater than that of conventional therapy but less than that requiring stem-cell support produces improved disease control (e.g., 16,29,30). The available clinical research data from randomized trials and elsewhere suggest that small or even moderate reductions in drug dose for nontrivial reasons (like toxicity) do not compromise patient survival.
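The percent-of-protocol-dose and relative-dose-intensity calculations used in these retrospective analyses can be sketched as follows (the function names and example doses are invented for illustration; the ≥85%, 65–84%, and <65% groupings follow the text):

```python
def relative_dose_intensity(delivered_mg_m2, delivered_weeks,
                            protocol_mg_m2, protocol_weeks):
    """Delivered mg/m2/wk as a percentage of the protocol-specified intensity."""
    delivered = delivered_mg_m2 / delivered_weeks
    intended = protocol_mg_m2 / protocol_weeks
    return 100.0 * delivered / intended

def dose_group(rdi_percent):
    """Classify a patient into the dose groups used in retrospective analyses."""
    if rdi_percent >= 85:
        return ">=85%"
    if rdi_percent >= 65:
        return "65-84%"
    return "<65%"

# A patient intended to receive 300 mg/m2 over 12 weeks who actually received
# 240 mg/m2 over 16 weeks (dose reductions plus treatment delays):
rdi = relative_dose_intensity(240, 16, 300, 12)
print(f"RDI = {rdi:.0f}% -> group {dose_group(rdi)}")  # RDI = 60% -> group <65%
```

Note that this hypothetical patient received 80% of the protocol-specified total dose, yet treatment delays reduced the delivered intensity to 60%, which is why percent of dose delivered and relative dose intensity can place the same patient in different groups.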
C. Analysis of Survival by Toxicity

Interest in the relationship of chemotherapy dose intensity to outcome has led to an interest in the association between toxicity experienced by patients and outcome. Sorensen et al. (31) reported that the occurrence of dose-limiting toxicity during treatment was associated with higher response rates and longer survival. However, comparisons of toxicity and survival suffer the same problems as survival by response: the longer a patient’s time on treatment, the greater the chance that toxicity of a certain severity will be observed (32). Breslow and Zandstra (33) observed a similar relationship between the level of bone marrow lymphocyte toxicity observed during postinduction treatment and duration of remission in childhood acute leukemia. The statistical techniques appropriate for analysis of survival by response are appropriate here, as are the caveats regarding interpretation. Toxicity can be a marker of adequate drug dosing and therefore tumor kill, or it can simply be an indicator of more prognostically favorable patients who are better able to withstand the cytotoxic effects of therapy.
D. Analysis of Survival by Compliance to Protocol-Specified Treatment

Investigators are often interested in assessing how deviations from protocol therapy affect outcome. Analysis of survival by percent of protocol dose delivered is an example of such an analysis. Other analyses of outcome by compliance often result from quality control review of specific modality treatment. For instance, a process of radiation therapy quality assurance might assess whether the volume, dose, and timing of the radiation therapy delivered is consistent with protocol
496
Anderson
requirements. Surgical quality control might assess the completeness of surgical resection and evidence for negative margins. Quality assurance processes are valuable in that they provide information on how therapy is delivered in practice. Feedback from the quality control procedures to physicians treating patients may increase the likelihood that future patients receive therapy closer to that specified by protocol. Feedback to the protocol research team may lead to changes in protocol-specified therapy, making it more likely to be delivered according to protocol in the field. For instance, data from the radiation therapy quality assurance process for a recently completed protocol of the Intergroup Rhabdomyosarcoma Study Group showed that some radiation therapists or parents were unwilling to deliver hyperfractionated radiotherapy to children less than 6 years of age, compromising the randomized comparison of conventional versus hyperfractionated radiation for these patients (34). However, investigators are often interested in comparisons of outcome between patients who comply with protocol-specified treatment and those who do not, or in comparisons of "treatment received" in addition to the randomized treatment comparisons. These analyses have great potential for bias and may be misleading (35). First, the characteristic of being compliant may itself have prognostic importance, irrespective of the intervention. The Coronary Drug Project Research Group showed improved survival for subjects who complied with protocol treatment, as compared with noncompliers, in both the drug and placebo groups (36). Second, lack of compliance may be related to patient or disease characteristics. The dose of radiation may be decreased from that specified by protocol because of toxicity; the field size may be reduced from that specified in an attempt to limit toxicity to vital organs.
Hard-to-treat tumors may be hard to cure, and adjustment of these compliance comparisons for prognostic factors may be inadequate. An analysis of prognostic factors for children with bulky retroperitoneal embryonal histology rhabdomyosarcoma suggested that patients whose tumors were "debulked" experienced improved survival, as compared with patients whose tumors were not debulked (37). Nevertheless, it does not necessarily follow that more extensive surgery for these patients would lead to improved outcome. It is possible that only those patients whose tumors could be debulked actually were, and that patients whose tumors were initially only biopsied had disease that both precluded a debulking procedure and placed them at higher risk of treatment failure.
E. Conclusions: Survival by Outcome Variables
The interest in the analysis of survival by other outcomes of treatment (response, toxicity, dose intensity, treatment compliance) is understandable, but most of the usual approaches to the analysis of these data produce survival estimates that
Misused Approaches to Analysis of Clinical Trials
are biased and statistical tests that are invalid. Even when appropriate analytic techniques are used, the accurate interpretation of the results of these analyses can be difficult.
III. COMPETING RISKS AND ESTIMATION OF CAUSE-SPECIFIC FAILURE RATES
Competing risks arise when patients may experience a number of events of interest, but the occurrence of some events precludes the observation of others. Examples of competing risks include:
• Death from a cause "not tumor related" precluding the observation of death from cancer;
• Death precluding the observation of time to diagnosis of a second cancer in lymphoma patients following high-dose chemotherapy and stem cell transplant;
• Bone marrow relapse precluding the observation of central nervous system relapse as a first event in children treated for pediatric acute lymphoblastic leukemia.
Sometimes, analyses focus inappropriately on cause-specific failures when the overall event rate is the more suitable end point. When interest is appropriately focused on cause-specific failures in the presence of competing risks, the techniques used for the estimation of the event rates are often incorrect.
A. Inappropriate Focus on Cause-specific Failures
Since the goal of cancer therapy is to control the cancer and reduce the risk of cancer death, there is sometimes an interest in focusing on disease-related events and discounting events thought not to be tumor related. For instance, tumor mortality could be estimated using the usual survival analysis methodology, but treating only deaths due to tumor as failures and censoring deaths that were not tumor related. DeVita and colleagues reported tumor mortality for advanced Hodgkin's disease patients treated with MOPP, censoring patients who died of causes other than Hodgkin's disease (38,39). However, Hodgkin's disease patients often die of causes other than Hodgkin's disease, with many of the deaths a result of the effects of their treatment (infections, pulmonary complications, second malignancies).
The Cancer and Leukemia Group B reported that 28% of all deaths observed on a study of the treatment of advanced Hodgkin's disease were not Hodgkin's disease related (40). Censoring deaths attributable to causes thought not to be tumor related can give an inappropriately favorable view of the likely survival of patients after treatment (41). In most cases, the most appropriate analysis of outcome on clinical cancer trials is to count all deaths, irrespective of their cause. Sometimes there is an interest in looking at cause-specific failures in the context of a clinical trial, because therapy may be directed at preventing certain kinds of failures. For instance, if therapies for the treatment of childhood acute lymphoblastic leukemia differed by the intensity of prophylactic treatment of the central nervous system (CNS), comparisons of outcome might focus on differences in the rates of CNS relapse as a first event. If treatments compare local modalities to the primary tumor (e.g., with or without radiation therapy; with or without surgery), the rate of local tumor control may be of interest. However, interpretation of these analyses of "location-specific failures" can be difficult. First, these analyses focus attention away from what should be the primary analysis objective: the comparison of the overall success of the treatments. Second, these analyses can be misleading, because differences in therapeutic strategies may lead to changes in the observed distribution of failures by site without affecting the overall risk of failure (42). This phenomenon was first seen in the treatment of childhood acute lymphoblastic leukemia and was designated "the dough-boy effect" by Dr. Mark Nesbit (43). A regimen with increased intensity of CNS prophylaxis reduced the risk of CNS relapse as the first event but increased the risk of bone marrow relapse as the first event, with the overall failure-free survival being similar to that seen with the use of standard CNS prophylaxis. This paradoxical effect, in which one kind of event is exchanged for another, can occur when "successful" treatment of one compartment of disease allows the disease recurrence to be observed elsewhere. A similar effect is sometimes seen in the treatment of advanced Hodgkin's disease.
Patients receiving radiotherapy to sites of bulk disease are less likely to relapse first in those sites but have a higher rate of "relapse in previously uninvolved sites" and (in most series) an equivalent failure-free survival. Although it is natural for researchers who apply local therapies to evaluate their effect on the risk of local failure, comparisons of treatment outcomes should focus on differences in the distributions of time to first failure.
B. Appropriate Estimation of Cause-specific Failure Rates
Sometimes the focus on cause-specific failure is appropriate, for instance, in the study of risk factors for the development of a second malignancy after cancer treatment. However, estimation of the distribution of "time to second malignancy" or other cause-specific failure using the Kaplan-Meier method and "censoring" patients who die without experiencing the event of interest is problematic. It assumes that, had they not died, the risk of experiencing the event of interest for those who die would be the same as that for patients followed beyond their time of death, an assumption that is untestable. In addition, even if
this assumption were true, the "cause-specific time to event" curves produced in this way describe the hypothetical risk of experiencing the event of interest if all deaths could be prevented. Most often what is of interest is an estimate of the proportion of the total patient group treated who would be expected to develop the specific event of interest (e.g., a second malignancy) by some time T. Cumulative incidence functions provide the appropriate estimates (44). Additional methods of summarizing competing risks failure time data also exist (45). Gooley et al. (46, and elsewhere in this volume) showed that the estimates of cause-specific incidence rates incorrectly obtained by applying the Kaplan-Meier method are greater than or equal to the estimates obtained from cumulative incidence estimates, with the difference often substantial when many patients experience the competing risk. Comparisons of the risk of experiencing a cause-specific failure may be made using standard statistical methods for the analysis of failure time data (logrank test, proportional hazards model), treating all other failures as censored observations (44). Darrington et al. (47) investigated the risk of secondary acute myelogenous leukemia (AML) or myelodysplastic syndrome (MDS) in 511 patients who had received high-dose chemotherapy, with or without total body irradiation (TBI), and stem cell transplant for the treatment of relapsed lymphoma. They estimated the cumulative incidence of AML/MDS to be 4% at 5 years for both patients with Hodgkin's disease and non-Hodgkin's lymphoma (NHL). They also showed that TBI increased the risk of AML/MDS in patients with NHL who were 40 years of age or older when transplanted.
C. Conclusions: Competing Risks and Estimation of Cause-Specific Failure Rates
The analysis of cause-specific failure rates can sometimes be misleading, either because events related to the effects of treatment are ignored or because differences in therapeutic strategies can lead to a trade-off of site-specific failures without influencing overall treatment success. When interest is focused on the overall outcome of cancer treatment, analyses should count all relevant events as "failures." When attention is appropriately focused on the occurrence of certain events subject to competing risks, the cumulative incidence curve approach should be used to estimate cause-specific risk.
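The contrast between the (incorrect) complement of the Kaplan-Meier estimate and the cumulative incidence function can be sketched in a few lines. The event codes and the small data set below are invented purely for illustration; they are not from any study cited in this chapter.

```python
def km_surv(data, is_event):
    """Kaplan-Meier survivor probability at the end of follow-up.
    `data` is a list of (time, code); codes for which `is_event` is
    False are treated as censored observations."""
    s = 1.0
    for t in sorted(set(t for t, _ in data)):
        n_at_risk = sum(1 for ti, _ in data if ti >= t)
        d = sum(1 for ti, c in data if ti == t and is_event(c))
        s *= 1 - d / n_at_risk
    return s

def cuminc(data, cause):
    """Cumulative incidence of `cause` at the end of follow-up: at each
    event time, add P(still event-free just before t) times the hazard
    of `cause` at t; all causes deplete the event-free probability."""
    ci, s = 0.0, 1.0
    for t in sorted(set(t for t, _ in data)):
        n_at_risk = sum(1 for ti, _ in data if ti >= t)
        d_cause = sum(1 for ti, c in data if ti == t and c == cause)
        d_all = sum(1 for ti, c in data if ti == t and c != 0)
        ci += s * d_cause / n_at_risk
        s *= 1 - d_all / n_at_risk
    return ci

# codes: 1 = second malignancy, 2 = death without second malignancy, 0 = censored
data = [(1, 2), (2, 1), (3, 2), (4, 1), (5, 2), (6, 0), (7, 1), (8, 2)]
one_minus_km = 1 - km_surv(data, lambda c: c == 1)  # competing deaths censored
ci = cuminc(data, cause=1)
print(round(one_minus_km, 3), round(ci, 3))
```

On these data the Kaplan-Meier complement is 23/35 (about 0.657), while the cumulative incidence of a second malignancy is 0.4375, illustrating the inequality reported by Gooley et al.: the naive estimate overstates the proportion of patients who will ever develop the event.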
REFERENCES
1. Anderson JR, Cain KC, Gelber RD, Gelman RS. Analysis and interpretation of the comparison of survival by treatment outcome variables in cancer clinical trials. Cancer Treat Rep 1985; 69:1139–1144.
2. Weiss GB, Bunce H III, Hokanson JA. Comparing survival of responders and nonresponders after treatment: a potential source of confusion in interpreting cancer clinical trials. Control Clin Trials 1983; 4:43–52.
3. Anderson JR, Davis RB (Letter). Analysis of survival by tumor response. J Clin Oncol 1986; 4:114–116.
4. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. J Clin Oncol 1983; 1:710–719.
5. Mantel N, Byar DP. Evaluation of response-time data involving transient states: an illustration using heart-transplant data. JASA 1974; 69:81–86.
6. Simon R, Makuch RW. A non-parametric graphical representation of the relationship between survival and the occurrence of an event: application to responder versus non-responder bias. Stat Med 1984; 3:35–44.
7. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep 1985; 69:1–3.
8. Bonadonna G, Valagussa P, Brambilla C, et al. Primary chemotherapy in operable breast cancer. J Clin Oncol 1998; 16:93–100.
9. Ellis P, Smith I, Ashley S, et al. Clinical prognostic and predictive factors for primary chemotherapy in operable breast cancer. J Clin Oncol 1998; 16:107–114.
10. Dann EJ, Anatasi J, Larson RA. High-dose cladribine therapy for chronic myelogenous leukemia in the accelerated or blast phase. J Clin Oncol 1998; 16:1498–1504.
11. Smith DC, Dunn RL, Strawderman MS, Pienta KJ. Change in serum prostate-specific antigen as a marker of response to cytotoxic therapy for hormone-refractory prostate cancer. J Clin Oncol 1998; 16:1835–1843.
12. Frei E III, Canellos GP. Dose: a critical factor in cancer chemotherapy. Am J Med 1980; 69:585–594.
13. Henderson IC, Hayes DF, Gelman R. Dose-response in the treatment of breast cancer. A critical review. J Clin Oncol 1988; 6:1501–1515.
14. Dixon DO, Neilan B, Jones SE, et al. Effect of age on therapeutic outcome in advanced diffuse histiocytic lymphoma: The Southwest Oncology Group Experience. J Clin Oncol 1986; 3:295–305.
15. Tannock IF, Boyd NF, DeBoer G, et al. A randomized trial of two dose levels of cyclophosphamide, methotrexate and fluorouracil chemotherapy for patients with metastatic breast cancer. J Clin Oncol 1988; 6:1377–1387.
16. Wood WC, Budman DR, Korzun AH, et al. Dose and dose intensity of adjuvant therapy for stage II, node-positive breast carcinoma. N Engl J Med 1994; 330:1253–1259.
17. Bonadonna G, Valagussa P. Dose-response effect of adjuvant chemotherapy in breast cancer. N Engl J Med 1981; 304:10–15.
18. Carde P, MacKintosh FR, Rosenberg SA. A dose and time response analysis of the treatment of Hodgkin's disease with MOPP chemotherapy. J Clin Oncol 1983; 1:146–153.
19. Kwak LW, Halpern J, Olshen RA, Horning SJ. Prognostic significance of actual dose intensity in diffuse large-cell lymphoma: results of a tree-structured survival analysis. J Clin Oncol 1990; 8:963–977.
20. Anderson JR, Santarelli MT, Peterson B. Dose intensity in the treatment of diffuse large-cell lymphoma [letter]. J Clin Oncol 1990; 8:1927.
21. Redmond C, Fisher B, Weiand HS. The methodologic dilemma in retrospectively correlating the amount of chemotherapy received in adjuvant therapy protocols with disease-free survival. Cancer Treat Rep 1983; 67:519–526.
22. Chilcote RR, Krailo M, Kjeldsberg C, et al. Daunomycin plus COMP vs COMP therapy in childhood non-lymphoblastic lymphoma. Proc ASCO 1991; 10:289(1011).
23. Chilcote RR, Krailo M. Unpublished observations.
24. DeVita VT Jr, Hubbard SM, Longo DL. The chemotherapy of lymphomas: looking back, moving forward. Cancer Res 1987; 47:5810–5824.
25. Meyer RM, Hryniuk WM, Goodyear MD. The role of dose intensity in determining outcome in intermediate-grade non-Hodgkin's lymphoma. J Clin Oncol 1991; 9:339–347.
26. Hryniuk W, Levine MN. Analysis of dose intensity for adjuvant chemotherapy trials in stage II breast cancer. J Clin Oncol 1986; 4:1162–1170.
27. Fisher RI, Gaynor ER, Dahlberg S, et al. Comparison of a standard regimen (CHOP) with three intensive chemotherapy regimens for advanced non-Hodgkin's lymphoma. N Engl J Med 1993; 328:1002–1006.
28. Shipp MA, Harrington DP, Anderson JR, et al. A predictive model for aggressive non-Hodgkin's lymphoma. N Engl J Med 1993; 329:987–994.
29. Mayer RJ, Davis RB, Schiffer CA, et al. Comparative evaluation of intensive postremission therapy with different dose schedules of Ara-C in adults with acute myeloid leukemia (AML): Initial results of a CALGB Phase III study. Proc ASCO 1992; 11:261(853).
30. Kaye SB, Lewis CR, Paul J, et al. Randomized study of two doses of cisplatin with cyclophosphamide in epithelial ovarian cancer. Lancet 1992; 340:329–333.
31. Sorensen JB, Hansen HH, Dombernowsky P, et al. Chemotherapy for adenocarcinoma of the lung (WHO III): a randomized study of vindesine versus lomustine, cyclophosphamide and methotrexate versus all four drugs. J Clin Oncol 1987; 5:1169–1177.
32. Propert KJ, Anderson JR. Assessing the effect of toxicity on prognosis: methods of analysis and interpretation. J Clin Oncol 1988; 6:868–870.
33. Breslow N, Zandstra R. A note on the relationship between bone marrow lymphocytosis and remission duration in acute leukemia. Blood 1970; 36:246–249.
34. Intergroup Rhabdomyosarcoma Study Group. (1998). Unpublished observations.
35. Shuster JJ, Gieser PW. Radiation in pediatric Hodgkin's disease [reply]. J Clin Oncol 1998; 16:393.
36. Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N Engl J Med 1980; 303:1038–1041.
37. Blakely ML, Lobe TE, Anderson JR, et al. Does debulking improve survival in advanced stage retroperitoneal embryonal rhabdomyosarcoma? J Ped Surg 1999; 34:736–742.
38. DeVita VT Jr, Simon RM, Hubbard SM, et al. Curability of advanced Hodgkin's disease with chemotherapy: long-term follow-up of MOPP-treated patients at the National Cancer Institute. Ann Intern Med 1980; 92:587–595.
39. Longo DL, Young RC, Wesley M, et al. Twenty years of MOPP therapy for Hodgkin's disease. J Clin Oncol 1986; 4:1295–1306.
40. Canellos GP, Anderson JR, Propert KJ, et al. Chemotherapy of advanced Hodgkin's disease with MOPP, ABVD or MOPP alternating with ABVD. N Engl J Med 1992; 327:1478–1484.
41. Glick JH, Tsiatis A. MOPP/ABVD chemotherapy for advanced Hodgkin's disease. Ann Intern Med 1986; 104:876–878.
42. Gelman R, Gelber R, Henderson IC, et al. Improved methodology for analyzing local and distant recurrence. J Clin Oncol 1990; 8:548–555.
43. Bleyer WA, Sather HN, Nickerson HJ, et al. Monthly pulses of vincristine and prednisone prevent bone marrow and testicular relapse in low-risk childhood acute lymphoblastic leukemia: a report from the CCG-161 study by the Childrens Cancer Study Group. J Clin Oncol 1991; 9:1012–1021.
44. Kalbfleisch JD, Prentice RL. Statistical Analysis of Failure Time Data. New York: Wiley, pp. 163–171.
45. Pepe MS, Mori M. Kaplan-Meier, marginal or conditional probability curves in summarizing competing risks failure time data. Stat Med 1993; 12:737–751.
46. Gooley TA, Leisenring W, Crowley J, Storer BE. Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Stat Med 1999; 18:695–706.
47. Darrington DL, et al. Incidence and characterization of secondary myelodysplastic syndrome and acute myelogenous leukemia following high-dose chemoradiotherapy and autologous stem cell transplantation for lymphoid malignancies. J Clin Oncol 1994; 12:2527–2534.
25
Dose-Intensity Analysis
Joseph L. Pater
NCIC Clinical Trials Group, Queen's University, Kingston, Ontario, Canada
I. INTRODUCTION
Despite the inherent methodological difficulties to be discussed in this section, analyses that have attempted to relate the "intensity" of cytotoxic chemotherapy to its effectiveness have had substantial influence both on the interpretation of data from cancer clinical trials and on recommendations for practice in oncology. Prior to a discussion of analytic pitfalls, a brief description of the evolution of the concept of "dose intensity" and its impact on the design and analysis of trials is presented. It should be noted from the outset that the term dose intensity has not had a consistent definition. In fact, precisely how the amount of treatment given over time is quantified is an important problem in itself (see later). The publication that first provoked interest in the issue of the importance of delivering adequate doses of chemotherapy was a secondary analysis by Bonadonna and Valagussa (1) of randomized trials of cyclophosphamide, methotrexate, and 5-fluorouracil in the adjuvant therapy of women with breast cancer. This secondary analysis subdivided patients in the trials according to how much of their protocol-prescribed chemotherapy they actually received. There was a clear positive relationship between this proportion and survival. Patients who received 85% or more of their planned protocol dose had a 77% 5-year relapse-free survival, compared with 48% in those who received less than 65%. The authors concluded "it is necessary to administer combination chemotherapy at a full dose to achieve clinical benefit."
Shortly afterward, interest in the role of dose intensity was further heightened by a series of publications by Hryniuk and colleagues. Instead of examining outcomes of individual patients on a single trial, Hryniuk's approach was to correlate outcomes of groups of patients on different trials with the dose intensity of their treatment. In this case, the intensity of treatment was related not to the protocol-specified dose but to a standard regimen. In trials in both breast cancer and ovarian cancer the results of these analyses indicated a strong positive correlation between dose intensity and outcome. Thus, in the treatment of advanced breast cancer the correlation between planned dose intensity and response was 0.74 (p < 0.001) and between actually delivered dose intensity and response 0.82 (p < 0.01) (2). In the adjuvant setting a correlation of 0.81 (p < 0.00001) was found between dose intensity and 3-year relapse-free survival (3). In ovarian cancer the correlation between dose intensity and response was 0.57 (p < 0.01) (4). Reaction to these publications was quite varied. Although some authors pointed out alternate explanations for the results, most appeared to regard these analyses as having provided insight into an extremely important principle. For example, the then Director of the National Cancer Institute, Vincent DeVita, in an editorial commentary on Hryniuk and Levine's paper on adjuvant therapy, concluded "a strong case can be made for the point of view that the most toxic maneuver a physician can make when administering chemotherapy is to arbitrarily and unnecessarily reduce the dose" (5). Hryniuk himself argued "since dose intensity will most likely prove to be a major determinant of treatment outcomes in cancer chemotherapy, careful attention should be paid to this factor when protocols are designed or implemented even in routine clinical practice" (6).
Despite numerous trials designed to explore the impact of dose intensity on outcomes of therapy over the nearly 20 years since Bonadonna's publication, the topic remains controversial. Thus, a recent book (7) devoted a chapter to a debate on the topic. Similar discussions continue in other disease sites. The remainder of this section reviews the methodologic and statistical issues involved in the three settings mentioned above, namely, studies aimed at relating the delivery of treatment to outcomes in individual patients, studies attempting to find a relationship between chemotherapy dose and outcome among a group of trials, and finally trials aimed at demonstrating in a prospective randomized fashion an impact of dose or dose intensity.
II. RELATING DELIVERED DOSE TO OUTCOME IN INDIVIDUAL PATIENTS ON A TRIAL
Three years after the publication of Bonadonna's article, Redmond et al. (8) published a methodological and statistical critique of the analysis and interpretation of their findings. In my view, subsequent discussions of this issue have been mainly amplifications or reiterations of Redmond's article, so it is summarized in detail here.
A. Problems in Analyzing Results in Patients Who Stopped Treatment Before Its Intended Conclusion
Cancer chemotherapy regimens are usually given over at least several months, so no matter the setting, some patients are likely to develop disease recurrence or progression prior to the planned completion of therapy. How to calculate the "expected" amount of treatment in such patients is a problem. If the expected duration is considered to be that planned at the outset, patients who progress on treatment will by definition receive less than "full" doses, and a positive relation between "completeness" of treatment and outcome will be likely. An obvious solution to this problem is to consider as expected only the duration of treatment up to the time of recurrence, which was the approach taken by Bonadonna. However, this method leads to a potential bias in the opposite direction since, generally speaking, toxicity from chemotherapy is cumulative and doses tend to be reduced over time. Thus, patients who stop treatment early may have received a higher fraction of their expected dose over that time period. In fact, in applying this method to data from a National Surgical Adjuvant Breast and Bowel Project (NSABP) trial, Redmond et al. found an inverse relationship between dose and outcome; that is, patients who received higher doses over the time they were on treatment were more likely to develop disease recurrence. A third approach suggested by Redmond et al. was to use what has become known as the "landmark" (9) method: the effect of delivered dose up to a given point in time is assessed only among patients who were followed and were free of recurrence until that time. This method avoids the biases mentioned above but omits from the analysis patients who recur during the time of greatest risk of treatment failure.
The application of this method to the same NSABP trial indicated no effect of variation in the amount of drug received up to 2 years on subsequent disease-free survival. Finally, Redmond et al. proposed using dose administered up to a point in time as a time-dependent covariate, recalculated at the time of each recurrence, in a Cox proportional hazards model. In their application of this technique to NSABP data, however, they found a significant delivered-dose effect in the placebo arm, a finding reminiscent of the well-known results of the clofibrate trial (10). The apparent explanation for this result is that treatment may be delayed or omitted in the weeks preceding the diagnosis of recurrence. Thus, when drug delivery in the 2 months before diagnosis of recurrence was not included in the calculation, the dose effect disappeared in both the placebo and active treatment arms.
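The first bias Redmond et al. describe can be reproduced in a small simulation with entirely invented numbers (it is not their analysis or data): when the expected dose is the amount planned at the outset and treatment stops at relapse, patients who relapse early mechanically fall into the low-dose group, so a dose-outcome association appears even though dose is irrelevant by construction.

```python
import random

def simulate_received_dose_bias(n=10000, planned_cycles=12, seed=2):
    """Null simulation: relapse time is independent of dose, treatment is
    one cycle per month and stops at relapse, and 'expected' dose is the
    amount planned at the outset. Early relapsers then necessarily land
    in the low-dose group, manufacturing a dose-outcome association."""
    rng = random.Random(seed)
    high, low = [], []  # relapse times by ">= 85% of planned dose received"
    for _ in range(n):
        # mean 24 months; everyone relapses eventually in this toy model
        relapse_month = rng.expovariate(1 / 24)
        cycles = min(planned_cycles, int(relapse_month) + 1)
        (high if cycles / planned_cycles >= 0.85 else low).append(relapse_month)
    rfs_high = sum(t > 36 for t in high) / len(high)  # 3-year relapse-free fraction
    rfs_low = sum(t > 36 for t in low) / len(low)
    return rfs_high, rfs_low

rfs_high, rfs_low = simulate_received_dose_bias()
print(rfs_high, rfs_low)
```

The "high-dose" group shows a 3-year relapse-free fraction of roughly a third, while the "low-dose" group shows none at all: with treatment stopping at relapse, receiving at least 85% of twelve planned cycles requires surviving relapse-free for at least ten months, so the classification is itself an outcome.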
B. Confounding
The methodological problems of analyses that attempt to relate completeness of drug administration to outcome are not confined to the difficulties of dealing with patients who progress while on therapy. Citing the clofibrate trial mentioned above, Redmond et al. also pointed out that even if one could calculate delivered versus expected dose in an unbiased manner, there would still remain the problem that patients who comply with or who tolerate treatment may well differ from those who do not on a host of factors that themselves might relate to ultimate outcome. Thus, the relationship between administered dose and outcome might be real in the sense that it is not a product of bias in the way the data are assembled or analyzed, but still not be causal, because it is the product of confounding by underlying patient characteristics. Since the clinical application of an association of dose delivery and outcome rests on the assumption that the relationship is causal, the inability to draw a causal conclusion is a major concern (11). Analyses similar to those of Bonadonna have been repeated many times (12), sometimes with positive and sometimes with negative results. However, it is hard to argue with the conclusion of Redmond et al.'s initial publication, that is, that this issue will not be resolved by such analyses. Only a prospective examination of the question can produce definitive findings.
III. ANALYSES COMBINING DATA FROM MULTIPLE STUDIES
As mentioned earlier, the approach taken by Hryniuk and colleagues was, instead of relating outcomes of individual patients to the amount of treatment they received, to assess whether the outcomes of groups of patients treated with regimens that themselves varied in intended or delivered dose intensity correlated with that measure. To compare a variety of regimens used, for example, for the treatment of metastatic breast cancer, Hryniuk calculated the degree to which the rate of drug administration in mg/m2/wk corresponded to a standard protocol and expressed the average of all drugs in a given regimen as a single proportion—relative dose intensity (see Table 1 for an example of such a calculation). He then correlated this quantity, calculated for individual arms of clinical trials or for single-arm studies, with a measure of outcome, for example, response or median survival. Hryniuk's work has been criticized on two primary grounds: first, the method of calculating dose intensity and, second, the appropriateness of his method of combining data. With respect to the method of calculation, it has been pointed out that the choice of a standard regimen is arbitrary, and different results are obtained if different standards are used (12,13). Further, the approach ignores
Table 1 Calculation of Dose Intensity

Drug                         Dose                      mg/m2/wk   Reference mg/m2/wk   Dose intensity
Cyclophosphamide             100 mg/m2 days 1–14       350        560                  0.62
Methotrexate                 40 mg/m2 days 1 and 8     20         28                   0.71
Fluorouracil                 600 mg/m2 days 1 and 8    300        480                  0.62
Dose intensity for regimen                                                             0.65
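The arithmetic behind Table 1 can be written out in a few lines. This is a sketch that assumes the CMF regimen is given on a 28-day (4-week) cycle, which is what makes the weekly rates come out as tabulated.

```python
def weekly_dose(dose_per_admin, admins_per_cycle, cycle_weeks):
    """Average delivered dose per week (mg/m2/wk) over one cycle."""
    return dose_per_admin * admins_per_cycle / cycle_weeks

def dose_intensity(weekly, reference_weekly):
    """Relative dose intensity: delivered weekly rate over the
    weekly rate of the chosen standard regimen."""
    return weekly / reference_weekly

# CMF on a 28-day cycle, with the reference weekly rates of Table 1
cmf = {
    "cyclophosphamide": (weekly_dose(100, 14, 4), 560),  # 100 mg/m2 daily, days 1-14
    "methotrexate":     (weekly_dose(40, 2, 4), 28),     # 40 mg/m2 on days 1 and 8
    "fluorouracil":     (weekly_dose(600, 2, 4), 480),   # 600 mg/m2 on days 1 and 8
}
per_drug = {d: dose_intensity(w, ref) for d, (w, ref) in cmf.items()}
regimen = sum(per_drug.values()) / len(per_drug)  # unweighted average over drugs
print({d: round(v, 2) for d, v in per_drug.items()}, round(regimen, 2))
```

This reproduces the per-drug intensities 0.62, 0.71, and 0.62 and the regimen value 0.65, and makes explicit the two assumptions criticized in the text: the choice of reference rates and the unweighted averaging across drugs.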
the impact of drug schedule and makes untested assumptions about the therapeutic equivalence of different drugs. In more recent work (14), Hryniuk et al. accepted these criticisms and addressed them by calculating a new index called "summation dose intensity." In this approach the intensity of an individual drug is calculated relative to the dose of that drug that produces a 30% response rate as a single agent. This index is more empirically based and avoids reference to an arbitrary standard. As Hryniuk points out, whether it will resolve the debate about dose intensity will depend on its prospective validation in randomized trials. Hryniuk's work has also been considered a form of meta-analysis and criticized because of its failure to meet standards (15) for such studies. In my view, with the exception of an article by Meyer et al. (16), these studies are not actually meta-analyses. Meta-analyses are conventionally considered to involve combining data from a set of studies, each of which attempted to measure some common parameter, for example, an effect size or odds ratio. Hryniuk's work, on the other hand, compares results from one study to the next and estimates a parameter—the correlation between dose intensity and outcome—that is not measured in any one study. Irrespective of this distinction, however, the criticism that Hryniuk's early studies did not clearly state, for example, how trials were identified and selected seems valid. Formal meta-analytic techniques have, however, been applied to this issue. As mentioned, the article by Meyer et al. contains what I would consider a true meta-analysis of trials assessing the role of dose intensity in patients with non-Hodgkin's lymphoma. An accompanying editorial pointed out (13), however, that the arms of these trials differed in more than dose intensity.
Thus, because its conclusions were otherwise based on nonrandomized comparisons, the editorial argued that even in this analysis no level I evidence (in Sackett's [17] terminology) had been developed to support the role of dose intensity (13). In fact, despite the fact that they drew on data from randomized trials, Hryniuk's studies generated only what Sackett considers level V evidence, since they relied mostly on comparisons involving nonconcurrent, nonconsecutive case series. A much more extensive attempt to generate more conclusive evidence by combining data from randomized trials testing dose intensity was carried out by Torri et al. (18). Using standard search techniques, these authors identified all published randomized trials from 1975 to 1989 dealing with the chemotherapy of advanced ovarian cancer. For each arm of each trial they calculated dose with an "equalised standard method" similar to the summation standard method described above. They used a log-linear model that included a separate fixed-effect term for each study to estimate the relationship between total dose intensity and the log odds of response and log median survival. The inclusion of the fixed-effect terms ensured that comparisons were between the arms of the same trial, not across trials. They also used a multiple linear regression model to assess the relative impact of the intensity of various drugs. Their analysis showed a statistically significant relationship between dose intensity and both outcomes, although the magnitude of the relationship was less than that observed in Hryniuk's studies—a finding they attributed to the bias inherent in across-trial comparisons. They concluded "the validity of the dose intensity hypothesis in advanced ovarian cancer is substantiated based on the utilisation of improved methodology for analysis of available data. This approach suggests hypotheses for the intensification of therapy and reinforces the importance of formally evaluating dose intense regimens in prospective randomised clinical trials."
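The key feature of Torri et al.'s approach, that fixed study effects restrict the comparison to arms of the same trial, can be sketched as follows. This is a toy within-study regression on invented numbers, not their actual model or data; the within-study demeaning is equivalent to fitting a separate intercept per study.

```python
from collections import defaultdict

def fixed_effect_slope(arms):
    """Least-squares slope of y on x with a separate intercept per study,
    computed by demeaning x and y within each study: only contrasts
    between arms of the same trial contribute, never across-trial
    comparisons. `arms` is a list of (study_id, x, y) tuples."""
    groups = defaultdict(list)
    for study, x, y in arms:
        groups[study].append((x, y))
    num = den = 0.0
    for pts in groups.values():
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        for x, y in pts:
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den

# invented two-arm trials: (study, dose intensity, log odds of response)
arms = [("A", 0.6, -0.8), ("A", 1.0, -0.4),
        ("B", 0.8, 0.1), ("B", 1.2, 0.5),
        ("C", 0.5, -1.0), ("C", 1.0, -0.5)]
print(fixed_effect_slope(arms))
```

With these numbers every within-trial slope is exactly 1, so the pooled fixed-effects slope is 1 despite large differences in response level across the three trials; a naive across-trial analysis of the same numbers would mix those level differences into the estimate, which is the bias Torri et al. sought to avoid.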
IV. DESIGNING TRIALS TO TEST THE "DOSE-INTENSITY HYPOTHESIS"

Authors critical of Bonadonna's (8) or Hryniuk's (13) methodology called for prospective (randomized) tests of the underlying hypothesis, as did Hryniuk (3) himself. However, it does not seem to have been widely appreciated how difficult this clinical research task actually is (19). There are two fundamental problems: the multitude of circumstances in which differences in dose intensity need to be tested and the difficulty of designing individual trials.

A. Settings for Testing
To test in clinical trials the hypothesis that the dose intensity of chemotherapy is an important determinant of the outcome of cancer treatment, there would need to be consensus on what constitutes convincing evidence for or against the hypothesis. Clearly, a single negative trial would not be sufficient, since it could always be argued that the clinical setting or the regimen used was not appropriate. Thus, to build a strong case against this hypothesis, a large number of trials of different regimens in different settings would have to be completed. Conversely, and perhaps less obviously, a positive trial, although it would demonstrate that the hypothesis is true in a specific circumstance, would not necessarily indicate that maximizing dose intensity in other situations would achieve the same result. As Siu and Tannock put it: "generalizations about chemotherapy dose intensification confuse the reality that the impact of this intervention is likely to be disease, stage, and drug specific" (20). It can be argued, in fact, that systematically attempting to test hypotheses of this type in clinical trials is unlikely to be productive, at least from the point of view of providing a basis for choosing treatment for patients (21). In any case, given the number of malignancies for which chemotherapy is used and the number of regimens in use in each of those diseases, the clinical research effort required to explore fully the implications of this hypothesis is enormous. It is perhaps not surprising that the issue is still unresolved (14,20).

B. Design of Individual Trials

Equally problematic is the design of a trial intended to demonstrate clearly the independent role of dose intensity. Dose intensity, as defined by Hryniuk and coworkers, is a rate, the numerator of which is the total dose given and the denominator the time over which it is delivered. Because of the mathematical relationship among the component variables (total dose, treatment duration, and dose intensity), a comparative study that holds one constant must necessarily vary the other two. That is, for two treatments that deliver the same total dose to differ in dose intensity, they must also differ in treatment duration. If these treatments produce different results in a randomized trial, it would not be possible to separate the effect of dose intensity from that of treatment duration.
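The arithmetic constraint just described is easy to make concrete. In the sketch below, the regimen numbers are invented (not taken from any trial); it simply shows that with total dose held fixed, a change in dose intensity forces a change in duration.

```python
# Dose intensity in Hryniuk's sense: total dose delivered divided by the
# time over which it is delivered (e.g., mg/m^2 per week).
# Regimen numbers are hypothetical, chosen only to illustrate the
# constraint  total_dose = dose_intensity * duration.

def dose_intensity(total_dose_mg_m2, duration_weeks):
    return total_dose_mg_m2 / duration_weeks

standard = dose_intensity(600, 18)   # conventional dose, conventional time
intense  = dose_intensity(600, 9)    # same total dose in half the time
low      = dose_intensity(300, 18)   # half the total dose, same duration

# Two arms with equal total dose but different intensity must differ in
# duration, so a two-arm trial cannot attribute a difference in outcome
# to intensity rather than to duration.
print(standard, intense, low)
```

Doubling intensity at fixed total dose halves duration, and halving total dose at fixed duration halves intensity, which is why a two-arm comparison always confounds at least two of the three quantities.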
In fact, a trial would need a minimum of four arms for each of these three factors to be held constant in turn while the other two are varied. Given the practical limitations on the extent to which existing chemotherapy regimens can be safely modified, designing such a trial is very challenging. Perhaps the best example of a successful attempt to separate the contributions of these variables is the Cancer and Leukemia Group B (CALGB) trial of 5-fluorouracil, cyclophosphamide, and doxorubicin (FAC) chemotherapy in the adjuvant treatment of women with node-positive breast cancer. This study compared three versions of FAC: a standard regimen that delivered conventional doses of the drugs over six cycles given every 3 weeks; a dose-intense regimen that delivered the same total dose in one half the time; and a low-dose regimen delivering one half the standard dose over the conventional six cycles. (The trial lacked a comparison in which dose intensity was held constant but treatment duration and total dose were varied.) The results of this study were updated recently (22). The two arms delivering the higher total dose demonstrated superior relapse-free and overall survival compared with the low-dose arm. However, there was no difference in outcome between the two "high-dose" arms. Thus, it is impossible to separate the impact of total dose from that of dose intensity. This trial also illustrates another feature of the trials that have been done in this area; namely, it has been much easier to show that lowering doses below conventional levels has an adverse effect on outcome than to demonstrate an advantage to increasing doses above conventional levels (20).

In my view (19), considering the difficulties outlined above, Siu and Tannock (20) are also correct that "further research (in this area) should concentrate on those disease sites and/or chemotherapeutic agents for which there is a reasonable expectation of benefit." Although the retrospective methods described above are inadequate for drawing causal conclusions, they may still be useful in pointing out where further trials will be most fruitful. Whether such studies should be designed to test a hypothesis, as in the case of the CALGB trial described above, or to test a regimen developed on the basis of that hypothesis, as in a recent National Cancer Institute of Canada Clinical Trials Group adjuvant trial (23), is, like the role of dose intensity itself, a subject for debate (13,19).
REFERENCES

1. Bonadonna G, Valagussa B. Dose-response effect of adjuvant chemotherapy in breast cancer. N Engl J Med 1981; 304:10–15.
2. Hryniuk W, Bush H. The importance of dose intensity in chemotherapy of metastatic breast cancer. J Clin Oncol 1984; 2:1281–1288.
3. Hryniuk W, Levine MN. Analysis of dose intensity for adjuvant chemotherapy trials in stage II breast cancer. J Clin Oncol 1986; 4:1162–1170.
4. Levin L, Hryniuk WM. Dose intensity analysis of chemotherapy regimens in ovarian carcinoma. J Clin Oncol 1987; 5:756–767.
5. DeVita VT. Dose-response is alive and well. J Clin Oncol 1986; 4:1157–1159.
6. Hryniuk WM, Figueredo A, Goodyear M. Applications of dose intensity to problems in chemotherapy of breast and colorectal cancer. Semin Oncol 1987; 4(suppl 4):3–11.
7. Gersheson DM, McGuire WP. Ovarian Cancer—Controversies in Management. New York: Churchill Livingstone, 1998.
8. Redmond C, Fisher B, Wieand HS. The methodologic dilemma in retrospectively correlating the amount of chemotherapy received in adjuvant therapy protocols with disease-free survival. Cancer Treat Rep 1983; 67:519–526.
9. Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumour response. J Clin Oncol 1983; 1:710–719.
10. The Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the coronary drug project. N Engl J Med 1980; 303:1038–1041.
11. Geller NL, Hakes TB, Petrone GR, Currie V, Kaufman R. Association of disease-free survival and percent of ideal dose in adjuvant breast chemotherapy. Cancer 1990; 66:1678–1684.
12. Henderson IC, Hayes DF, Gelman R. Dose-response in the treatment of breast cancer: a critical review. J Clin Oncol 1988; 6:1501–1513.
13. Gelman R, Neuberg D. Making cocktails versus making soup. J Clin Oncol 1991; 9:200–203.
14. Hryniuk W, Frei E, Wright FA. A single scale for comparing dose-intensity of all chemotherapy regimens in breast cancer: summation dose-intensity. J Clin Oncol 1998; 16:3137–3147.
15. L'Abbé KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med 1987; 107:224.
16. Meyer RM, Hryniuk WM, Goodyear MDE. The role of dose intensity in determining outcome in intermediate-grade non-Hodgkin's lymphoma. J Clin Oncol 1991; 9:339–347.
17. Sackett DL. Rules of evidence and clinical recommendations in the use of antithrombotic agents. Chest 1989; 95(suppl):2S–4S.
18. Torri V, Korn EL, Simon R. Dose intensity analysis in advanced ovarian cancer patients. Br J Cancer 1993; 67:190–197.
19. Pater JL. Introduction: implications of dose intensity for cancer clinical trials. Semin Oncol 1987; 14(suppl 4):1–2.
20. Siu LL, Tannock IP. Chemotherapy dose escalation: case unproven. J Clin Oncol 1997; 15:2765–2768.
21. Pater JL, Eisenhauer E, Shelley W, Willan A. Testing hypotheses in clinical trials. Experience of the National Cancer Institute of Canada Clinical Trials Group. Cancer Treat Rep 1986; 70:1133–1136.
22. Budman DR, Berry D, Cirrincione CT, Henderson IC, Wood WC, Weiss RB, et al. Dose and dose intensity as determinants of outcome in the adjuvant treatment of breast cancer. J Natl Cancer Inst 1998; 90:1205–1211.
23. Levine MN, Bramwell VH, Pritchard KL, Norris BD, Shepherd LS, Abu-Zahra H, et al. Randomized trial of intensive cyclophosphamide, epirubicin, and fluorouracil chemotherapy compared with cyclophosphamide, methotrexate, and fluorouracil in premenopausal women with node-positive breast cancer. J Clin Oncol 1998; 16:2651–2658.
26
Why Kaplan-Meier Fails and Cumulative Incidence Succeeds When Estimating Failure Probabilities in the Presence of Competing Risks

Ted A. Gooley, Wendy Leisenring, John Crowley, and Barry E. Storer
Fred Hutchinson Cancer Research Center, University of Washington, Seattle, Washington
I. INTRODUCTION
In many fields of medical research, time-to-event end points are used to assess the potential efficacy of a treatment. The outcome of interest associated with some of these end points may be a particular type of failure, and it is often of interest to estimate the probability of this failure by a specified time among a particular population of patients. For example, the occurrence of end-stage renal failure is an outcome of interest among patients with insulin-dependent diabetes mellitus (IDDM). Given a sample drawn from the population of patients with IDDM, one may therefore wish to obtain an estimate of the probability of developing end-stage renal failure. As other examples, consider the probability of death due to prostate cancer among patients afflicted with this disease and the probability of the occurrence of cytomegalovirus (CMV) retinitis among patients with AIDS.
The examples above share two features. First, the outcome of interest in each is a time-to-event end point; that is, one is interested not only in the occurrence of the outcome but also in the time at which it occurs. Second, multiple modes of failure exist for each of the populations considered, namely, failure from the outcome of interest plus other types whose occurrence precludes the failure of interest from occurring. In the IDDM example, death without end-stage renal failure is a type of failure in addition to the failure of interest, and if a patient with IDDM dies without renal failure, this failure precludes the outcome of interest (end-stage renal failure) from occurring. For the prostate cancer example, deaths from causes unrelated to this cancer are types of failure in addition to death from the index cancer, and the occurrence of these alternate failure types precludes the outcome of interest (death from prostate cancer) from occurring. Finally, patients with AIDS who die without CMV retinitis are not capable of going on to develop CMV retinitis, that is, of failing from the cause of interest. In each of these examples, a competing type of failure exists in addition to the failure of interest. These competing causes of failure are referred to as competing risks for the failure of interest. Specifically, we define a competing risk as an event whose occurrence precludes the occurrence of the failure type of interest.

The method due to Kaplan and Meier (1) was developed to estimate the probability of an event for time-to-event end points, but the assumptions required to make the resulting estimate interpretable as a probability are not met when competing risks are present. Nonetheless, this method is often used, and the resulting estimate is misinterpreted as representing the probability of failure in the presence of competing risks.
Statistical methods for obtaining an estimate that is interpretable in this way are not new (2–9), and the topic has also received some attention in medical journals (10–13). We refer to such an estimate as a cumulative incidence (CI) estimate (3), although it has also been referred to as the cause-specific failure probability, the crude incidence curve, and the cause-specific risk (14). Similarly, the term cumulative incidence has been used for various purposes. Our use of the term, however, will be consistent with its interpretation as an estimate of the probability of failure.

Despite its recognition in both the statistical and the medical literature as the appropriate tool, CI is not uniformly used in medical research for purposes of estimation in settings in which competing risks exist (15). We believe the reason is a lack of complete understanding of the Kaplan-Meier method, together with a lack of understanding of how CI is calculated and hence of the resulting difference between the two estimators. In this article, we present, in a nontechnical fashion, a description of the Kaplan-Meier estimate that is not commonly seen. We believe this expression is useful for understanding why the Kaplan-Meier method results in an estimate that is not interpretable as a probability when used in the presence of competing risks. In addition, this alternate characterization will be extended in a way that allows us to represent CI in a manner similar to that used to obtain the estimate from the Kaplan-Meier method; in so doing, the validity of CI and the difference between the estimators will be clearly demonstrated.

In the next section, we describe the estimate associated with the Kaplan-Meier method in the alternate manner alluded to above, in the setting in which competing risks do not exist. The discussion reviews the concept of censoring and provides a heuristic description of the impact that a censored observation has on the estimate. An example based on hypothetical data is used to illustrate the concepts discussed. The subsequent section contains a description of how the two estimates are calculated when competing risks are present, utilizing the description of censoring provided in the preceding section. Data from a clinical trial are then used to calculate each estimate for the end point of disease progression among patients with head and neck cancer, providing a further demonstration of the concepts discussed previously. We close with a discussion that summarizes and presents conclusions and recommendations.
II. ESTIMATION IN THE ABSENCE OF COMPETING RISKS: KAPLAN-MEIER ESTIMATE

For time-to-event data without competing risks, each patient under study will either fail or survive without failure to last contact. We use "failure" here and throughout as a general term; the specific type of failure depends on the end point analyzed. A patient without failure at last contact is said to be censored due to lack of follow-up beyond this time; that is, it is known that such a patient has not failed by last contact, but failure could occur at a later time.

The most reasonable and natural estimate of the probability of failure by a prespecified point in time is the simple ratio of the number of failures divided by the total number of patients, provided all patients without failure have follow-up to this time. This simple ratio is appropriately interpreted as an estimate of the probability of failure. The estimate is not only intuitive but also unbiased when all patients who have not failed have follow-up through the specified time; unbiasedness is a desirable statistical property for estimators to possess. If one or more patients are censored before the specified time, the simple ratio is no longer adequate, and methods that take into account data from the censored patients are required to obtain an estimate consistent with this ratio. The method due to Kaplan and Meier was developed for precisely this purpose, and when competing risks are not present, it leads to an estimate that is consistent with the desired simple ratio. The resulting estimate is also exactly equal to this ratio when all patients have either failed or been followed through the specified follow-up time. When used to estimate the probability of failure, one uses the complement of the Kaplan-Meier estimate, which we shall denote by 1-KM, where the Kaplan-Meier estimate (KM) represents the probability of surviving without failure. 1-KM can be interpreted as an estimate of the probability of failure when competing risks are not present.

To appreciate how data from patients without follow-up to the specified time are incorporated into 1-KM, it is necessary to understand how censored observations are handled computationally. We present below a heuristic description of censoring that is not commonly seen. We believe that this approach leads to a clear understanding of what 1-KM represents. In addition, this alternate explanation is used in the following section to explain why 1-KM fails as a valid estimate of the probability of failure and to highlight how and when 1-KM and CI differ when used in the presence of competing risks. What follows is a nontechnical description; the interested reader can find a detailed mathematical treatment elsewhere (16).

Note that the probability of failure depends on the time point at which the associated estimate is desired, and as failures occur throughout time, the estimate will increase with each failure. Because an estimate that is consistent with the simple ratio described above is desired, any estimate that achieves this goal will change if and only if a patient fails. Moreover, if it is assumed that all patients under study are equally likely to fail, it can be shown that each failure contributes a prescribed and equal amount to the estimate, provided all patients have either failed or been followed to a specified time point. This prescribed amount is simply the inverse of the total number of patients under study. If a patient is censored prior to the time point of interest, however, failure may occur at a time beyond that at which censoring occurred, and this information must be taken into account to obtain an estimate consistent with the desired simple ratio.
One way to view the manner in which censored observations are handled is as follows. As stated above, each patient under study possesses a potential contribution to the estimate of the probability of failure, and each time a patient fails, the estimate is increased by the amount of the contribution of the failed patient. Since patients who are censored due to lack of follow-up through a specified time remain capable of failure by this time, however, the potential contribution of these patients cannot be discounted. In particular, one can consider the potential contribution of a censored patient as being redistributed among all patients known to be at risk of failure beyond the censored time, as noted by Efron (17). It can be shown that this "redistribution to the right" makes the resulting estimate consistent with the simple ratio and therefore interpretable as an estimate of the probability of failure. Because of this redistribution, any failure that takes place after a censored observation contributes slightly more to the estimate than do failures prior to the censored observation; that is, the potential contribution of a patient to the estimate increases after the occurrence of a censored observation. Another way to understand the impact of censoring is to consider such an observation as a projected failure or survivor to the time at which the next failure occurs, where this projection is based on the experience of patients who have survived to the time at which censoring occurred.

To help illustrate the above discussion, consider the following hypothetical example, summarized in Table 1. Suppose that a selected group of 100 patients with IDDM is being followed for development of end-stage renal failure and that none of these 100 dies without failure (i.e., no competing-risk events occur). Assume that all patients have complete follow-up to 15 years after diagnosis of IDDM and 5 of the 100 have end-stage renal failure by this time. The 15-year estimate of renal failure is therefore 5 × (1/100) = 0.05, where 1/100 is the potential contribution of each patient to the estimate of the probability of failure. Suppose the remaining 95 patients survive without renal failure to 20 years, but 25 of these 95 have follow-up only to 20 years (i.e., each of these 25 patients is censored at 20 years). Since each censored patient could develop renal failure beyond 20 years, the potential contribution of each cannot be discounted. In particular, the potential contribution to the estimate for each can be thought of as being uniformly redistributed among the 70 patients known to be at risk of failure beyond the censored time. In other words, each remaining patient's potential contribution is increased over 1/100 by 25 × (1/100) divided among the 70 remaining patients, or (1/100) + 25 × (1/100) × (1/70) = 0.0136. Suppose 14 of these 70 go on to develop renal failure by 30 years, so that a total of 19 patients are known to have failed by this time. Since 25 patients had follow-up of less than 30 years, however, the estimate of failure by this time should be larger than 19/100 = 0.19; 5 patients fail whose contribution to the estimate is 0.01, and 14 fail whose contribution is 0.0136.

Table 1  Hypothetical Data to Illustrate Concept of Redistribution to the Right

Event time    No. known to   No. known failures (total     No.        Contribution of                   Incidence
(in years)*   be at risk†    no. known failures by time)‡  censored§  next failure¶                     estimate**
0             100            0                             0          1/100 = 0.01                      0
15            95             5 (5)                         0          0.01                              5(1/100) = 0.05
20            70             0 (5)                         25         0.01 + 25(0.01)(1/70) = 0.0136    0.05
30            56             14 (19)                       —          —                                 0.05 + 14(0.0136) = 0.24

* Denotes the time at which various events occurred. An event is either a failure or a censored observation.
† Denotes the number of patients who have survived to the associated time, i.e., the number at risk of failure beyond this time.
‡ Denotes the number of patients who failed at the associated time. Shown parenthetically is the total number of known failures by the associated time.
§ Denotes the number of patients censored at the associated time. Each of these patients could fail at a later time, so the potential contribution to the estimate due to these patients is redistributed evenly among all patients known to be at risk beyond this time.
¶ Denotes the amount that each failure occurring after the associated time will contribute to the estimate.
** Gives the estimate of the probability of failure by the associated time. This is derived by multiplying the number of failures at each time by the associated contribution and summing the results, or by summing the number of known failures and the number of projected failures and dividing by 100, i.e., the number initially at risk.
The estimate consistent with the desired simple ratio is therefore 5 × (0.01) + 14 × (0.0136) = 0.24. An alternate way to understand this is that since 20% (14/70) of the patients known to be at risk of renal failure beyond 20 years failed by 30 years, it is reasonable to assume that 20% of the 25 patients censored at 20 years will fail by 30 years. This leads to a projected total of 24 failures (5 + 14 known failures + 5 projected failures) and a 30-year estimate of renal failure of 24/100 = 0.24.
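The redistribution arithmetic of this example is compact enough to script. The sketch below simply retraces the numbers in Table 1 and checks that the redistributed-contribution estimate agrees with the complement of the Kaplan-Meier product-limit calculation.

```python
# Hypothetical IDDM example from the text (no competing risks):
# 100 patients, 5 renal failures by 15 years, 25 censored at 20 years,
# and 14 further failures among the 70 still at risk by 30 years.
n = 100

estimate_15 = 5 * (1 / n)                 # all 100 followed to 15 years

# Redistribute the 25 censored patients' potential contributions (1/100
# each) evenly over the 70 patients known to be at risk beyond 20 years.
contribution_after_20 = (1 / n) + 25 * (1 / n) * (1 / 70)

estimate_30 = estimate_15 + 14 * contribution_after_20

# The same number via the Kaplan-Meier product-limit formula:
km_survival = (1 - 5 / 100) * (1 - 14 / 70)
one_minus_km = 1 - km_survival

print(round(estimate_15, 4), round(estimate_30, 4), round(one_minus_km, 4))
```

Both routes give 0.05 at 15 years and 0.24 at 30 years, matching the projected-failure argument (24 actual plus projected failures out of 100).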
III. ESTIMATION IN THE PRESENCE OF COMPETING RISKS: CUMULATIVE INCIDENCE VERSUS KAPLAN-MEIER

Consider now the situation in which competing risks exist. In this setting, three outcomes are possible for each patient under study: failure from the event of interest, failure from a competing risk, or survival without failure to last contact. Here, 1-KM is not capable of appropriately handling failures from a competing risk because, in its calculation, patients who fail from a competing risk are treated in the same manner as patients who are censored. Patients who have not failed by last contact retain the potential to fail, however, whereas patients who fail from a competing risk do not. As a result, failures from the event of interest that occur after failures from a competing risk contribute more to the estimate of 1-KM than is appropriate, as will be demonstrated in the example below. This overestimation results from the fact that the potential contribution of a patient who failed from a competing risk, and who is therefore not capable of a later failure, is redistributed among all patients known to be at risk of failure beyond this time. This redistribution inflates the estimate above what it should be; that is, 1-KM is not consistent with the desired simple ratio that is an appropriate estimate of this probability.

An alternative to 1-KM in the presence of competing risks is the CI estimate. This estimate is closely related to 1-KM, and patients who are censored due to lack of follow-up are handled exactly as in the calculation of 1-KM. Failures from a competing risk, however, are dealt with in a manner appropriate for obtaining an estimate interpretable as a probability of failure. In the calculation of CI, patients who fail from a competing risk are correctly assumed to be unable to fail from the event of interest beyond the time of the competing-risk failure. The potential contribution to the estimate for such patients is therefore not redistributed among the patients known to be at risk of failure; that is, failures from a competing risk are not treated as censored, as they are in the calculation of 1-KM. The difference between 1-KM and CI, therefore, comes about from the way in which failures from a competing risk are handled. If there are no failures from a competing risk, 1-KM and CI will be identical.
If failures from a competing risk exist, however, 1-KM is always larger than CI at and beyond the time of the first failure from the event of interest that follows a competing-risk failure.

Returning to the above example on end-stage renal failure, suppose that competing-risk events do occur (i.e., there are deaths without renal failure). Suppose all assumptions are the same as before, with the exception that the 25 patients censored above instead die without renal failure at 20 years. The estimate 1-KM at 30 years is the same as previously, because these 25 patients are treated as censored at 20 years. Since competing risks have occurred, however, CI is the appropriate estimate, and the 25 patients who die without renal failure should not be treated as censored. In this simple example, all patients have complete follow-up to 30 years; that is, each has either developed renal failure, died without failure, or survived without failure to 30 years. The estimate of the probability of renal failure by 30 years should therefore be the number of failures divided by the number of patients, or 19/100 = 0.19 (i.e., the desired simple ratio). Due to the inappropriate censoring of the patients who died without renal failure, however, 1-KM = 0.24, an estimate that is not interpretable as the probability of end-stage renal failure.
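The contrast in this modified example can be verified in a few lines. This is a sketch of the arithmetic only (general-purpose estimators handle arbitrary event times); the counts are those of the example: 5 renal failures at 15 years, 25 deaths without renal failure at 20 years, and 14 renal failures at 30 years among 100 patients.

```python
# 1-KM treats the 25 competing deaths at 20 years as censored, so its
# survival factors come only from the failures of interest:
one_minus_km = 1 - (1 - 5 / 100) * (1 - 14 / 70)      # inflated estimate

# CI accumulates, at each failure time, (probability of being event-free
# just before that time) * (failures of interest / number at risk):
event_free = 1.0
ci = 0.0
for at_risk, failures, competing in [(100, 5, 0), (95, 0, 25), (70, 14, 0)]:
    ci += event_free * failures / at_risk
    event_free *= 1 - (failures + competing) / at_risk

# With complete follow-up to 30 years, CI equals the simple ratio 19/100,
# while 1-KM overstates the probability of renal failure.
print(round(ci, 2), round(one_minus_km, 2))
```

The loop makes the difference explicit: competing deaths reduce the event-free probability that multiplies later failures in CI, whereas 1-KM lets later failures claim the redistributed contributions of patients who can never fail.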
This simple example illustrates how 1-KM and CI differ when competing risks are present. It also demonstrates why treating patients who fail from a competing risk as censored leads to an estimate (i.e., 1-KM) that cannot be validly interpreted as a probability of failure. In general, the calculation of 1-KM and CI is more involved than shown in the above example due to a more complex combination of event times, but the concepts detailed above are identical.
IV. EXAMPLE FROM REAL DATA: SQUAMOUS CELL CARCINOMA

To further illustrate the differences between 1-KM and CI, consider the following example taken from a phase III Southwest Oncology Group clinical trial. The objectives of this study were to compare the response rates, treatment failure rates, survival, and patterns of treatment failure between two treatments for patients with advanced-stage resectable squamous cell carcinoma of the head and neck (18). A conventional treatment (surgery and postoperative radiotherapy) and an experimental treatment (induction chemotherapy followed by surgery and postoperative radiotherapy) were considered. We use data from this clinical trial among patients treated with the conventional treatment to calculate both 1-KM and CI for the outcome of disease progression.

Among 175 patients entered into the study, 17 were ruled ineligible. Of the 158 eligible patients, 76 were randomized to receive the conventional treatment; 32 of these had disease progression, whereas 37 died without progression. Therefore, 32 of 76 patients failed from the event of interest (disease progression), whereas 37 of 76 patients failed from the competing risk of death without progression. The remaining seven patients were alive without progression at last follow-up and were therefore censored. Each of the seven censored patients had follow-up to at least 7.0 years, and all cases of progression occurred prior to this time. All patients therefore have complete follow-up through 7.0 years after randomization, so the natural estimate of the probability of progression by 7.0 years is 32/76 = 42.1%, precisely the value of CI (Fig. 1). On the other hand, the value of 1-KM at this time is 51.6%, as shown in Figure 1, the discrepancy being due to the difference in the way that patients who failed from the competing risk of death without progression are handled.

Figure 1  The complement of the Kaplan-Meier estimate (1-KM) and the cumulative incidence estimate (CI) of disease progression among 76 patients with head and neck cancer. The numerical values of each estimate are indicated.
V. DISCUSSION
We have shown that when estimating the probability of failure for end points that are subject to competing risks, 1-KM and CI can yield different estimates. If it is of interest to estimate the probability of a particular event that is subject to competing risks, 1-KM should never be used, even if the number of competing-risk events is relatively small (and the two estimates therefore not much different). It is not clear what 1-KM represents in such situations. The only way to interpret 1-KM in this setting is as the probability of failure from the cause of interest in the hypothetical situation in which there are no competing risks and the risk of failure from the cause of interest remains unchanged when the competing risks are removed. Because of the way the Kaplan-Meier method handles failures from a competing risk, 1-KM will not be a consistent estimate of the probability of failure from the cause of interest. The discrepancy between 1-KM and CI depends on the timing and frequency of the failures from a competing risk; the earlier and more frequent the occurrences of such events, the larger the difference. Regardless of the magnitude of this difference, however, reporting 1-KM in such situations, if it is interpreted as an estimate of the probability of failure, is incorrect and can be very misleading.

The wide availability of statistical software packages that calculate KM estimates but do not directly calculate CI estimates undoubtedly contributes to the frequent misuse of 1-KM. Although we have not seen the CI estimate offered in any commercial software package, its calculation is not computationally difficult, and programs that accomplish it are reasonably straightforward to write.

The focus of this article was to demonstrate that the methods discussed above lead to different estimates in the presence of competing risks, even though each is intended to measure the same quantity. Nonetheless, we emphasize that presenting only results describing the probability of the event of interest falls short of what should be examined if one is to appreciate fully the factors that affect the outcome. When analyzing and presenting data where competing risks occur, it is therefore important to describe the probabilities of failure not only from the event of interest but also from competing-risk events. One approach is to present an estimate of the time to the minimum of the different types of failure. For a discussion of related topics, see Pepe et al. (10).

We focused purely on the estimation of the probability of failure in this chapter. It is often of interest, however, to compare outcomes between two treatment groups. How such comparisons are made and exactly what is compared can be complicated issues and have not been addressed here. Such topics are beyond the scope of this chapter but have been addressed in previous work (19,20). If estimation is the goal and competing risks exist, however, the use of 1-KM is inappropriate and yields an estimate that is not interpretable. In such situations, CI should always be used for purposes of estimation.
ACKNOWLEDGMENT

Supported by grants CA 18029, CA 38296, and HL 36444 awarded by the National Institutes of Health.
REFERENCES

1. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53:457–481.
2. Aalen O. Nonparametric estimation of partial transition probabilities in multiple decrement models. Ann Stat 1978; 6:534–545.
3. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: John Wiley, 1980.
4. Prentice RL, Kalbfleisch JD. The analysis of failure times in the presence of competing risks. Biometrics 1978; 34:541–554.
5. Benichou J, Gail MH. Estimates of absolute cause-specific risk in cohort studies. Biometrics 1990; 46:813–826.
6. Dinse GE, Larson MG. A note on semi-Markov models for partially censored data. Biometrika 1986; 73:379–386.
7. Lawless JF. Statistical Models and Methods for Lifetime Data. New York: John Wiley, 1982.
8. Gaynor JJ, Feuer EJ, Tan CC, Wu DH, Little CR, Straus DJ, et al. On the use of cause-specific failure and conditional failure probabilities: examples from clinical oncology data. J Am Stat Assoc 1993; 88:400–409.
9. Pepe MS, Mori M. Kaplan-Meier, marginal or conditional probability curves in summarizing competing risks failure time data? Stat Med 1993; 12:737–751.
10. Pepe MS, Longton G, Pettinger M, Mori M, Fisher LD, Storb R. Summarizing data on survival, relapse, and chronic graft-versus-host disease after bone marrow transplantation: motivation for and description of new methods. Br J Haematol 1993; 83:602–607.
11. McGiffin DC, Naftel DC, Kirklin JK, Morrow WR, Towbin J, Shaddy R, et al. Predicting outcome after listing for heart transplantation in children: comparison of Kaplan-Meier and parametric competing risk analysis. J Heart Lung Transplant 1997; 16:713–722.
12. Caplan RJ, Pajak TF, Cox JD. Analysis of the probability and risk of cause-specific failure. Int J Radiat Oncol Biol Phys 1994; 29:1183–1186.
13. Gelman R, Gelber R, Henderson C, Coleman CN, Harris JR. Improved methodology for analyzing local and distant recurrence. J Clin Oncol 1990; 8:548–555.
14. Cheng SC, Fine JP, Wei LJ. Prediction of cumulative incidence function under the proportional hazards model. Biometrics 1998; 54:219–228.
15. Niland JC, Gebhardt JA, Lee J, Forman SJ. Study design, statistical analyses, and results reporting in the bone marrow transplantation literature. Biol Blood Marrow Transplant 1995; 1:47–53.
16. Gooley TA, Leisenring W, Crowley JC, Storer BE. Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Stat Med 1999; 18:695–706.
17. Efron B. The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, IV. New York: Prentice-Hall, 1967, pp. 831–853.
18. Schuller DE, Metch B, Stein DW, Mattox D, McCracken JD. Preoperative chemotherapy in advanced resectable head and neck cancer: final report of the Southwest Oncology Group. Laryngoscope 1988; 98:1205–1211.
19. Pepe MS. Inference for events with dependent risks in multiple endpoint studies. J Am Stat Assoc 1991; 86:770–778.
20. Gray RJ. A class of k-sample tests for comparing the cumulative incidence of a competing risk. Ann Stat 1988; 16:1141–1154.
27
Meta-Analysis

Luc Duchateau and Richard Sylvester
European Organization for Research and Treatment of Cancer, Brussels, Belgium
I. INTRODUCTION
Individual cancer clinical trials are often not powerful enough to provide a definitive answer to the question of interest. However, data may be available from several different trials that have studied the same or a similar question. A meta-analysis (overview) is the process whereby the quantitative results of separate (but similar) studies are combined using formal statistical techniques in order to make use of all the available information. Due to the larger sample size, this provides a more powerful test of the hypothesis and increased precision (lower variance) of the estimated treatment effect. Meta-analyses are often carried out when the individual trials addressing a given question of interest are inconclusive or when there are conflicting results from the various studies (1). It is important that the clinical question addressed in a meta-analysis be neither too broad nor too narrow. If the clinical question is too broad, the overall results might just be an average of the effects of different treatment regimens and/or patient populations and not reflect any real effect in a specific treatment regimen and/or patient population. On the other hand, if the clinical question is too narrow, few eligible trials might be found and the results might not be of general interest. As compared with a review of the literature, in which personal judgment may play a role in drawing conclusions (2), a meta-analysis is more objective since it is quantitative in nature. It is important, though, in order to ensure objectivity, that a protocol be written describing how the meta-analysis will be performed, for instance, clearly stating the trial eligibility criteria and the statistical methods that will be used.
II. TYPES OF META-ANALYSES

There are three different types of meta-analyses. They may be based on the following:

1. The literature (MAL): A literature search is undertaken to find all relevant publications. The results from these publications are then combined based on the information available in the publication (e.g., p value, logrank statistic, number of events in the treated and control groups, etc.).
2. Summary data (MAS): Again, all relevant publications are identified, and a summary of the relevant statistics (e.g., Kaplan-Meier survival estimates in the treated and control groups at a specified time, or the logrank O, E, and V) is obtained from the authors of the publication.
3. Individual patient data (MAP): A search is made not only in the literature for published trials but also in the scientific community for unpublished trials. For all trials, whether published or not, individual patient data are obtained from the investigator. For each patient, data are requested on the end point of interest, for example, the exact date of death or censoring, along with additional information on the patient's treatment and prognostic factors.
Performing a MAL in the field of cancer is often difficult since the time to an event, such as the duration of survival or the time to progression, is generally the main end point (3). The information that can be obtained from the different publications varies greatly, from the logrank statistic and Kaplan-Meier curves to virtually no information except the number of patients randomized to treatment and control. It is not always clear how this information can be combined for an overall test, and in most cases it is not feasible to perform a genuine time-to-event analysis. MAS is an alternative that could be considered. It shares, however, weaknesses similar to those of MAL. First, there are two important sources of bias in both MAL and MAS, but not in MAP (4):

1. Publication bias, which is caused by the fact that trials showing a statistically significant treatment effect are more likely to be published than ‘‘negative’’ trials, thus leading to an overestimation of the size of the difference.
2. Selection bias, which is caused by the fact that some patients may have been excluded from the analysis presented in the publication for reasons that are treatment related. These patients are not included in a MAL or MAS, whereas a MAP is based on all randomized patients.

Second, as detailed by Stewart and Clarke (5), meta-analysis of individual patient data presents numerous other advantages as compared with MAL and MAS:

1. Quality control of the individual patient data. For instance, the randomization sequence can be verified so that trials that were not properly randomized, and thus could be biased (6), can be excluded from the MAP.
2. Updated follow-up data can be obtained from the investigator, leading to a more powerful analysis as compared with MAL.
3. Subgroup and prognostic factor analyses can be carried out. These are only feasible in MAP, as individual patient data are needed for this purpose.

For the above reasons, MAP is the only type of meta-analysis that can be recommended in oncology, even though it is much more time consuming than the other two types. In the remainder of this chapter, it is mainly MAP that is discussed.
III. STATISTICAL TECHNIQUES FOR META-ANALYSES

A. General Principles for Combining Data

Historically, one of the first techniques for combining data from several experiments was proposed by Fisher (7) and was based on the combination of p values. Statistical techniques that combine either p values or standardized test statistics such as the chi-square statistic should be avoided in meta-analyses. Such techniques do not take into consideration the size of the trial (number of events, total person-years of follow-up, etc.), the precision of the estimate (variance), or even the direction of the difference in the case of a two-sided test. A trial of 20 patients with a significant p value will thus contribute as much as a trial of 200 patients with the same p value. Another extreme and possibly misleading solution to the problem of combining data from different studies is to disregard the fact that the data come from different trials. An example can easily show how this might lead to incorrect conclusions. Assume that the results of two trials, expressed as the number of patients dead and alive in the treated and control groups, are to be combined. The data are summarized in Table 1.
Table 1  Data From Two Trials With Binary Outcomes

                 Dead      Alive          Total
Trial 1
  Treatment       18       42 (70%)        60
  Control         36       84 (70%)       120
  Total           54      126 (70%)       180
Trial 2
  Treatment       84       36 (30%)       120
  Control         42       18 (30%)        60
  Total          126       54 (30%)       180
Within each trial the survival is the same in the treated and the control group, but the overall survival in the first trial is 70% as compared with 30% in the second trial. Thus, there is no difference at all between treatment and control. The combination of the data from these two trials is shown in Table 2. For the combined data, the survival in the treated group is only 43% as compared with 57% for the control group, a difference that is significant at the 1% level. This paradoxical result, better known as Simpson's paradox (8), is caused by the fact that relatively more patients in the treated group come from the study with a low survival rate. Thus, in combining results from different trials, the data should not simply be pooled and an analysis done on the pooled data. Rather, a statistic that compares the treatment and control groups must be calculated for each trial. Subsequently, these statistics are combined, weighting them by their precision.

Table 2  Data From Two Trials Combined Together

Pooled data      Dead      Alive          Total
  Treatment      102       78 (43%)       180
  Control         78      102 (57%)       180
  Total          180      180 (50%)       360

B. Whitehead's Approach

A general parametric approach has been proposed by Whitehead and Whitehead (9). Assume that the results of I trials have to be combined. The true treatment effect in the ith trial is given by τ_i and an estimate of τ_i by τ̂_i, with asymptotic variance ν_i. The measure of treatment effect is taken such that no treatment effect means that τ_i = 0 (e.g., log odds ratio, log hazard ratio). If it is assumed that asymptotically for each trial

    τ̂_i ∼ N(τ_i, ν_i)                                                    (1)

then under the null hypothesis of no treatment effect, τ_i = 0,

    τ̂_i ν_i^{-1} ∼ N(0, ν_i^{-1})

As this is true for each trial and the trial results are independent of each other, we have

    Σ_{i=1}^{I} τ̂_i ν_i^{-1} ∼ N(0, Σ_{i=1}^{I} ν_i^{-1})

Under the null hypothesis we thus have

    X = (Σ_{i=1}^{I} τ̂_i ν_i^{-1})² / Σ_{i=1}^{I} ν_i^{-1} ∼ χ²_1        (2)

Large values of X lead to rejection of the null hypothesis. To estimate the ‘‘overall’’ or ‘‘typical’’ effect, the assumption is made that the treatment effect is the same in each of the trials (τ_1 = ⋯ = τ_I = τ). Then

    Σ_{i=1}^{I} τ̂_i ν_i^{-1} ∼ N(τ Σ_{i=1}^{I} ν_i^{-1}, Σ_{i=1}^{I} ν_i^{-1})

and an unbiased estimate is given by

    τ̂ = Σ_{i=1}^{I} τ̂_i ν_i^{-1} / Σ_{i=1}^{I} ν_i^{-1}                 (3)

Finally, an approximate maximum likelihood estimate for τ_i that is asymptotically normally distributed (10) can be obtained by taking the ratio of the score of the likelihood to the Fisher information, both evaluated at τ_i = 0. The approximate maximum likelihood estimates for the odds ratio and the hazard ratio obtained using this method are given in the next section.
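Expressions (1)-(3) amount to a few lines of code. The following sketch (ours, not the chapter's) implements the inverse-variance combination and applies it to the per-trial log hazard ratios of the bladder cancer example discussed later in this chapter, where τ̂_i = (O − E)_i/V_i with variance V_i^{-1}.

```python
import math

# Sketch (ours): fixed-effect inverse-variance pooling, expressions (1)-(3).
def combine(tau_hats, nus):
    """Return the overall estimate of expression (3) and the X statistic of (2)."""
    w = [1 / v for v in nus]                       # weights = inverse variances
    sum_wt = sum(t * wi for t, wi in zip(tau_hats, w))
    tau = sum_wt / sum(w)                          # overall effect, expression (3)
    X = sum_wt ** 2 / sum(w)                       # chi-square statistic, expression (2)
    return tau, X

# Hazard-ratio statistics from the four bladder cancer trials (Table 3):
# tau_hat_i = (O - E)_i / V_i, with variance 1 / V_i.
oe = [-5.4, -3.9, -9.4, -0.9]
vv = [15.1, 17.4, 8.2, 14.7]
tau, X = combine([o / v for o, v in zip(oe, vv)], [1 / v for v in vv])
hr = math.exp(tau)                                 # overall hazard ratio, about 0.70
```

The weights simplify to the V_i themselves, so tau reduces to Σ(O − E)_i / ΣV_i, exactly the combined estimate used throughout this chapter.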
C. Testing the Hypothesis of No Treatment Effect and Estimating the Overall Treatment Effect

Most end points in cancer clinical trials are binary (presence or absence of an event) or involve the time to an event (duration of survival, time to recurrence). The next two sections focus on meta-analysis techniques for each of these two types of end points.

1. Binary Data

For binary data, the most common measure of treatment effect is the odds ratio, given by

    OR = [π_t / (1 − π_t)] / [π_c / (1 − π_c)]

where π_t (π_c) is the probability of an event, for example of dying, in the treated (control) group. An approximate maximum likelihood estimate for the log odds ratio of a specific trial is given by

    log(OR) = [D_t − N_t (D/N)] / [D(N − D) N_t N_c / (N² (N − 1))]

where D_t is the number of deaths in the treated group, N_t (N_c) is the number of patients in the treated (control) group, D is the total number of deaths, and N is the total sample size. The asymptotic variance of this estimate is given by

    Var(log(OR)) = 1 / [D(N − D) N_t N_c / (N² (N − 1))]

Note that the estimate and the variance of the estimate have the same denominator. To simplify the notation, let V be this denominator. The numerator of the estimate of log(OR) can be expressed as (O − E), where O = D_t is the observed number of events in the treated group and E is the expected number of events in the treated group if there is no difference between the treated and the control groups. Therefore,

    log(OR) = (O − E) / V

and asymptotically [see expression (1)]

    log(OR) ∼ N(log(OR), V^{-1})

To make a distinction between the different trials, we use a subscript i, so that the statistics for the ith trial are given by (O − E)_i, V_i, OR_i, and log(OR_i). Using expression (2), the overall test statistic is

    X = (Σ_{i=1}^{I} log(OR_i) V_i)² / Σ_{i=1}^{I} V_i = (Σ_{i=1}^{I} (O − E)_i)² / Σ_{i=1}^{I} V_i ∼ χ²_1

and an estimate of the overall effect log(OR) [see expression (3)] is given by

    log(OR) = Σ_{i=1}^{I} log(OR_i) V_i / Σ_{i=1}^{I} V_i = Σ_{i=1}^{I} (O − E)_i / Σ_{i=1}^{I} V_i

with 95% confidence interval

    log(OR) ± 1.96 (Σ_{i=1}^{I} V_i)^{-1/2}

The estimate and the confidence interval for OR can be found by taking the exponential of log(OR) and of the lower and upper limits of its confidence interval:

    OR = exp(log(OR))

with 95% confidence interval

    [exp(log(OR) − 1.96 (Σ_{i=1}^{I} V_i)^{-1/2}); exp(log(OR) + 1.96 (Σ_{i=1}^{I} V_i)^{-1/2})]

An additional parameter of interest is the percent reduction in the odds of the event in the treated group as compared with the control group, given by

    100(1 − OR) = 100 [(π_c /(1 − π_c)) − (π_t /(1 − π_t))] / (π_c /(1 − π_c))

which can thus be estimated by 100(1 − OR). Its variance is given by

    Var(100(1 − OR)) = (100(1 − OR))² / [(Σ_{i=1}^{I} (O − E)_i)² / Σ_{i=1}^{I} V_i]

with 95% confidence interval

    100(1 − OR) ± 1.96 (Var(100(1 − OR)))^{1/2}
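As a sketch (ours, not the chapter's), the per-trial statistics (O − E) and V can be computed directly from the 2 × 2 counts. Applied to the first trial of the example in the next section (trial 30751: 82 treated patients with 55 events, 35 controls with 26 events), it reproduces the tabulated odds-ratio statistics.

```python
# Sketch (ours): per-trial odds-ratio statistics (O - E, V) from 2x2 counts.
def oe_v(n_t, d_t, n_c, d_c):
    """Return (O - E, V) for one trial; O = D_t, events observed in the treated group."""
    n, d = n_t + n_c, d_t + d_c
    e = n_t * d / n                                # expected treated events under H0
    v = d * (n - d) * n_t * n_c / (n ** 2 * (n - 1))
    return d_t - e, v

oe, v = oe_v(n_t=82, d_t=55, n_c=35, d_c=26)       # trial 30751, patients aged 60-70
log_or = oe / v                                    # approximate ML log odds ratio
```

Rounding (O − E, V) to two decimals gives (−1.77, 5.27), matching the first row of Table 3.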
2. Time-to-Event Data

The analysis of time-to-event (duration of survival) data is more complex, and additional notation is needed. Assume that the survival density function is given by f_t(t) (f_c(t)) for the treated (control) group. The survival distribution function, expressing the probability of surviving beyond time t, is given by S_t(t) (S_c(t)) for the treated (control) group. The hazard function, the instantaneous death rate or the conditional probability of dying immediately after time t given survival up to time t, for the treated (control) group is defined by

    λ_t(t) = f_t(t)/S_t(t)    (λ_c(t) = f_c(t)/S_c(t))

In the analysis of survival data, it is often assumed that the hazard functions of the treated and the control group are proportional to each other over time. This does not mean that the functions themselves are constant over time. The proportionality factor is given by the hazard ratio

    HR = λ_t(t)/λ_c(t)

This constant is the most common measure of treatment effect for time-to-event studies. Assume N_t (N_c) patients are randomly assigned to the treated (control) group. Patients are followed up until a certain time t. Either the event (e.g., death) has already occurred by time t or the patient is censored at this time point. From the start to the end of the study, assume there are k distinct times of death. The number of patients at risk just before the jth death time in the treated (control) group, N_tj (N_cj), equals the initial number of patients minus the number of patients who died or were censored in the treated (control) group before the jth death time, and N_j = N_cj + N_tj. At the jth death time, the number of deaths in the treated (control) group equals D_tj (D_cj), and let D_j = D_cj + D_tj. An approximate maximum likelihood estimate for the log hazard ratio of a specific trial is given by

    log(HR) = Σ_{j=1}^{k} [D_tj − N_tj (D_j /N_j)] / Σ_{j=1}^{k} [D_j (N_j − D_j) N_tj N_cj / (N_j² (N_j − 1))]

The asymptotic variance of this estimate is given by

    Var(log(HR)) = 1 / Σ_{j=1}^{k} [D_j (N_j − D_j) N_tj N_cj / (N_j² (N_j − 1))]

Again, we can express the numerator of the estimator of log(HR) for the ith trial as (O − E)_i and the denominator as V_i. Thus,

    log(HR_i) = (O − E)_i / V_i

with asymptotically

    log(HR_i) ∼ N(log(HR_i), V_i^{-1})

In a similar way as for the combination of odds ratios, the test statistic is given by

    X = (Σ_{i=1}^{I} log(HR_i) V_i)² / Σ_{i=1}^{I} V_i = (Σ_{i=1}^{I} (O − E)_i)² / Σ_{i=1}^{I} V_i ∼ χ²_1

and an estimate of the overall effect log(HR) is given by

    log(HR) = Σ_{i=1}^{I} log(HR_i) V_i / Σ_{i=1}^{I} V_i = Σ_{i=1}^{I} (O − E)_i / Σ_{i=1}^{I} V_i

with 95% confidence interval

    log(HR) ± 1.96 (Σ_{i=1}^{I} V_i)^{-1/2}

The estimate and the confidence interval for HR can be found by taking the exponential of log(HR) and of the lower and upper limits of its confidence interval:

    HR = exp(log(HR))

with 95% confidence interval

    [exp(log(HR) − 1.96 (Σ_{i=1}^{I} V_i)^{-1/2}); exp(log(HR) + 1.96 (Σ_{i=1}^{I} V_i)^{-1/2})]

An additional parameter of interest is the percent reduction in the hazard of the treated group as compared with the control group, given by

    100(1 − HR) = 100 [λ_c(t) − λ_t(t)] / λ_c(t)

which can thus be estimated by 100(1 − HR). Its variance is given by

    Var(100(1 − HR)) = (100(1 − HR))² / [(Σ_{i=1}^{I} (O − E)_i)² / Σ_{i=1}^{I} V_i]

with 95% confidence interval

    100(1 − HR) ± 1.96 (Var(100(1 − HR)))^{1/2}

3. Example

For patients with superficial bladder cancer, four EORTC trials were identified that compared the disease-free interval in patients who were randomized after transurethral resection to either adjuvant treatment or no further treatment (control) (11). The results of these four trials for patients between 60 and 70 years of age are shown in Table 3. The summary results in the (O − E) and V columns are presented twice, once based on an analysis of whether or not an event occurred (odds ratio) and once based on an analysis of the actual time to the event (hazard ratio).

Odds Ratio. An estimate of the overall or typical OR is given by

    OR = exp[(−1.77 − 1.82 − 6.05 − 0.74) / (5.27 + 6.45 + 5.16 + 9.22)] = 0.67

with 95% confidence interval equal to [0.46; 0.99]. The chi-square test statistic is equal to

    X = (−1.77 − 1.82 − 6.05 − 0.74)² / (5.27 + 6.45 + 5.16 + 9.22) = 4.13

and

    P(χ²_1 ≥ 4.13) = 0.042
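The computation above can be checked with a short script (an illustrative sketch of ours); the one-degree-of-freedom chi-square p value follows from the identity P(χ²_1 ≥ x) = erfc(√(x/2)).

```python
import math

# Sketch (ours): pooled odds ratio and chi-square test from the (O - E) and V of Table 3.
oe = [-1.77, -1.82, -6.05, -0.74]
v = [5.27, 6.45, 5.16, 9.22]

log_or = sum(oe) / sum(v)                  # overall log odds ratio
or_hat = math.exp(log_or)                  # 0.67
lo = math.exp(log_or - 1.96 / math.sqrt(sum(v)))   # 0.46
hi = math.exp(log_or + 1.96 / math.sqrt(sum(v)))   # 0.99
X = sum(oe) ** 2 / sum(v)                  # 4.13
p = math.erfc(math.sqrt(X / 2))            # P(chi2_1 >= X) = 0.042
```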
Table 3  Disease-Free Interval Statistics From Four Bladder Cancer Trials Comparing Adjuvant Treatment to Control for Patients Between 60 and 70 Years Old

                Patients                  Events              Odds ratio      Hazard ratio
Study     Treatment  No Treatment  Treatment  No Treatment    O−E      V      O−E      V
30751         82          35           55          26        −1.77    5.27   −5.4    15.1
30781         56          55           34          37        −1.82    6.45   −3.9    17.4
30791        114          25           53          19        −6.05    5.16   −9.4     8.2
30863         76          80           28          31        −0.74    9.22   −0.9    14.7
Total        328         195          170         113       −10.4    26.1   −19.6    55.4
The treatment effect is thus significant at the 5% level. The odds reduction is estimated to be 33%, with 95% confidence interval [1%; 64%].

Hazard Ratio. Using the same reasoning, the (O − E) and V statistics from each individual trial can be used to derive an estimate of the HR, 0.70, with 95% confidence interval [0.54; 0.91]. The chi-square statistic for testing the treatment difference equals 6.94, with a p value of 0.008. The estimate of the hazard reduction is 30%, with 95% confidence interval [8%; 52%].

The analysis based on the hazard ratio is the correct one to use in this case since it takes into account not just whether or not an event occurred but also the time at which it occurred. Use of this additional information provides a more convincing result.

D. Testing for Heterogeneity
Unlike a multicenter clinical trial, the trials that contribute to a meta-analysis are based on different protocols. As there can be differences between the trials with respect to the treatment regimens, the patient population, and the end point assessment, they can also differ in the size of the treatment effect. It is thus important to investigate whether the results in the different trials are similar. If the results of the different trials are heterogeneous, it is important to investigate further in which trials they differ and to try to identify the reasons for the differences (12).

Again a general framework can be constructed to test for heterogeneity based on reasoning similar to that in Section III.B. Testing for heterogeneity corresponds to testing the null hypothesis that the treatment effect is the same in all I trials

    H_0: τ_1 = ⋯ = τ_I = τ

The statistic by which heterogeneity can be tested is given by

    X_H = Σ_{i=1}^{I} (τ̂_i − τ̂)² ν_i^{-1}

If the null hypothesis is true and the treatment effects are similar, their estimates will also be close together and close to the overall estimate τ̂, which is based on the different estimates τ̂_i. It can be proven that under the null hypothesis

    X_H ∼ χ²_{I−1}

An alternative expression for X_H, which is easier to use in computations, can be derived:

    X_H = Σ_{i=1}^{I} (τ̂_i − τ̂)² ν_i^{-1}
        = Σ_{i=1}^{I} τ̂_i² ν_i^{-1} + τ̂² Σ_{i=1}^{I} ν_i^{-1} − 2τ̂ Σ_{i=1}^{I} τ̂_i ν_i^{-1}
        = Σ_{i=1}^{I} τ̂_i² ν_i^{-1} − τ̂² Σ_{i=1}^{I} ν_i^{-1}

using the fact that Σ_{i=1}^{I} τ̂_i ν_i^{-1} = τ̂ Σ_{i=1}^{I} ν_i^{-1}. In terms of the statistics (O − E)_i and V_i, for both binary and time-to-event data this becomes

    X_H = Σ_{i=1}^{I} (O − E)_i² / V_i − (Σ_{i=1}^{I} (O − E)_i)² / Σ_{i=1}^{I} V_i

Example

The statistic for testing heterogeneity for the time-to-event data described in Table 3 can be obtained by

    X_H = [(−5.4)²/15.1 + (−3.9)²/17.4 + (−9.4)²/8.2 + (−0.9)²/14.7]
        − (−5.4 − 3.9 − 9.4 − 0.9)² / (15.1 + 17.4 + 8.2 + 14.7) = 6.7

The p value is given by

    P(χ²_3 ≥ 6.7) = 0.08

which is not significant at the 5% level.

E. Testing for Interaction and Linear Trend
It is sometimes thought that certain subgroups of patients may benefit more from the treatment than other subgroups. If the division into subgroups is not based on ordered categories, only an interaction can be tested. If the division is based on an ordered variable (e.g., age groups), a test for linear trend can also be carried out.

Within each trial the patients are first divided into the G subgroups of interest (g = 1, . . . , G). For each of these subgroups within trial i, the statistics (O − E)_ig and V_ig are calculated as before. For each subgroup, the sum of these statistics over all the trials is calculated:

    (O − E)_g = Σ_{i=1}^{I} (O − E)_ig    and    V_g = Σ_{i=1}^{I} V_ig

An estimate of the log odds ratio for the gth subgroup can be obtained by

    log(OR_g) = (O − E)_g / V_g

and a test for interaction is then given by

    X_I = Σ_{g=1}^{G} (log(OR_g) − log(OR))² V_g

where log(OR) is the overall estimate. Thus, if the estimates of the odds ratios in the different subgroups are similar to each other, the test statistic will be small. Under the null hypothesis

    X_I ∼ χ²_{G−1}

In the case of time-to-event data, exactly the same formulae apply with OR replaced by HR. A computationally easier form in terms of (O − E)_g and V_g is given by

    X_I = Σ_{g=1}^{G} (O − E)_g² / V_g − (Σ_{g=1}^{G} (O − E)_g)² / Σ_{g=1}^{G} V_g

Testing for a trend with an ordered variable is more complex. The same statistics are calculated for each subgroup. A number (score) is then assigned to each of the subgroups according to their order, the subgroup with the lowest value being assigned the number 0 and the subgroup with the highest value the number G − 1. Subsequently, the following statistics are calculated:

    A = Σ_{g=1}^{G} V_g;   B = Σ_{g=1}^{G} (g − 1) V_g;   C = Σ_{g=1}^{G} (g − 1)² V_g;
    D = Σ_{g=1}^{G} (O − E)_g;   E = Σ_{g=1}^{G} (g − 1)(O − E)_g

The statistic used to test for linear trend is given by (13)

    X_T = (E − DB/A)² / (C − B²/A)

Under the null hypothesis of no trend between the subgroups

    X_T ∼ χ²_1

Example

Patients in the four previous trials were divided into three age groups. The statistics (O − E) and V for each of the subgroups from each of the trials are given in the fourth and fifth columns of the forest plot (discussed in the next section). The sum of these statistics over all the trials in a particular subgroup is presented in the subtotals line. The test statistic for interaction using the time-to-event data can then be derived from these statistics:

    X_I = [(−11)²/49.7 + (−19.7)²/55.4 + (−8.8)²/46.2]
        − (−11 − 19.7 − 8.8)² / (49.7 + 55.4 + 46.2) = 0.81

The p value is given by

    P(χ²_2 ≥ 0.81) = 0.67

which is clearly not significant. For the test for linear trend, the following intermediate results were found:

    A = 151.3;   B = 147.8;   C = 240.2;   D = −39.5;   E = −37.3

Therefore, the test statistic equals 0.017 and the p value is given by

    P(χ²_1 ≥ 0.017) = 0.90
IV. GRAPHICAL PRESENTATION OF THE RESULTS: FOREST PLOTS

A summary of the data can be presented in a forest plot (14): one line with the corresponding statistics is generated for each trial. In the case of subgroups, each subgroup is presented separately, and each trial that contributes to a subgroup is presented under the heading of that subgroup. An example of such a forest plot is given in Figure 1, which presents the results of the bladder meta-analysis with patients divided into different age subgroups. Within each of the three age groups, the four contributing trials are listed. On each line, the number of events as a fraction of the total number of patients randomized is shown for both the treatment and control groups, along with estimates of (O − E) and V. The estimate of the hazard ratio corresponds to the middle of the square, and the horizontal line gives the 99% confidence interval. An arrow is used to show that a confidence interval extends beyond the plotted range.
Figure 1 Forest plot of the disease-free interval for the combined analysis of four superficial bladder cancer trials comparing adjuvant treatment to control. Patients are divided into subgroups according to age.
If the confidence interval does not contain the value 1, there is a significant difference between treatment and control at the 1% level. The size of the square is proportional to the number of events in the subgroup within that trial and thus represents the amount of information available. The solid vertical line represents no treatment effect and corresponds to a hazard ratio of 1. If the estimate of the HR is to the left of this line, the difference favors the treatment arm. Note that a log scale is used for the hazard ratio, so that hazard ratios of 2 and 0.5 are the same distance from 1. A hazard ratio of 2 means that the hazard under treatment is double that under no treatment, whereas a hazard ratio of 0.5 means that the hazard under treatment is half that under no treatment. Thus, they have the same magnitude of effect, but the direction of the treatment difference is reversed. For each subgroup, the sum of the statistics is given along with the summary hazard ratio, represented by the middle of the diamond. The extremes of the diamond are the lower and upper limits of the 95% confidence interval. A test of heterogeneity between the trials within a subgroup is given below the summary statistics of that subgroup. All the subgroups have a test for heterogeneity that is significant at the 10% level. The percent reduction of the hazard, along with its standard deviation, is also given for each subgroup. The largest reduction, 30%, is found for patients between 60 and 70 years.

The lower part of the forest plot presents the overall tests. The bottom diamond shows the overall hazard ratio with its 95% confidence interval. The dashed line is extended vertically upward to allow comparison of this ‘‘typical’’ hazard ratio with the estimated hazard ratios for the subgroups and the trials within a subgroup. Below the overall hazard ratio, the test for the treatment effect is given; it is significant (p = 0.0013). The lower left-hand corner presents tests for interaction and trend, neither of which is significant.
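The quantities drawn on each trial line can be derived from (O − E) and V alone. This sketch (ours, not the chapter's) computes the hazard ratio and the 99% confidence interval plotted for trial 30751 in the 60-70 age group.

```python
import math

# Sketch (ours): point estimate and 99% CI drawn on one forest-plot trial line.
def forest_line(oe, v, z=2.576):           # z = 2.576 for a 99% interval
    log_hr = oe / v
    half = z / math.sqrt(v)                # half-width on the log scale
    return math.exp(log_hr), math.exp(log_hr - half), math.exp(log_hr + half)

hr, lo, hi = forest_line(-5.4, 15.1)       # trial 30751: HR about 0.70
```

The interval crosses 1, so on its own this trial would not be significant at the 1% level; the diamond for the pooled estimate, with its much larger ΣV, is far narrower.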
V. ADVANCED TECHNIQUES IN META-ANALYSIS
More advanced techniques have been developed with regard to the methods of analysis, especially with respect to investigating and explaining heterogeneity. Alternatives to the fixed effects model presented in this chapter have been proposed, the most important one being the random effects model (15). Whereas a fixed effects model assumes a single fixed treatment effect, in a random effects model the treatment effects of the different studies are assumed to come from a normal distribution whose variance describes the heterogeneity between the studies. The advantage of the random effects model is that it takes the trial-to-trial variability into consideration in deriving the variance of the treatment effect estimate. The merit of the fixed effects model is that it is relatively simple and easy to apply, and in most practical situations it leads to treatment effect estimates similar to those of the random effects model.
Bayesian methods have also been applied to meta-analyses (16). Two important graphical tools for checking assumptions are the Galbraith plot (17) and the funnel plot (18). Both present the estimated odds ratios from the different studies: the Galbraith plot is used to study heterogeneity and to detect outlying studies, while the funnel plot is used to assess publication bias.

Two additional meta-analytic techniques have been developed. Cumulative meta-analyses analyze and plot studies cumulatively in time and provide a tool to assess whether additional experimental evidence is needed to draw conclusions (19). Another proposed technique is the prospective meta-analysis, in which studies are prospectively designed to be incorporated afterward in a meta-analysis. This approach should, however, be used with great care, as it may lead to many small trials being carried out with little power to detect a difference.
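A minimal sketch of the random effects computation, ours rather than the chapter's, is the DerSimonian-Laird moment estimator (15) applied to the hazard-ratio statistics of Table 3; the between-trial variance estimate and the re-weighting below are the standard ones.

```python
import math

# Sketch (ours): DerSimonian-Laird random effects pooling from per-trial (O - E) and V.
oe = [-5.4, -3.9, -9.4, -0.9]
v = [15.1, 17.4, 8.2, 14.7]

theta = [o / vi for o, vi in zip(oe, v)]   # per-trial log hazard ratios
w = v[:]                                   # fixed-effect weights = inverse variances
q = sum(wi * t * t for wi, t in zip(w, theta)) - sum(oe) ** 2 / sum(w)  # Cochran's Q
k = len(oe)
tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

w_star = [1 / (1 / wi + tau2) for wi in w]           # random-effects weights
log_hr_re = sum(ws * t for ws, t in zip(w_star, theta)) / sum(w_star)
hr_re = math.exp(log_hr_re)                # close to, but not equal to, the fixed 0.70
```

Note that Q coincides with the heterogeneity statistic X_H of Section III.D; since here Q exceeds its degrees of freedom, the between-trial variance estimate is positive and the random effects interval would be somewhat wider than the fixed effects one.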
VI. DISCUSSION

While meta-analyses play a very important role in the decision-making process, they can be criticized, as can most scientific methods. Criticisms generally concern the selection of studies, the choice of end point, the interpretation of heterogeneity, and the generalization and application of the results. To a large extent these criticisms can be overcome by posing a well-formulated question, excluding improperly randomized trials, collecting the individual patient data, and using a hard end point such as survival. However, it must be recognized that meta-analyses are not a panacea. They most certainly should not be a replacement for large-scale randomized trials and should not be used as an excuse for conducting small, underpowered trials. In the absence of large-scale trials, meta-analyses, when properly carried out, provide the best overall evidence of a treatment effect. They have now replaced traditional literature reviews and have been instrumental in stimulating international cooperation and research. They have gained considerable credibility, largely due to the pioneering efforts of the Early Breast Cancer Trialists' Collaborative Group (13). Their work has shown that adjuvant therapy of early breast cancer improves survival and that for every 1 million such women treated, an extra 100,000 10-year survivors could result (20).
REFERENCES

1. Gelber RD, Goldhirsch A. Meta-analysis: the fashion of summing-up evidence. Part I. Rationale and conduct. Ann Oncol 1991; 2:461–468.
2. Olkin I. Meta-analysis: reconciling the results of independent studies. Stat Med 1995; 14:457–472.
3. Clarke M, Stewart L, Pignon JP, Bijnens L. Individual patient data meta-analyses in cancer. Br J Cancer 1998; 77:2036–2044.
4. Stewart LA, Parmar MKB. Meta-analyses of the literature or of individual patient data: is there a difference? Lancet 1993; 341:418–422.
5. Stewart LA, Clarke MJ. Practical methodology of meta-analyses (overviews) using updated individual patient data. Stat Med 1995; 14:2057–2079.
6. Jeng GT, Scott JR, Burmeister LF. A comparison of meta-analytic results using literature vs individual patient data. JAMA 1995; 274:830–836.
7. Fisher RA. Statistical Methods for Research Workers. 4th ed. London: Oliver and Boyd, 1932.
8. Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc B 1951; 13:238–241.
9. Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of randomized clinical trials. Stat Med 1991; 10:1665–1677.
10. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta-blockade during and after myocardial infarction: an overview of the randomized trials. Progr Cardiovasc Dis 1985; 27:335–371.
11. Pawinski A, Sylvester R, Kurth KH, Bouffioux C, van der Meijden A, Parmar MKB, Bijnens L. A combined analysis of European Organization for Research and Treatment of Cancer and Medical Research Council randomized clinical trials for the prophylactic treatment of stage TaT1 bladder cancer. J Urol 1996; 156:1934–1941.
12. Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. BMJ 1994; 309:1351–1355.
13. Early Breast Cancer Trialists' Collaborative Group. Treatment of Early Breast Cancer. Vol. 1. Worldwide Evidence 1985–1990. Oxford: Oxford University Press, 1990.
14. DeMets DL. Methods for combining randomized clinical trials: strengths and limitations. Stat Med 1987; 6:341–348.
15. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clin Trials 1986; 7:177–188.
16. DuMouchel W. Bayesian meta analysis. In: Berry DA, ed. Statistical Methodology in the Pharmaceutical Sciences. New York: Dekker, 1990.
17. Galbraith RF. A note on graphical presentation of estimated odds ratios from several clinical trials. Stat Med 1988; 7:889–894.
18. Egger M, Davey Smith G. Misleading meta-analysis. BMJ 1995; 310:752–754.
19. Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med 1992; 327:248–254.
20. Early Breast Cancer Trialists' Collaborative Group. Systemic treatment of early breast cancer by hormonal, cytotoxic, or immune therapy. Lancet 1992; 339:1–15, 71–85.
Index
Adverse event (AE), 4–5
Adverse drug reaction (ADR), 5, 23–24
Alternative hypothesis, 477–478
Area under the curve (AUC), 25–26
Blinding, 251
Bonferroni procedure, 151–152, 162, 234, 331
Cause-specific failure, 497–499 (See also Competing risks)
Censored data, right censoring, 414, 458
Change scores, 274
Classification and regression trees (CART) (See Recursive partitioning)
Common toxicity criteria, 4–5, 18, 23, 27
Competing losses (See Competing risks)
Competing risks, 139, 497–499, 513–523
Compliance, 495–496
Continual reassessment method (CRM), 12, 15–18, 20–21, 28, 35–72, 76–78, 81, 83–90
  stopping rule, 56–57
Cox regression (See Proportional hazards model)
Cumulative incidence, 513–522
Cut point model, 328–332, 421–423
Dose intensity, 493–495, 503–511
Early stopping (See Interim analysis)
Endpoints (See Outcome measures)
Equivalence trials, 173–187
  active controls, 175, 178
  bioequivalence, 173
  confidence intervals, 176
  interim analysis, 177
  nonsignificance, 174
  one-sided vs two-sided, 176
  sample size, 177
Explained variation, 397–409 (See also R-squared)
Exponential model, 131, 140–141, 162–163
  piecewise exponential, 133–135
Factorial trials, 138, 155–157, 161–171
False positives, 482–484
False negatives, 482–484
Forest plots, 539–541
Generalizability, 485–486
Generalized estimating equations (GEE), 281–282
Generalized linear models, 274
Global test, 152 (See also Overall test)
Good clinical practice (GCP), 26–27
Goodness of fit, 414
Grade (of toxicity) (See Toxicity grades)
Graphical methods, 411–456
  in meta-analyses, 539–541
Health related quality of life (See Quality of life)
Hyperbolic tangent model, 8, 42, 77
Informed consent, 27
Interactions, 138, 156–157, 162–171
  in meta-analyses, 537–539
Interim analyses, 189–209
  conditional power, 194–197
  early stopping rules, 189–209
  group sequential monitoring, 130, 136–137, 139, 190–209, 221
  information fraction, 137
  sequential monitoring, 129–130, 135–137, 139, 193, 211–228
  stochastic curtailment, 194–195
  stopping for no benefit, 194, 197–204, 206
International Committee on Harmonization (ICH), 26–27
Kaplan-Meier method, 131, 236, 274, 308, 426–430, 514–522
Landmark method, 493, 505
Logistic regression, 79, 435–438, 442–443
Logit model, 6, 8, 79
Martingale residuals (See Residuals)
Meta-analysis, 314, 506–508, 525–543
Missing at random (MAR), 274–285, 306
Missing completely at random (MCAR), 274–285, 306
Missing data, 233, 261–262, 274–285, 292, 303, 306
Missing not at random (MNAR), 275–285
Multiplicity, 149
Multiple comparisons, 138, 149–155, 161–162, 330
Multiple imputation, 283–284, 306
Neural networks, 349–359
Null hypothesis, 477–478
Ordered alternatives, 169
Ordered treatments, 124
Outcome measures, 229, 478–482
  cost, 229, 242–248, 291–320
  cost-effectiveness, 242–243, 292, 296–299
  cost-minimization, 292, 296–297
  cost-utility, 242–243, 292, 296–298
  Q-TwiST, 235–242, 245, 263, 285, 298
  quality adjusted life year (QALY), 292, 294, 296, 298
  quality adjusted survival, 235–242, 285
  quality of life, 229–242, 249–267, 269–289, 294–295
    CARES-SF, 262
    FACT, 272
    SF-36, 272
    Sickness Impact Profile, 230
Pattern analysis, 261
Pharmacokinetics, 25–26, 60–61, 81
Phase I trials, 1–91
  dose levels, 2–4, 8, 12–19, 21, 27–28, 40–41, 74, 77
  dose limiting toxicity (DLT), 2–7, 13–15, 18, 22–23, 27, 30, 74–76, 79–80, 83–90
  dose escalation scheme, 2–3, 9–24
    Bayesian, 18–19, 67–69
    best of five, 82–90
    continual reassessment method (CRM), 12, 15–18, 20–21, 28, 35–72, 76–78, 81, 83–90
      stopping rule, 56–57
    modified Fibonacci, 9–11, 75, 82
    three + three, 11–14, 19, 75–76, 82–90
    toxicity-response, 22–23
    up and down, 12, 14–15, 20–21, 28
  dose-response, 3, 6, 74, 77, 82
  dose-toxicity, 3–4, 7, 16, 21, 23–24, 27, 37–38, 41–42
  maximum tolerated dose (MTD), 2–6, 10–12, 14–21, 23–27, 29–30, 57, 73–76, 78–90
  starting dose, 3, 11, 21
  target dose, 40–41, 43
  target toxicity, 3, 38, 41, 74–76, 79, 83–90
  two-stage designs, 48–50, 78–80, 83–90
Phase II trials, 93–127
  attained sample size, 94–95
  confidence intervals, 97–99
  expected sample size, 94–95
  Gehan design, 93
  multi-arm, 99–100, 119–127
  multiple endpoints, 100–102, 105–118, 122–123
  two-stage designs, 93–103
Phase III trials, 129–228
  equivalence, 173–187
  factorial, 138, 155–157, 161–171
  multi-arm, 138, 149–171
  sample size, 129–147
  two-arm, 129–147
Probit model, 6, 8
Prognostic factors, 321–472
Proportional hazards model, 129–130, 132–133, 138–139, 162, 180, 190, 216, 274, 308, 310, 333–341, 359–367, 380–390, 402, 411–432, 438–454, 457, 532–536
  stratified model, 138, 451–453
Publication bias, 476, 526
Recursive partitioning, 341–349, 359–364, 384–385, 391–392, 457–472
  splitting rules, 459–462
  pruning, 462–465
  staging, 465–467
Regression trees (See Recursive partitioning)
Repeated measures, 233–234, 261–262, 273
Residuals, 415–426
R-squared, 363–364, 400, 404–408
Sample size
  binomial trials, 131
  cost studies, 302–303
  equivalence trials, 177
  multi-arm trials, 153–154
  one-sided vs two-sided tests, 131
  phase I trials, 19, 27
  phase II trials, 94–95
  phase III trials, 129–147, 153–154
  prognostic factor studies, 364
  quality of life studies, 257–260
  selection designs, 121–122
  two-stage designs, 48–50, 78–80, 83–90
Selection designs, 99–100, 119–127, 157
  binary outcomes, 120–121
  sample size, 121–122
  survival outcomes, 121–122
  ordered treatments, 124
Selection bias, 526
Sequential probability ratio test, 219–220, 224
Stage migration, 476–477
Stochastic approximation, 62–64
Surrogate endpoints, 480–482
Survival by response, 492–493
Time-dependent covariates, 416–417, 505
Tolerance distribution, 6
Toxicity grades, 5, 24, 49
Transformation of covariates, 441–447
Tree-based methods (See Recursive partitioning)
Triangular test, 211–228
Utilities, 232–233
Weibull model, 6