Tutorials in Biostatistics
Tutorials in Biostatistics Volume 2: Statistical Modelling of Complex Medical Data Edited by R. B. D’Agostino, Boston University, USA
Copyright © 2004
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England Telephone
(+44) 1243 779777
Email (for orders and customer service enquiries):
[email protected] Visit our Home Page on www.wileyeurope.com or www.wiley.com All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to
[email protected], or faxed to (+44) 1243 770571. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought. Other Wiley Editorial Offices John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809 John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1 British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN 0-470-02370-8 Typeset by Macmillan India Ltd Printed and bound in Great Britain by Page Bros, Norwich This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.
Contents

Preface
vii
Preface to Volume 2
ix
Part I MODELLING A SINGLE DATA SET 1.1 Clustered Data Extending the Simple Linear Regression Model to Account for Correlated Responses: An Introduction to Generalized Estimating Equations and Multi-Level Mixed Modelling. Paul Burton, Lyle Gurrin and Peter Sly
3
1.2 Hierarchical Modelling An Introduction to Hierarchical Linear Modelling. Lisa M. Sullivan, Kimberly A. Dukes and Elena Losina
35
Multilevel Modelling of Medical Data. Harvey Goldstein, William Browne and Jon Rasbash
69
Hierarchical Linear Models for the Development of Growth Curves: An Example with Body Mass Index in Overweight/Obese Adults. Moonseong Heo, Myles S. Faith, John W. Mott, Bernard S. Gorman, David T. Redden and David B. Allison
95
1.3 Mixed Models Using the General Linear Mixed Model to Analyse Unbalanced Repeated Measures and Longitudinal Data. Avital Cnaan, Nan M. Laird and Peter Slasor
127
Modelling Covariance Structure in the Analysis of Repeated Measures Data. Ramon C. Littell, Jane Pendergast and Ranjini Natarajan
159
Covariance Models for Nested Repeated Measures Data: Analysis of Ovarian Steroid Secretion Data. Taesung Park and Young Jack Lee
187
1.4 Likelihood Modelling Likelihood Methods for Measuring Statistical Evidence. Jeffrey D. Blume
209
Part II MODELLING MULTIPLE DATA SETS: META-ANALYSIS Meta-Analysis: Formulating, Evaluating, Combining, and Reporting. Sharon-Lise T. Normand
249
Advanced Methods in Meta-Analysis: Multivariate Approach and Meta-Regression. Hans C. van Houwelingen, Lidia R. Arends and Theo Stijnen
289
Part III MODELLING GENETIC DATA: STATISTICAL GENETICS Genetic Epidemiology: A Review of the Statistical Basis. E. A. Thompson
327
Genetic Mapping of Complex Traits. Jane M. Olson, John S. Witte and Robert C. Elston
339
A Statistical Perspective on Gene Expression Data Analysis. Jaya M. Satagopan and Katherine S. Panageas
361
Part IV DATA REDUCTION OF COMPLEX DATA SETS Statistical Approaches to Human Brain Mapping by Functional Magnetic Resonance Imaging. Nicholas Lange
383
Disease Map Reconstruction. Andrew B. Lawson
423
Part V SIMPLIFIED PRESENTATION OF MULTIVARIATE DATA Presentation of Multivariate Data for Clinical Use: The Framingham Study Risk Score Functions. Lisa M. Sullivan, Joseph M. Massaro and Ralph B. D’Agostino, Sr.
447
Index
477
Preface

The development and use of statistical methods has grown exponentially over the last two decades. Nowhere is this more evident than in their application to biostatistics and, in particular, to clinical medical research. To keep abreast with the rapid pace of development, the journal Statistics in Medicine alone is published 24 times a year. Here and in other journals, books and professional meetings, new theory and methods are constantly presented. However, the transitions of the new methods to actual use are not always as rapid. There are problems and obstacles. In such an applied interdisciplinary field as biostatistics, in which even the simplest study often involves teams of researchers with varying backgrounds and which can generate massive complicated data sets, new methods, no matter how powerful and robust, are of limited use unless they are clearly understood by practitioners, both clinical and biostatistical, and are available with well-documented computer software.

In response to these needs Statistics in Medicine initiated in 1996 the inclusion of tutorials in biostatistics. The main objective of these tutorials is to generate, in a timely manner, brief well-written articles on biostatistical methods; these should be complete enough so that the methods presented are accessible to a broad audience, with sufficient information given to enable readers to understand when the methods are appropriate, to evaluate applications and, most importantly, to use the methods in their own research. At first tutorials were solicited from major methodologists. Later, both solicited and unsolicited articles were, and are still, developed and published. In all cases major researchers, methodologists and practitioners wrote and continue to write the tutorials.

Authors are guided by four goals. The first is to develop an introduction suitable for a well-defined audience (the broader the audience the better). The second is to supply sufficient references to the literature so that the readers can go beyond the tutorial to find out more about the methods. The referenced literature is, however, not expected to constitute a major literature review. The third goal is to supply sufficient computer examples, including code and output, so that the reader can see what is needed to implement the methods. The final goal is to make sure the reader can judge applications of the methods and apply the methods.

The tutorials have become extremely popular and heavily referenced, attesting to their usefulness. To further enhance their availability and usefulness, we have gathered a number of these tutorials and present them in this two-volume set. Each volume has a brief preface introducing the reader to the aims and contents of the tutorials. Here we present an even briefer summary.

We have arranged the tutorials by subject matter, starting in Volume 1 with 18 tutorials on statistical methods applicable to clinical studies, both observational studies and controlled clinical trials. Two tutorials discussing the computation of epidemiological rates such as prevalence, incidence and lifetime rates for cohort studies and capture–recapture settings begin the volume. Propensity score adjustment methods and agreement statistics such as the kappa statistic are dealt with in the next two tutorials. A series of tutorials on survival analysis methods applicable to observational study data are next. We then present five tutorials on the development of prognostics or clinical prediction models.
Finally, there are six tutorials on clinical trials. These range from designing
and analysing dose response studies and Bayesian data monitoring to analysis of longitudinal data and generating simple summary statistics from longitudinal data. All these are in the context of clinical trials. In all tutorials, the reader is given guidance on the proper use of methods.

The subject-matter headings of Volume 1 are, we believe, appropriate to the methods. The tutorials are, however, often broader. For example, the tutorials on the kappa statistics and survival analysis are useful not only for observational studies, but also for controlled clinical studies. The reader will, we believe, quickly see the breadth of the methods.

Volume 2 contains 16 tutorials devoted to the analysis of complex medical data. First, we present tutorials relevant to single data sets. Seven tutorials give extensive introductions to and discussions of generalized estimating equations, hierarchical modelling and mixed modelling. A tutorial on likelihood methods closes the discussion of single data sets. Next, two extensive tutorials cover the concepts of meta-analysis, ranging from the simplest conception of a fixed effects model to random effects models, Bayesian modelling and highly involved models involving multivariate regression and meta-regression. Genetic data methods are covered in the next three tutorials. Statisticians must become familiar with the issues and methods relevant to genetics. These tutorials offer a good starting point. The next two tutorials deal with the major task of data reduction for functional magnetic resonance imaging data and disease mapping data, covering the complex data methods required by multivariate data. Complex and thorough statistical analyses are of no use if researchers cannot present results in a meaningful and usable form to audiences beyond those who understand statistical methods and complexities. Readers should find the methods for presenting such results discussed in the final tutorial simple to understand.

Before closing this preface to the two volumes we must state a disclaimer. Not all the tutorials that are in these two volumes appeared as tutorials. Three were regular articles. These are in the spirit of tutorials and fit well within the theme of the volumes.

We hope that readers enjoy the tutorials and find them beneficial and useful.

RALPH B. D’AGOSTINO, SR.
Editor
Boston University
Harvard Clinical Research Institute
Preface to Volume 2

The 16 tutorials in this volume address the statistical modelling of complex medical data. Here we have topics covering data involving correlations among subjects in addition to hierarchical or covariance structures within a single data set, multiple data sets requiring meta-analyses, complex genetic data, and massive data sets resulting from functional magnetic resonance imaging and disease mapping. Here we briefly mention the general themes of the five parts of the volume and the articles within them.

Part I is concerned with modelling a single data set. Section 1.1, on clustered data, presents a tutorial by Burton, Gurrin and Sly which is an introduction to methods dealing with correlations among subjects in a single data set. The generalized estimating equations method and the multilevel mixed model methods are nicely introduced as extensions of linear regression. This is a wonderful introduction to these important topics.

Section 1.2 deals with hierarchical modelling and contains three tutorials. The first, by Sullivan, Dukes and Losina, is an excellent introduction to the basic concepts of hierarchical models, and gives many examples clarifying the models and their computational aspects. Next is an article by Goldstein, Browne and Rasbash which presents more sophisticated hierarchical modelling methods. These two tutorials are among the most popular in the series of tutorials. The last tutorial in this section, by Heo, Faith, Mott, Gorman, Redden and Allison, is a carefully developed example of hierarchical modelling methods applied to the development of growth curves.

Section 1.3 is on the major area of mixed models. Three major tutorials are included here. The first is by Cnaan, Laird and Slasor, the second by Littell, Pendergast and Natarajan, and the third by Park and Lee. These tutorials address the complexity of modelling mixed model data and covariance structures where there are longitudinal data measured, possibly, at unequal intervals and with missing data. All contain extensive examples with ample computer and visual analyses. Further, they carefully illustrate the use of major computer software such as SAS (Proc Mixed) and BMDP. These tutorials are major tools for learning about these methods and understanding how to use them.

Section 1.4 contains a single article by Blume on the use of the likelihood ratio for measuring the strength of statistical evidence. This clarifies the basic concepts of modelling data and illustrates the importance and central role of the likelihood model.

Part II, ‘Modelling Multiple Data Sets: Meta-Analysis’, contains two tightly written articles. These tutorials cover the concepts of meta-analysis ranging from the simplest conception of a fixed effects model to random effects models, Bayesian modelling and highly involved models involving multivariate analysis and meta-regression. The first article, by Normand, deals with formulating the meta-analysis problem, evaluating the data available for a meta-analysis, combining data for the meta-analysis and reporting the meta-analysis. She presents fixed effects and random effects models as well as three modes of inference: maximum likelihood, restricted maximum likelihood and Bayesian. The second article is by van Houwelingen, Arends and Stijnen and carefully describes more sophisticated methods such as multivariate methods and meta-regression. These two articles constitute a major review of meta-analysis.
Genetic concepts and analyses permeate almost all clinical problems today, and statisticians must become familiar with the issues and methods relevant to genetics. The three tutorials in Part III, ‘Modelling Genetic Data: Statistical Genetics’, should offer a good start. The first, by Thompson, reviews the statistical basis for genetic epidemiology. This article did not appear as a tutorial and pre-dates the tutorials by 10 years. It is, however, an excellent introduction to the topic and fits well with the mission of the tutorials. The next two tutorials cover genetic mapping of complex traits (by Olson, Witte and Elston) and a statistical perspective on gene expression data analysis (by Satagopan and Panageas). The former discusses methods for finding genes that contribute to complex human traits. The latter involves the analysis of microarray data. These methods deal with the simultaneous analysis of several thousand genes. Both contain careful development of concepts such as linkage analysis, transmission/disequilibrium tests and gene expression. Both set out the complexities of dealing with large genetics data sets generated from potentially a small number of subjects. The second tutorial further provides S-Plus and SAS computer examples and codes.

Part IV is on data reduction of complex data sets, and consists of two important tutorials. The first, by Lange, deals with the data reduction and analysis methods related to functional magnetic resonance imaging. This is among the longest tutorials and carefully reviews the vocabulary and methods. It supplies an excellent introduction to this important topic that will occupy more and more of the statistical community’s time. The second tutorial, by Lawson, is on disease map construction. It focuses on identifying the major issues in disease variations and the spatial analysis techniques aimed at good presentation of disease mappings with minimal noise.

Part V, on simplified presentation of multivariate data, contains one tutorial by Sullivan, Massaro and D’Agostino. The premise of the tutorial is that complex and thorough statistical analyses are of no use if researchers cannot present results in a meaningful and usable form to a broad audience, an audience beyond those who understand statistical methods. The methods for presenting such results discussed in the tutorial are simple to understand. They were developed to help the Framingham Heart Study present its findings to the medical community. The methods reduce multivariate models to simple scoring algorithms.

Again, we hope that readers will find these tutorials enjoyable and useful.
Part I MODELLING A SINGLE DATA SET
1.1 Clustered Data
1.2 Hierarchical Modelling
TUTORIAL IN BIOSTATISTICS

Multilevel modelling of medical data

Harvey Goldstein*, William Browne and Jon Rasbash
Institute of Education, University of London, London, U.K.
SUMMARY

This tutorial presents an overview of multilevel or hierarchical data modelling and its applications in medicine. A description of the basic model for nested data is given and it is shown how this can be extended to fit flexible models for repeated measures data and more complex structures involving cross-classifications and multiple membership patterns within the software package MLwiN. A variety of response types are covered and both frequentist and Bayesian estimation methods are described. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS:
complex data structures; mixed model; multilevel model; random effects model; repeated measures
1. SCOPE OF TUTORIAL

The tutorial covers the following topics:

1. The nature of multilevel models with examples.
2. Formal model specification for the basic Normal (nested structure) linear multilevel model with an example.
3. The MLwiN software.
4. More complex data structures: complex variance, multivariate models and cross-classified and multiple membership models.
5. Discrete response models, including Poisson, binomial and multinomial error distributions.
6. Specific application areas including survival models, repeated measures models, spatial models and meta-analysis.
7. Estimation methods, including maximum and quasi likelihood, and MCMC.

Further information about multilevel modelling and software details can be obtained from the web site of the Multilevel Models Project, http://multilevel.ioe.ac.uk/.
*Correspondence to: Harvey Goldstein, Institute of Education, University of London, 20 Bedford Way, London WC1H 0AL, U.K.
E-mail: [email protected]
2. THE NATURE OF MULTILEVEL MODELS

Traditional statistical models were developed making certain assumptions about the nature of the dependency structure among the observed responses. Thus, in the simple regression model $y_i = \beta_0 + \beta_1 x_i + e_i$ the standard assumption is that the $y_i$ given $x_i$ are independently identically distributed (i.i.d.), and the same assumption holds also for generalized linear models. In many real life situations, however, we have data structures, whether observed or by design, for which this assumption does not hold.

Suppose, for example, that the response variable is the birthweight of a baby and the predictor is, say, maternal age, and data are collected from a large number of maternity units located in different physical and social environments. We would expect that the maternity units would have different mean birthweights, so that knowledge of the maternity unit already conveys some information about the baby. A more suitable model for these data is now

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij} \quad (1)$$

where we have added another subscript to identify the maternity unit and included a unit-specific effect $u_j$ to account for mean differences amongst units. If we assume that the maternity units are randomly sampled from a population of units, then the unit-specific effect is a random variable and (1) becomes a simple example of a two-level model. Its complete specification, assuming Normality, can be written as follows:

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}, \quad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2), \quad \mathrm{cov}(u_j, e_{ij}) = 0 \quad (2)$$
$$\mathrm{cov}(y_{i_1 j}, y_{i_2 j} \mid x_{ij}) = \sigma_u^2 > 0$$

where $i_1, i_2$ are two births in the same unit $j$ with, in general, a positive covariance between the responses. This lack of independence, arising from two sources of variation at different levels of the data hierarchy (births and maternity units), contradicts the traditional linear model assumption and leads us to consider a new class of models. Model (2) can be elaborated in a number of directions, including the addition of further covariates or levels of nesting. An important direction is where the coefficient $\beta_1$ (and any further coefficients) is allowed to have a random distribution. Thus, for example, the age relationship may vary across clinics and, with a slight generalization of notation, we may now write (2) as

$$y_{ij} = \beta_{0ij} x_{0ij} + \beta_{1j} x_{1ij}$$
$$\beta_{0ij} = \beta_0 + u_{0j} + e_{0ij}, \quad \beta_{1j} = \beta_1 + u_{1j}, \quad x_{0ij} = 1$$
$$\mathrm{var}(u_{0j}) = \sigma_{u0}^2, \quad \mathrm{var}(u_{1j}) = \sigma_{u1}^2, \quad \mathrm{cov}(u_{0j}, u_{1j}) = \sigma_{u01}, \quad \mathrm{var}(e_{0ij}) = \sigma_{e0}^2 \quad (3)$$
and in later sections we shall introduce further elaborations. The regression coefficients $\beta_0$, $\beta_1$ are usually referred to as the ‘fixed parameters’ of the model and the set of variances and covariances as the random parameters. Model (3) is often referred to as a ‘random coefficient’ or ‘mixed’ model. At this point we note that we can introduce prior distributions for the parameters of (3), so allowing Bayesian models. We leave this topic, however, for a later section where we discuss MCMC estimation.

Another, instructive, example of a two-level data structure for which a multilevel model provides a powerful tool is that of repeated measures data. If we measure the weight of a sample of babies after birth at successive times then the repeated occasion of measurement becomes the lowest level unit of a two-level hierarchy where the individual baby is the level-2 unit. In this case model (3) would provide a simple description with $x_{1ij}$ being time or age. In practice linear growth will be an inadequate description and we would wish to fit at least a (spline) polynomial function, or perhaps a non-linear function where several coefficients varied randomly across individual babies, that is each baby has its own growth pattern. We shall return to this example in more detail later, but for now note that an important feature of such a characterization is that it makes no particular requirements for every baby to be measured at the same time points or for the time points to be equally spaced.

The development of techniques for specifying and fitting multilevel models since the mid-1980s has produced a very large class of useful models. These include models with discrete responses, multivariate models, survival models, time series models etc. In this tutorial we cannot cover the full range but will give references to existing and ongoing work that readers may find helpful. In addition the introductory book by Snijders and Bosker [1] and the edited collection of health applications by Leyland and Goldstein [2] may be found useful by readers. A detailed introduction to the two-level model with worked examples and discussion of hypothesis tests and basic estimation techniques is given in an earlier tutorial [3] that also gives details of two computer packages, HLM and SAS, that can perform some of the analyses we describe in the present tutorial. The MLwiN software has been specifically developed for fitting very large and complex models, using both frequentist and Bayesian estimation and it is this particular set of features that we shall concentrate on.
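As a concrete, hedged illustration of models (1)-(3), the sketch below fits a random-intercept and then a random-coefficient model with the Python package statsmodels; the tutorial itself uses MLwiN, so this is only an assumed translation, and the data file and column names (births.csv, birthweight, maternal_age, unit) are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per birth, with its maternity unit identifier.
df = pd.read_csv("births.csv")

# Model (1)/(2): a random intercept u_j for each maternity unit.
m_ri = smf.mixedlm("birthweight ~ maternal_age", data=df,
                   groups=df["unit"]).fit(reml=True)

# Model (3): the maternal-age coefficient also varies randomly across units.
m_rc = smf.mixedlm("birthweight ~ maternal_age", data=df,
                   groups=df["unit"],
                   re_formula="~ maternal_age").fit(reml=True)
print(m_rc.summary())   # fixed parameters plus the random-parameter (co)variances

The same two models could equally be set up in MLwiN, SAS, Stata or R; the point of the sketch is only the distinction between a random intercept and a random coefficient.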
3. MARGINAL VERSUS HIERARCHICAL MODELS

At this stage it is worth emphasizing the distinction between multilevel models and so-called ‘marginal’ models such as the GEE model [4, 5]. When dealing with hierarchical data these latter models typically start with a formulation for the covariance structure, for example, but not necessarily, based upon a multilevel structure such as (3), and aim to provide estimates with acceptable properties only for the fixed parameters in the model, treating the existence of any random parameters as a necessary ‘nuisance’ and without providing explicit estimates for them. More specifically, the estimation procedures used in marginal models are known to have useful asymptotic properties in the case where the exact form of the random structure is unknown.

If interest lies only in the fixed parameters, marginal models may be useful. Even here, however, they may be inefficient if they utilize a covariance structure that is substantially incorrect. They are, however, generally more robust than multilevel models to serious mis-specification of the covariance structure [6]. Fundamentally, however, marginal models address different research questions. From a multilevel perspective, the failure explicitly to model the covariance structure of complex data is to ignore information about variability that, potentially, is as important as knowledge of the average or fixed effects. Thus, in the simple repeated measures example of baby weights, knowledge of how individual growth rates vary between babies, possibly differentially according to, say, demographic factors, will be important data and in a later section we will show how such information can be used to provide efficient predictions in the case of human growth. When we discuss discrete response multilevel models we will show how to obtain information equivalent to that obtained from marginal models. Apart from that, the remainder of this paper will be concerned with multilevel models. For a further discussion of the limitations of marginal models see the paper by Lindsey and Lambert [7].
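For comparison, a marginal (GEE) fit of the same fixed part can be sketched as follows; the within-unit covariance is treated as a working ‘nuisance’ structure and only the fixed parameters are reported, with robust standard errors. Again this is a hedged illustration using statsmodels rather than the software discussed in the tutorial, with the same hypothetical data set as above.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("births.csv")   # same hypothetical data as in the previous sketch

gee = smf.gee("birthweight ~ maternal_age", groups="unit", data=df,
              cov_struct=sm.cov_struct.Exchangeable(),
              family=sm.families.Gaussian()).fit()
print(gee.summary())   # sandwich standard errors for the fixed effects only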
4. ESTIMATION FOR THE MULTIVARIATE NORMAL MODEL

We write the general Normal two-level model as follows, with natural extensions to three or more levels:

$$Y = X\beta + E, \quad Y = \{y_{ij}\}, \quad X = \{X_{ij}\}, \quad X_{ij} = \{x_{0ij}, x_{1ij}, \ldots, x_{pij}\}$$
$$E = E^{(2)} + E^{(1)}$$
$$E^{(2)} = \{E_j^{(2)}\}, \quad E_j^{(2)} = z_j^{(2)} e_j^{(2)}, \quad z_j^{(2)} = \{z_{ij}^{(2)}\}$$
$$z_{ij}^{(2)} = \{z_{0ij}^{(2)}, z_{1ij}^{(2)}, \ldots, z_{q_2 ij}^{(2)}\}, \quad e_j^{(2)} = \{e_{0j}^{(2)}, e_{1j}^{(2)}, \ldots, e_{q_2 j}^{(2)}\}^{\mathrm{T}}$$
$$E^{(1)} = \{E_{ij}^{(1)}\}, \quad E_{ij}^{(1)} = z_{ij}^{(1)} e_{ij}^{(1)}$$
$$z_{ij}^{(1)} = \{z_{0ij}^{(1)}, z_{1ij}^{(1)}, \ldots, z_{q_1 ij}^{(1)}\}, \quad e_{ij}^{(1)} = \{e_{0ij}^{(1)}, e_{1ij}^{(1)}, \ldots, e_{q_1 ij}^{(1)}\}^{\mathrm{T}}$$
$$e^{(2)} = \{e_j^{(2)}\}, \quad e_j^{(1)} = \{e_{ij}^{(1)}\}$$
$$e^{(2)} \sim N(0, \Omega_2), \quad e_j^{(1)} \sim N(0, \Omega_{1j}) \quad [\text{typically } \Omega_{1j} = \Omega_1]$$
$$E(e_{hj}^{(2)} e_{h'j'}^{(2)})_{j \neq j'} = E(e_{hij}^{(1)} e_{h'i'j'}^{(1)})_{i \neq i'} = E(e_{hj}^{(2)} e_{h'i'j'}^{(1)}) = 0$$

which yields the block diagonal structure

$$V = E(\tilde{Y}\tilde{Y}^{\mathrm{T}}) = \bigoplus_j (V_{2j} + V_{1j}), \quad \tilde{Y} = Y - X\beta \quad (4)$$
$$V_{2j} = z_j^{(2)} \Omega_2 z_j^{(2)\mathrm{T}}, \quad V_{1j} = \bigoplus_i z_{ij}^{(1)} \Omega_{1j} z_{ij}^{(1)\mathrm{T}}$$

In this formulation we allow any number of random effects or coefficients at each level; we shall discuss the interpretation of multiple level-1 random coefficients in a later section. A number of efficient algorithms are available for obtaining maximum likelihood (ML) estimates for (4). One [8] is an iterative generalized least squares procedure (IGLS) that will also produce restricted maximum likelihood estimates (RIGLS or REML) and is formally equivalent to a Fisher scoring algorithm [9]. Note that RIGLS or REML should be used in small samples to correct for the underestimation of IGLS variance estimates. The EM algorithm can also be used [10, 11]. Our examples use RIGLS (REML) estimates as implemented in the MLwiN software package [12] and we will also discuss Bayesian models.

A simple description of the IGLS algorithm is as follows. From (4) we have

$$V = E(\tilde{Y}\tilde{Y}^{\mathrm{T}}) = \bigoplus_j (V_{2j} + V_{1j}), \quad \tilde{Y} = Y - X\beta$$

The IGLS algorithm proceeds by first carrying out a GLS estimation for the fixed parameters ($\beta$) using a working estimator of $V$. The vectorized cross-product matrix of ‘raw’ residuals $\hat{\tilde{Y}}\hat{\tilde{Y}}^{\mathrm{T}}$, where $\hat{\tilde{Y}} = Y - X\hat{\beta}$, is then used as the response in a GLS estimation where the explanatory variable design matrix is determined by the last line of (4). This provides updated estimates for the $\Omega_{1j}$ and $\Omega_2$ and hence $V$. The procedure is repeated until convergence. In the simple case we have been considering so far where the level-1 residuals are i.i.d., for a level-2 unit (individual) with just three level-1 units (occasions) there are just six distinct raw residual terms and the level-1 component $V_{1j}$ is simply $\sigma_e^2 I_3$. Written as a vector of the lower triangle this becomes

$$\sigma_e^2 (1, 0, 1, 0, 0, 1)^{\mathrm{T}} \quad (5)$$

and the vector of ones and zeroes becomes the level-1 explanatory variable for the GLS estimation, in this case providing the coefficient that is the estimator of $\sigma_e^2$. Similarly, for a model where there is a single variance term at level 2, the level-2 component $V_{2j}$ written as a lower triangle vector is

$$\sigma_u^2 (1, 1, 1, 1, 1, 1)^{\mathrm{T}}$$

Goldstein [13] shows that this procedure produces maximum likelihood estimates under Normality.
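The following is a simplified numerical sketch of the IGLS idea for a variance-components (random-intercept) model, written directly from equations (4) and (5); it is not the MLwiN implementation. For brevity the variance step below uses unweighted least squares on the vectorized residual cross-products, whereas full IGLS/RIGLS weights that regression by the current estimate of V; all names are illustrative.

import numpy as np

def igls_variance_components(y_groups, X_groups, n_iter=20):
    """y_groups/X_groups: lists with one response vector / design matrix per level-2 unit."""
    sigma2_e, sigma2_u = 1.0, 1.0                      # working values
    for _ in range(n_iter):
        # GLS step for the fixed parameters beta, given the current V_j
        XtVX, XtVy = 0.0, 0.0
        for y, X in zip(y_groups, X_groups):
            n = len(y)
            V = sigma2_e * np.eye(n) + sigma2_u * np.ones((n, n))
            Vinv = np.linalg.inv(V)
            XtVX = XtVX + X.T @ Vinv @ X
            XtVy = XtVy + X.T @ Vinv @ y
        beta = np.linalg.solve(XtVX, XtVy)

        # Simplified step for the random parameters: regress the lower-triangle
        # elements of the raw residual cross-products on the design vectors of (5).
        Z_rows, r_rows = [], []
        for y, X in zip(y_groups, X_groups):
            resid = y - X @ beta
            n = len(y)
            idx = np.tril_indices(n)
            cross = np.outer(resid, resid)[idx]        # vectorized lower triangle
            z_e = np.eye(n)[idx]                       # 1 on diagonal positions, 0 elsewhere
            z_u = np.ones((n, n))[idx]                 # 1 everywhere
            Z_rows.append(np.column_stack([z_e, z_u]))
            r_rows.append(cross)
        Z, r = np.vstack(Z_rows), np.concatenate(r_rows)
        # Note: the variance estimates are not constrained to be non-negative here.
        sigma2_e, sigma2_u = np.linalg.lstsq(Z, r, rcond=None)[0]
    return beta, sigma2_e, sigma2_u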
5. THE MLwiN SOFTWARE

MLwiN has been under development since the late 1980s, first as a command-driven DOS based program, MLn, and since 1998 in a fully-fledged Windows version, currently in release 1.10. It is produced by the Multilevel Models Project based within the Institute of Education, University of London, and supported largely by project funds from the U.K. Economic and Social Research Council. The software has been developed alongside advances in methodology and with the preparation of manuals and other training materials.

Procedures for fitting multilevel models are now available in several major software packages such as STATA, SAS and S-Plus. In addition there are some special purpose packages, which are tailored to particular kinds of data or models. MIXOR provides ML estimation for multi-category responses and HLM is used widely for educational data. See Zhou et al. [14] for a recent review and Sullivan et al. [3] for a description of the use of HLM and SAS. Many of the models discussed here can also be fitted readily in the general purpose MCMC software package WinBUGS [15]. MLwiN has some particular advanced features that are not available in other packages and it also has a user interface designed for fully interactive use. In later sections we will illustrate some of the special features and models available in MLwiN but first give a simple illustration of the user interface.

We shall assume that the user wishes to fit the simple two-level model given by (1). In this tutorial we cannot describe all the features of MLwiN, but it does have general facilities for data editing, graphing, tabulation and simple statistical summaries, all of which can be accessed through drop-down menus. In addition it has a macro language, which can be used, for example, to run simulations or to carry out special purpose modelling. One of the main features is the method MLwiN uses to set up a model, via an ‘equation window’ in which the user specifies a model in more or less exactly the format it is usually written. Thus to specify model (1) the user would first open the equation window which, prior to any model being specified, would be as shown in Figure 1. This is the default null model with a response that is Normal with fixed predictor represented by $X$ and covariance matrix represented by $\Omega$. Clicking on the N symbol delivers a drop-down menu
Figure 1. Default equation screen with model unspecified.
Figure 2. Equation screen with model display.
which allows the user to change the default distribution to binomial, Poisson or negative binomial. Clicking on the response $y$ allows the user to identify the response variable from a list and also the number and identification for the hierarchical levels. Clicking on the $x_0$ term allows this to be selected from a list and also whether its coefficient $\beta_0$ is random at particular levels of the data hierarchy. Adding a further predictor term is also a simple matter of clicking an ‘add term’ button and selecting a variable. There are simple procedures for specifying general interaction terms. Model (1), including a random coefficient for $x_1$ in its general form as given by (3), will be displayed in the equation window as shown in Figure 2. Clicking on the ‘Estimates’ button will toggle the parameters between their symbolic representations and the actual estimates after a run. Likewise, the ‘Name’ button will toggle actual variable names on and off. The ‘Subs’ button allows the user to specify the form of subscripts, for example giving them names such as in the screen shown in Figure 3, where we also show the estimates and standard errors from an iterative fit. In the following sections we will show some further screen shots of models and results.
6. A GROWTH DATA EXAMPLE We start with some simple repeated measures data and we shall use them to illustrate models of increasing complexity. The data set consists of nine measurements made on 26 boys between the ages of 11 and 13.5 years, approximately 3 months apart [16].
Figure 3. Equation screen with estimates.
Figure 4.
Figure 4, produced by MLwiN, shows the mean heights by the mean age at each measurement occasion. We assume that growth can be represented by a polynomial function, whose coefficients vary from individual to individual. Other functions are possible, including fractional polynomials or non-linear functions, but for simplicity we confine ourselves to examining a fourth-order polynomial in age ($t$) centred at an origin of 12.25 years. In some applications of growth curve modelling transformations of the time scale may be useful, often
to orthogonal polynomials. In the present case the use of ordinary polynomials provides an accessible interpretation and does not lead to computational problems, for example due to near-collinearities. The model we fit can be written as follows:

$$y_{ij} = \sum_{h=0}^{4} \beta_{hj} t_{ij}^h + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}, \quad \beta_{1j} = \beta_1 + u_{1j}, \quad \beta_{2j} = \beta_2 + u_{2j}, \quad \beta_{3j} = \beta_3, \quad \beta_{4j} = \beta_4$$
$$\begin{pmatrix} u_0 \\ u_1 \\ u_2 \end{pmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{pmatrix} \sigma_{u0}^2 & & \\ \sigma_{u01} & \sigma_{u1}^2 & \\ \sigma_{u02} & \sigma_{u12} & \sigma_{u2}^2 \end{pmatrix}, \quad e \sim N(0, \sigma_e^2) \quad (6)$$

This is a two-level model with level 1 being ‘measurement occasion’ and level 2 ‘individual boy’. Note that we allow only the coefficients up to the second order to vary across individuals; in the present case this provides an acceptable fit. The level-1 residual term $e_{ij}$ represents the unexplained variation within individuals about each individual’s growth trajectory.

Table I shows the restricted maximum likelihood (REML) parameter estimates for this model. The log-likelihood is calculated for the ML estimates since this is preferable for purposes of model comparison [17]. From this table we can compute various features of growth. For example, the average growth rate (by differentiation) at age 13.25 years ($t = 1$) is $6.17 + 2 \times 1.13 + 3 \times 0.45 - 4 \times 0.38 = 8.26$ cm/year. A particular advantage of this formulation is that, for each boy, we can also estimate his random effects or ‘residuals’, $u_{0j}, u_{1j}, u_{2j}$, and use these to predict their growth curve at each age [18]. Figure 5, from MLwiN, shows these predicted curves (these can be produced in different colours on the screen). Goldstein et al. [16] show that growth over this period exhibits a seasonal pattern with growth in the summer being about 0.5 cm greater than growth in the winter. Since the period of the growth cycle is a year, this is modelled by including a simple cosine term, which could also have a random coefficient.

In our example we have a set of individuals all of whom have nine measurements. This restriction, however, is not necessary and (6) does not require either the same number of occasions per individual or that measurements are made at equal intervals, since time is modelled as a continuous function. In other words we can combine data from individuals with very different measurement patterns, some of whom may only have been measured once and some who have been measured several times at irregular intervals.
Table I. Height modelled as a fourth-degree polynomial on age. REML estimates.

Fixed effects     Estimate    Standard error
Intercept          149.0          1.57
t                    6.17         0.36
t^2                  1.13         0.35
t^3                  0.45         0.16
t^4                 -0.38         0.30

Random: level-2 (individual) correlation matrix, variances on diagonal

             Intercept      t       t^2
Intercept      64.0
t               0.61       2.86
t^2             0.22       0.66    0.67

Random: level-1 variance = 0.22. -2 log-likelihood (ML) = 625.4.
Figure 5.
This flexibility, first noted by Laird and Ware [10], means that the multilevel approach to fitting repeated measures data is to be preferred to previous methods based upon a multivariate formulation assuming a common set of fixed occasions [19, 20].

In these models it is assumed that the level-1 residual terms are independently distributed. We may relax this assumption, however, and in the case of repeated measures data this may be necessary, for example where measurements are taken very close together in time. Suppose we wish to fit a model that allows for correlations between the level-1 residuals, and to start with for simplicity let us assume that these correlations are all equal.
Table II. Height modelled as a fourth-degree polynomial on age, including a seasonal effect and serial correlation. REML estimates.

Fixed effects     Estimate    Standard error
Intercept          148.9
t                    6.19         0.36
t^2                  2.16         0.45
t^3                  0.39         0.17
t^4                 -1.55         0.43
cos(t)              -0.24         0.07

Random: level-2 (individual) correlation matrix, variances on diagonal

             Intercept      t       t^2
Intercept      63.9
t               0.61       2.78
t^2             0.24       0.69    0.59

Random: level 1 (SE in brackets): $\sigma_e^2$ = 0.24 (0.05), $\gamma$ = 6.59 (1.90). -2 log-likelihood (ML) = 611.5.

This is easily accomplished within the GLS step for the random parameters by modifying (5) to

$$\sigma_e^2 \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{pmatrix} + \lambda \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix} \quad (7)$$

so that the parameter $\lambda$ is the common level-1 covariance (between occasions). Goldstein et al. [16] show how to model quite general non-linear covariance functions and in particular those of the form $\mathrm{cov}(e_t, e_{t-s}) = \sigma_e^2 \exp(-g(\gamma, s))$, where $s$ is the time difference between occasions. This allows the correlation between occasions to vary smoothly as a function of their (continuous) time difference. A simple example is where $g = \gamma s$, which, in discrete time, produces an AR(1) model. The GLS step now involves non-linear estimation that is accomplished in a standard fashion using a Taylor series approximation within the overall iterative scheme. Pourahmadi [21, 22] considers similar models but restricted to a fixed set of discrete occasions.

Table II shows the results of fitting the model with $g = \gamma s$ together with a seasonal component. If this component has amplitude $\alpha$, say, we can write it in the form $\alpha \cos(t^* + \phi)$, where $t^*$ is measured from the start of the calendar year. Rewriting this in the form $\alpha_1 \cos(t^*) - \alpha_2 \sin(t^*)$ we can incorporate $\cos(t^*)$, $\sin(t^*)$ as two further predictor variables in the fixed part of the model. In the present case $\alpha_2$ is small and non-significant and is omitted. The results
show that for measurements made three months apart the serial correlation is estimated as 0.19 ($e^{-6.59/4}$) and as 0.04 ($e^{-6.59/2}$) for measurements taken at 6-monthly intervals. This suggests, therefore, that in practice, for such data, when the intervals are no less than 6 months apart serial correlation can be ignored, but should be fitted when intervals are as small as 3 months. This will be particularly important in highly unbalanced designs where there are some individuals with many measurements taken close together in time; ignoring serial correlation will give too much weight to the observations from such individuals. Finally, on this topic, there will typically need to be a trade-off between modelling more random coefficients at level 2 in order to simplify or eliminate a level-1 serial correlation structure, and modelling level 2 in a parsimonious fashion so that a relatively small number of random coefficients can be used to summarize each individual. An extreme example of the latter is given by Diggle [23] who fits only a random intercept at level 2 and serial correlation at level 1.
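As a hedged sketch of how the polynomial growth model (6) might be fitted outside MLwiN, the code below uses statsmodels with the intercept, linear and quadratic coefficients random across boys; the file and column names (boys_height.csv, height, age, boy) are hypothetical, and the level-1 serial correlation structure of equation (7) is not reproduced. The final lines recompute the average growth rate at age 13.25 by differentiating the fixed part, as in the text.

import pandas as pd
import statsmodels.formula.api as smf

growth = pd.read_csv("boys_height.csv")        # one row per measurement occasion
growth["t"] = growth["age"] - 12.25            # centre age at 12.25 years
for k in (2, 3, 4):
    growth[f"t{k}"] = growth["t"] ** k         # polynomial terms t^2, t^3, t^4

m = smf.mixedlm("height ~ t + t2 + t3 + t4", data=growth,
                groups=growth["boy"],
                re_formula="~ t + t2").fit(reml=True)
print(m.summary())

b = m.fe_params                                # fixed coefficients
rate = b["t"] + 2 * b["t2"] + 3 * b["t3"] + 4 * b["t4"]
print("average growth rate at age 13.25:", rate, "cm/year")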
7. MULTIVARIATE RESPONSE DATA

We shall use an extension of the model for repeated measures data to illustrate how to model multivariate response data. Consider model (6) where we have data on successive occasions for each individual and in addition, for some or all individuals, we have a measure, say, of their final adult height, $y_3^{(2)}$, and their (log) income at age 25 years, $y_4^{(2)}$, where the superscript denotes a measurement made at level 2. We can include these variables as further responses by extending (6) as follows:

$$y_{ij}^{(1)} = \sum_{h=0}^{4} \beta_{hj} t_{ij}^h + e_{ij}$$
$$\beta_{0j} = \beta_0 + u_{0j}, \quad \beta_{1j} = \beta_1 + u_{1j}, \quad \beta_{2j} = \beta_2 + u_{2j}, \quad \beta_{3j} = \beta_3, \quad \beta_{4j} = \beta_4$$
$$y_{3j}^{(2)} = \alpha_3 + u_{3j}, \quad y_{4j}^{(2)} = \alpha_4 + u_{4j}$$
$$\begin{pmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} \sim N(0, \Omega_u), \quad e \sim N(0, \sigma_e^2) \quad (8)$$
We now have a model where there are response variables defined at level 1 (with superscript (1)) and also at level 2 (with superscript (2)). For the level-2 variables we have specified only an intercept term in the fixed part, but quite general functions of individual level predictors, such as gender, are possible. The level-2 responses have no component of random variation at level 1 and their level-2 residuals covary with the polynomial random coefficients from the level-1 repeated measures response.

The results of fitting this model allow us to quantify the relationships between growth events, such as growth acceleration (differentiating twice) at $t = 0$, age 12.25 years, ($2\beta_{2j}$) and adult height, and also to use measurements taken during the growth period to make efficient predictions of adult height or income. We note that for individual $j$ the estimated (posterior) residuals $\hat{u}_{3j}, \hat{u}_{4j}$ are the best linear unbiased predictors of the individual’s adult values; where we have only a set of growth period measurements for an individual these therefore provide the required estimates. Given the set of model parameters, therefore, we immediately obtain a system for efficient adult measurement prediction given a set of growth measurements [24].

Suppose, now, that we have no growth period measurements and just the two adult measurements for each individual. Model (8) reduces to

$$y_{3j}^{(2)} = \alpha_3 + u_{3j}, \quad y_{4j}^{(2)} = \alpha_4 + u_{4j}$$
$$\begin{pmatrix} u_3 \\ u_4 \end{pmatrix} \sim N(0, \Omega_u), \quad V_{1j} = 0, \quad V_{2j} = \begin{pmatrix} \sigma_{u3}^2 & \\ \sigma_{u34} & \sigma_{u4}^2 \end{pmatrix} \quad (9)$$

Thus we can think of this as a two-level model with no level-1 variation and every level-2 unit containing just two level-1 units. The explanatory variables for the simple model given by (9) are just two dummy variables defining, alternately, the two responses. Thus we can write (9) in the more compact general form

$$y_{ij} = \sum_{h=1}^{2} \beta_{0hj} x_{hij}, \quad x_{1ij} = \begin{cases} 1 & \text{if response 1} \\ 0 & \text{if response 2} \end{cases}, \quad x_{2ij} = 1 - x_{1ij}$$
$$\beta_{0hj} = \beta_{0h} + u_{hj}, \quad \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim N(0, \Omega_u), \quad \Omega_u = \begin{pmatrix} \sigma_{u1}^2 & \\ \sigma_{u12} & \sigma_{u2}^2 \end{pmatrix} \quad (10)$$

Note that there is no need for every individual to have both responses and, so long as we can consider ‘missing’ responses as random, the IGLS algorithm will supply maximum likelihood estimates. We can add further covariates to the model in a straightforward manner by forming interactions between them and the dummy variables defining the separate response intercepts.

The ability to fit a multivariate linear model with randomly missing responses finds a number of applications, for example where matrix or rotation designs are involved (reference [18], Chapter 4), each unit being allocated, at random, a subset of responses. The possibility of having additionally level-1 responses allows this to be used as a very general model for meta-analysis where there are several studies (level-2 units) for some of which responses are available only in summary form at level 2 and for others detailed level-1 responses are available. Goldstein et al. [25] provide a detailed example.
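A small sketch of the data layout behind model (10) may help: the two adult measurements are stacked into ‘long’ form, one row per response per individual, with dummy variables picking out each response intercept. The file and column names (adult_outcomes.csv, id, adult_height, log_income) are hypothetical; a multilevel package such as MLwiN would then fit the stacked records as a two-level model with individuals at level 2 and correlated random intercepts for the two responses.

import pandas as pd

adult = pd.read_csv("adult_outcomes.csv")       # columns: id, adult_height, log_income
long = adult.melt(id_vars="id",
                  value_vars=["adult_height", "log_income"],
                  var_name="response", value_name="y")
long["x1"] = (long["response"] == "adult_height").astype(int)  # dummy for response 1
long["x2"] = 1 - long["x1"]                                     # dummy for response 2
print(long.head())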
8. CROSS-CLASSIFIED AND MULTIPLE MEMBERSHIP STRUCTURES

Across a wide range of disciplines it is commonly the case that data have a structure that is not purely hierarchical. Individuals may be clustered not only into hierarchically ordered units (for example occasions nested within patients nested within clinics), but may also belong to more than one type of unit at a given level of a hierarchy. Consider the example of a livestock animal such as a cow where there are a large number of mothers, each producing several female offspring that are eventually reared for milk on different farms. Thus, an offspring might be classified as belonging to a particular combination of mother and farm, in which case they will be identified by a cross-classification of these.

Raudenbush [26] and Rasbash and Goldstein [27] present the general structure of a model for handling complex hierarchical structuring with random cross-classifications. For example, assuming that we wish to formulate a linear model for the milk yield of offspring taking into account both the mother and the farm, then we have a cross-classified structure, which can be modelled as follows:

$$y_{i(j_1, j_2)} = (X\beta)_{i(j_1, j_2)} + u_{j_1} + u_{j_2} + e_{i(j_1 j_2)}$$
$$j_1 = 1, \ldots, J_1; \quad j_2 = 1, \ldots, J_2; \quad i = 1, \ldots, N \quad (11)$$

in which the yield of offspring $i$, belonging to the combination of mother $j_1$ and farm $j_2$, is predicted by a set of fixed coefficients $(X\beta)_{i(j_1, j_2)}$. The random part of the model is given by two level-2 residual terms, one for the mother ($u_{j_1}$) and one for the farm ($u_{j_2}$), together with the usual level-1 residual term for each offspring. Decomposing the variation in such a fashion allows us to see how much of it is due to the different classifications. This particular example is somewhat oversimplified, since we have ignored paternity and we would also wish to include factors such as age of mother, parity of offspring etc. An application of this kind of modelling to a more complex structure involving Salmonella infection in chickens is given by Rasbash and Browne [28].

Considering now just the farms, and ignoring the mothers, suppose that the offspring often change farms, some not at all and some several times. Suppose also that we know, for each offspring, the weight $w_{ij_2}$, associated with the $j_2$th farm for offspring $i$, with $\sum_{j_2=1}^{J_2} w_{ij_2} = 1$. These weights, for example, may be proportional to the length of time an offspring stays in a particular farm during the course of our study. Note that we allow the possibility that for some (perhaps most) animals only one farm is involved so that one of these probabilities is one and the remainder are zero. Note that when all level-1 units have a single non-zero weight of 1 we obtain the usual purely hierarchical model. We can write for the special case of membership of up to two farms $\{1, 2\}$:

$$y_{i(1,2)} = (X\beta)_{i(1,2)} + w_{i1} u_1 + w_{i2} u_2 + e_{i(1,2)}, \quad w_{i1} + w_{i2} = 1 \quad (12)$$
and more generally

$$y_{i\{j\}} = (X\beta)_{i\{j\}} + \sum_{h \in \{j\}} w_{ih} u_h + e_{i\{j\}}$$
$$\sum_h w_{ih} = 1, \quad \mathrm{var}(u_h) = \sigma_u^2, \quad \mathrm{var}\Big(\sum_h w_{ih} u_h\Big) = \sigma_u^2 \sum_h w_{ih}^2 \quad (13)$$

Thus, in the particular case of membership of just two farms with equal weights we have

$$w_{i1} = w_{i2} = 0.5, \quad \mathrm{var}\Big(\sum_h w_{ih} u_h\Big) = \sigma_u^2 / 2$$
Further details of this model are given by Hill and Goldstein [29]. An extension of the multiple membership model is also possible and has important applications, for example in modelling spatial data. In this case we can write

$$y_{i\{j_1\}\{j_2\}} = (X\beta)_{i\{j\}} + \sum_{h \in \{j_1\}} w_{1ih} u_{1h} + \sum_{h \in \{j_2\}} w_{2ih} u_{2h} + e_{i\{j\}}$$
$$\sum_h w_{1ih} = W_1, \quad \sum_h w_{2ih} = W_2, \quad j = \{j_1, j_2\}$$
$$\mathrm{var}(u_{1h}) = \sigma_{u1}^2, \quad \mathrm{var}(u_{2h}) = \sigma_{u2}^2, \quad \mathrm{cov}(u_{1h}, u_{2h}) = \sigma_{u12} \quad (14)$$
There are now two sets of higher level units that influence the response and in general we can have more than two such sets. In spatial models one of these sets is commonly taken to be the area where an individual (level-1) unit occurs and so does not have a multiple membership structure (since each individual belongs to just one area, that is we replace $\sum_h w_{1ih} u_{1h}$ by $u_{1j_1}$). The other set consists of those neighbouring units that are assumed to have an effect. The weights will need to be carefully chosen; in spatial models $W_2$ is typically chosen to be equal to 1 (see Langford et al. [30] for an example). Another application for a model such as (14) is for household data where households share facilities, for example an address. In this case the household that an individual resides in will belong to one set and the other households at the address will belong to the other set. Goldstein et al. [31] give an application of this model to complex household data.
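A hedged sketch of the cross-classified model (11) for the milk yield example: statsmodels can approximate crossed random effects by treating the whole data set as a single group and declaring each classification as a variance component. The file and column names (milk_yield.csv, yield_kg, age, mother, farm) are hypothetical, and MLwiN or MCMC methods would normally be preferred for models of this kind.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cows = pd.read_csv("milk_yield.csv")
vc = {"mother": "0 + C(mother)", "farm": "0 + C(farm)"}

m = smf.mixedlm("yield_kg ~ age", data=cows,
                groups=np.ones(len(cows)),   # one 'super-group' holds everyone
                re_formula="0",              # no ordinary group random intercept
                vc_formula=vc).fit()
print(m.summary())                           # variance attributed to mothers vs farms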
9. META-ANALYSIS

Meta-analysis involves the pooling of information across studies in order to provide both greater efficiency for estimating treatment effects and also for investigating why treatment effects may vary. By formulating a general multilevel model we can do both of these efficiently within a single-model framework, as has already been indicated and was suggested by several authors [32, 33]. In addition we can combine data that are provided at either individual subject level or aggregate level or both. We shall look at a simple case but this generalizes readily [25].
Consider an underlying model for individual level data where a pair of treatments are being compared and results from a number of studies or centres are available. We write a basic model, with a continuous response $Y$, as

$$y_{ij} = (X\beta)_{ij} + \beta_2 t_{ij} + u_j + e_{ij}$$
$$\mathrm{var}(u_j) = \sigma_u^2, \quad \mathrm{var}(e_{ij}) = \sigma_e^2 \quad (15)$$

with the usual assumptions of Normality etc. The covariate function is designed to adjust for initial clinic and subject conditions. The term $t_{ij}$ is a dummy variable defining the treatment (0 for treatment A, 1 for treatment B). The random effect $u_j$ is a study effect and the $e_{ij}$ are individual-level residuals. Clearly this model can be elaborated in a number of ways, by including random coefficients at level 2 so that the effect of treatment varies across studies, and by allowing the level-1 variance to depend on other factors such as gender or age.

Suppose now that we do not have individual data available but only means at the study level. If we average (15) to the study level we obtain

$$y_{\cdot j} = (X\beta)_{\cdot j} + \beta_2 t_{\cdot j} + u_j + e_{\cdot j} \quad (16)$$

where $y_{\cdot j}$ is the mean response for the $j$th study etc. The total residual variance for study $j$ in this model is $\sigma_u^2 + \sigma_e^2 / n_j$ where $n_j$ is the size of the $j$th study. It is worth noting at this point that we are ignoring, for simplicity, levels of variation that might exist within studies, such as that between sites for a multi-site study. If we have the values of $y_{\cdot j}$, $(X)_{\cdot j}$, $t_{\cdot j}$, where the latter is simply the proportion of subjects with treatment B in the $j$th study, and also the value of $n_j$, then we will be able to obtain estimates for the model parameters, so long as the $n_j$ differ. Such estimates, however, may not be very precise and extra information, especially about the value of $\sigma_e^2$, will improve them.

Model (16) therefore forms the basis for the multilevel modelling of aggregate level data. In practice the results of studies will often be reported in non-standard form, for example with no estimate of $\sigma_e^2$, but it may be possible to estimate this from reported test statistics. In some cases, however, the reporting may be such that the study cannot be incorporated in a model such as (16). Goldstein et al. [25] set out a set of minimum reporting conventions for meta-analysis studies subsequently to be carried out.

While it is possible to perform a meta-analysis with only aggregate level data, it is clearly more efficient to utilize individual-level data where these are available. In general, therefore, we will need to consider models that have mixtures of individual and aggregate data, even perhaps within the same study. We can do this straightforwardly by specifying a model which is just the combination of (15) and (16), namely

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 t_{ij} + u_j + e_{ij}$$
$$y_{\cdot j} = \beta_0 + \beta_1 x_{\cdot j} + \beta_2 t_{\cdot j} + u_j + e_j z_j$$
$$z_j = \sqrt{n_j^{-1}}, \quad e_j \equiv e_{ij} \quad (17)$$

What we see is that the common level-1 and level-2 random terms link together the separate models and allow a joint analysis that makes fully efficient use of the data. Several issues immediately arise from (17). One is that the same covariates are involved. This is also a requirement for the separate models. If some covariate values are missing at either level then it is possible to use an imputation technique to obtain estimates, assuming a suitable random missingness mechanism.
The paper by Goldstein et al. [25] discusses generalizations of (17) for several treatments and the procedure can be extended to generalized linear models.
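As a minimal numerical illustration of the aggregate-level situation behind model (16), the sketch below pools hypothetical study-level treatment effects with a random-effects model, estimating the between-study variance by the DerSimonian-Laird moment method. This is a simplification: the tutorial fits (16)/(17) as a multilevel model by ML, REML or MCMC, which also accommodates studies contributing individual-level data. The input numbers are invented purely for illustration.

import numpy as np

effect = np.array([0.30, 0.12, 0.45, 0.26, 0.08])    # per-study treatment effects (hypothetical)
se     = np.array([0.12, 0.15, 0.20, 0.10, 0.18])    # their standard errors (hypothetical)

w_fixed = 1.0 / se**2
theta_fixed = np.sum(w_fixed * effect) / np.sum(w_fixed)
Q = np.sum(w_fixed * (effect - theta_fixed) ** 2)     # heterogeneity statistic
k = len(effect)
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - (k - 1)) / c)                    # between-study variance estimate

w_rand = 1.0 / (se**2 + tau2)
theta_rand = np.sum(w_rand * effect) / np.sum(w_rand)
se_rand = np.sqrt(1.0 / np.sum(w_rand))
print(f"pooled effect {theta_rand:.3f} (SE {se_rand:.3f}), tau^2 = {tau2:.3f}")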
10. GENERALIZED LINEAR MODELS

So far we have dealt with linear models, but all of those so far discussed can be modified using non-linear link functions to give generalized linear multilevel models. We shall not discuss these in detail (see Goldstein [18] for details and some applications) but for illustration we will describe a two-level model with a binary response. Suppose the outcome of patients in intensive care is recorded simply as survived (0) or died (1) within 24 hours of admission. Given a sample of patients from a sample of intensive care units we can write one model for the probability of survival as

$$\mathrm{logit}(\pi_{ij}) = (X\beta)_{ij} + u_j, \quad y_{ij} \sim \mathrm{Bin}(\pi_{ij}, 1) \quad (18)$$
Equation (18) uses a standard logit link function assuming a binomial (Bernoulli) error distribution for the (0,1) response $y_{ij}$. The level-2 random variation is described by the term $u_j$ within the linear predictor. The general interpretation is similar to that for a continuous response model, except that the level-1 variation is now a function of the predicted value $\pi_{ij}$. While in (18) there is no separate estimate for the level-1 variance, we may wish to fit extra-binomial variation which will involve a further parameter.

We can modify (18) using alternative link functions, for example the logarithm if the response is a count, and can allow further random coefficients at level 2. The response can be a multi-category variable, either ordered or unordered, and this provides an analogue to the multivariate models for continuously distributed responses. As an example of an ordered categorical response, consider extending the previous outcome to three categories: survival without impairment (1); survival with impairment (2); death (3). If $\pi_{ij}^{(h)}$, $h = 1, 2, 3$, are, respectively, the probabilities for each of these categories, we can write a proportional odds model using a logit link as

$$\mathrm{logit}(\pi_{ij}^{(1)}) = \alpha^{(1)} + (X\beta)_{ij} + u_j^{(1)}$$
$$\mathrm{logit}(\pi_{ij}^{(1)} + \pi_{ij}^{(2)}) = \alpha^{(2)} + (X\beta)_{ij} + u_j^{(2)} \quad (19)$$
and the set of three (0,1) observed responses for each patient is assumed to have a multinomial distribution with mean vector given by π_ij^(h), h = 1, 2, 3. Since the probabilities add to 1, we require only two lines in (19), which differ in the overall level given by the intercept terms and in allowing these levels to vary across units. Unlike in the continuous Normal response case, maximum likelihood estimation is not straightforward, and beyond fairly simple two-level models it involves a considerable computational load, typically using numerical integration procedures [34]. For this reason approximate methods have been developed based upon series expansions and quasi-likelihood approaches [18]; these perform well under a wide range of circumstances but can break down in certain conditions, especially when data are sparse with binary responses. High-order Laplace
approximations have been found to perform well [35], as have simulation-based procedures such as MCMC (see below). It is worth mentioning one particular situation where care is required in using generalized linear multilevel models. In (18) and (19) we assume that the level-1 responses are independently distributed with probabilities given by the specified model. In some circumstances such an assumption may not be sensible. Consider a repeated measures model where the health status (satisfactory/not satisfactory) of individuals is repeatedly assessed. Some individuals will typically always respond 'satisfactory' whereas others can be expected to respond always 'not satisfactory'. For these individuals the underlying probabilities are either zero or one, which violates the model assumptions, and what one finds when fitting a model with non-negligible numbers of such individuals are noticeable amounts of underdispersion. Barbosa and Goldstein [36] discuss this problem and propose a solution based upon fitting a serial correlation structure. We can also have multivariate multilevel models with mixtures of discrete and continuous responses. Certain of these can be fitted in MLwiN using quasi-likelihood procedures [12], and MCMC procedures for such models are currently being implemented. See also Olsen and Schafer [37] for an alternative approach.
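As a small illustration of model (18), the following sketch simulates binary outcomes for patients clustered within intensive care units, with a Normally distributed random intercept u_j on the logit scale; the number of units, the cluster sizes and the parameter values are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_per_unit = 50, 40
sigma_u = 0.6                                      # between-unit standard deviation on the logit scale
beta0, beta1 = -1.5, 0.8                           # intercept and effect of a hypothetical severity score x

u = rng.normal(0.0, sigma_u, n_units)              # level-2 random effects u_j
unit = np.repeat(np.arange(n_units), n_per_unit)   # unit identifier for each patient
x = rng.normal(0.0, 1.0, n_units * n_per_unit)     # level-1 covariate
eta = beta0 + beta1 * x + u[unit]                  # linear predictor (X beta)_ij + u_j
pi = 1.0 / (1.0 + np.exp(-eta))                    # inverse logit gives pi_ij
y = rng.binomial(1, pi)                            # y_ij ~ Bin(pi_ij, 1)

# observed outcome rates differ across units even at similar x, reflecting u_j
print(np.round([y[unit == j].mean() for j in range(5)], 2))

Fitting such a model by quasi-likelihood, high-order Laplace or MCMC methods then amounts to recovering beta0, beta1 and sigma_u from data of this form.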
11. SURVIVAL MODELS

Several different formulations for survival data modelling are available; to illustrate how these can be extended to multilevel structures, where they are often referred to as frailty models [38], we consider the proportional hazards (Cox) model and a piecewise discrete time model. Goldstein [18] gives other examples. Consider a simple two-level model with, say, patients within hospitals or occasions within subjects. As in the standard single-level case we consider each time point in the data as defining a block, indicated by l, at which some observations come to the end of their duration due to either failure or censoring and some remain to the next time or block. At each block there is therefore a set of observations – the total risk set. To illustrate how the model is set up we can think of the data sorted so that each observation within a block is a level-1 unit, above which, in the repeated measures case, there are occasions at level 2 and subjects at level 3. The ratio of the hazard for the unit which experiences a failure at a given occasion, referred to by (j, k), to the sum of the hazards of the remaining risk set units [39] is

exp(β1 x_1ijk + u_jk) / Σ_{j,k} exp(β1 x_1ijk + u_jk)
(20)
where j and k refer to the real levels 2 and 3, for example occasion and subject. At each block denoted by l the response variable may be defined for each member of the risk set as

y_ijk(l) = 1 if the unit fails, 0 if not
Because of the equivalence between the likelihoods for the multinomial and Poisson distributions, the latter is used to fit model (20). This can be written as

E(y_ijk(l)) = exp(α_l + X_jk β_k)
(21)
Where there are ties within a block, more than one response will be non-zero. The terms α_l fit the underlying hazard function as a 'blocking factor' and can be estimated by fitting either a set of parameters, one for each block, or a smoothed polynomial curve over the blocks numbered 1, ..., p. Thus, if the hth block is denoted by h, α_l is replaced by a low-order polynomial of order m, Σ_{t=0}^{m} α_t h^t, where the α_t are (nuisance) parameters to be estimated. Having set up this model, the data are now sorted into the real two-level structure, for example in the repeated measures case by failure times within subjects with occasions within the failure times. This retains proportional hazards within subjects. In this formulation the Poisson variation is defined at level 1, there is no variation at level 2 and the between-subject variation is at level 3. Alternatively we may wish to preserve overall proportionality, in which case the failure times define level 3 with no variation at that level. See Goldstein [18] for a further discussion of this. Consider now a piecewise survival model. Here the total time interval is divided into short intervals during which the probability of failure, given survival up to that point, is assumed constant. Denote these intervals by t (1, 2, ..., T), so that the hazard at time t is the probability that, given survival up to the end of interval t − 1, failure occurs in the next interval. At the start of each interval we have a 'risk set' of n_t subjects consisting of the survivors, and during the interval r_t of them fail. If censoring occurs during interval t then that observation is removed from that interval (and subsequent ones) and does not form part of the risk set. A simple, single-level model for the probability can be written as

π_i(t) = f[α_t z_it, (Xβ)_it]
(22)
where z_t = {z_it} is a dummy variable for the tth interval and α_t, as before, is a 'blocking factor' defining the underlying hazard function at time t. The second term is a function of covariates. A common formulation would be the logit model, and a simple such model in which the first blocking factor has been absorbed into the intercept term could be written as

logit(π_i(t)) = β0 + α_t z_it + β1 x_1i,   (z_2, z_3, ..., z_T)
(23)
Since the covariates vary across individuals, in general the data matrix will consist of one record for each individual within each interval, with a (0,1) response indicating survival or failure. The model can now be fitted using standard procedures, assuming a binomial error distribution. As before, instead of fitting T − 1 blocking factors we can fit a low-order polynomial to the sequentially numbered time indicator. The logit function can be replaced by, for example, the complementary log-log function, which gives a proportional hazards model, or, say, the probit function, and we can also incorporate time-varying covariates such as age. For the extension to a two-level model we write

logit(π_ij(t)) = β0 + Σ_{h=1}^{p} α_h* (z_it*)^h + β1 x_1ij + u_j
(24)
where u_j is the 'effect', for example, of the jth clinic, and is typically assumed to be Normally distributed with zero mean and variance σ_u². We can elaborate (24) using random coefficients, resulting in a heterogeneous variance structure, further levels of nesting, etc. This is just a two-level binary response model and can be fitted as described earlier. The data structure has two levels, so individuals will be grouped (sorted) within clinics. For a competing risks model with more than one outcome we can use the two-level formulation for a multi-category response described above. The model can also be used with repeated measures data where there are repeated survival times within individuals, for example multiple pregnancy states.
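The piecewise formulation (23)–(24) lends itself to a simple 'person-period' data expansion: each subject contributes one record per interval survived, with a (0,1) failure indicator, after which an ordinary (multilevel) logistic model can be fitted. The sketch below illustrates the single-level case; the simulated hazard, the interval grid and the use of the Python package statsmodels are assumptions made for illustration only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n, T = 300, 8
x1 = rng.normal(0.0, 1.0, n)

def hazard(t, x):
    # discrete-time hazard on the logit scale, rising slowly with interval number
    return 1.0 / (1.0 + np.exp(-(-2.0 + 0.15 * t + 0.5 * x)))

rows = []
for i in range(n):
    for t in range(1, T + 1):
        fail = rng.random() < hazard(t, x1[i])
        rows.append({"id": i, "interval": t, "x1": x1[i], "fail": int(fail)})
        if fail:
            break                                   # no further records after failure

pp = pd.DataFrame(rows)                             # person-period data: one record per interval at risk

# model (23) with the blocking factors fitted as a factor of the interval number
fit = smf.logit("fail ~ C(interval) + x1", data=pp).fit(disp=0)
print(fit.params.round(3))

Adding a clinic-level random intercept u_j to this expanded data set, as in (24), turns it into the two-level binary response model described earlier.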
12. BAYESIAN MODELLING

So far we have considered the classical approach to fitting multilevel models. If we add prior distributional assumptions for the parameters of the models considered so far, we can fit the same range of models from a Bayesian perspective, and in most applications this will be based upon MCMC methods. A detailed comparison of Bayesian and likelihood procedures for fitting multilevel models is given in Browne and Draper [40]. A particular advantage of MCMC methods is that they yield inferences based upon samples from the full posterior distribution and allow exact inference in cases where, as mentioned above, the likelihood-based methods yield approximations. Because they generate a sample of points from the full posterior distribution, they can also give accurate interval estimates for non-Gaussian parameter distributions. In MCMC sampling we are interested in generating samples of values from the joint posterior distribution of all the unknown parameters rather than finding the maximum of this distribution. Generally it is not possible to generate directly from this joint distribution, so instead the parameters are split into groups and for each group in turn we generate a set of values from its conditional posterior distribution. This can be shown to be equivalent to sampling directly from the joint posterior distribution. There are two main MCMC procedures used in practice: Gibbs sampling [41] and Metropolis–Hastings (MH) sampling [42, 43]. When the conditional posterior for a group of parameters has a standard form, for example a Normal distribution, then we can generate values from it directly; this is known as Gibbs sampling. When the distribution is not of standard form then it may still be possible to use Gibbs sampling by constructing the distribution using techniques such as adaptive rejection sampling [44]. The alternative approach is MH sampling, where values are generated from another distribution, called a proposal distribution, rather than from the conditional posterior distribution. These values are then either accepted or rejected in favour of the current values by comparing the joint posterior probabilities at the current and proposed new values. The acceptance rule is designed so that MH is effectively sampling from the conditional posterior even though an arbitrary proposal distribution has been used; nevertheless, the choice of proposal distribution is important for the efficiency of the algorithm. MCMC algorithms produce chains of serially correlated parameter estimates and consequently often have to be run for many iterations to get accurate estimates. Many diagnostics are available to gauge approximately how long to run the MCMC methods. The chains are also started from arbitrary parameter values, so it is common practice to ignore the
first N iterations (known as a burn-in period) to allow the chains to move away from the starting values and settle at the parameters' equilibrium distribution. We give here an outline of the Gibbs sampling procedure for fitting a general Normal two-level model

y_ij = X_ij β + Z_ij u_j + e_ij
u_j ∼ N(0, Ω_u),   e_ij ∼ N(0, σ_e²),   i = 1, ..., n_j,   j = 1, ..., J,   Σ_j n_j = N
We will include generic conjugate prior distributions for the fixed effects and variance parameters as follows:

β ∼ N(μ_p, S_p),   Ω_u ∼ W⁻¹(ν_u, S_u),   σ_e² ∼ SI χ²(ν_e, s_e²)

The Gibbs sampling algorithm then involves simulating from the following four sets of conditional distributions:

p(β | y, σ_e², u) ∼ N( D̂ [ (1/σ_e²) Σ_{j=1}^{J} Σ_{i=1}^{n_j} X_ij^T (y_ij − Z_ij u_j) + S_p⁻¹ μ_p ], D̂ )

p(u_j | y, Ω_u, σ_e², β) ∼ N( (D̂_j/σ_e²) Σ_{i=1}^{n_j} Z_ij^T (y_ij − X_ij β), D̂_j )

p(Ω_u | u) ∼ W⁻¹( J + ν_u, Σ_{j=1}^{J} u_j u_j^T + S_u )

p(σ_e² | y, β, u) ∼ Γ⁻¹( (N + ν_e)/2, ½ [ ν_e s_e² + Σ_{j=1}^{J} Σ_{i=1}^{n_j} e_ij² ] )

where

D̂ = [ (1/σ_e²) Σ_{ij} X_ij^T X_ij + S_p⁻¹ ]⁻¹

and

D̂_j = [ (1/σ_e²) Σ_{i=1}^{n_j} Z_ij^T Z_ij + Ω_u⁻¹ ]⁻¹
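The following sketch implements a stripped-down version of this Gibbs sampler for the special case of a random intercept only (Z_ij = 1), so that Ω_u reduces to a scalar σ_u²; the inverse-Gamma priors on the two variances, the diffuse Normal prior on β and all tuning constants are illustrative assumptions, and the code is not the MLwiN implementation.

import numpy as np

def gibbs_two_level(y, X, group, n_iter=5000, burn_in=500, seed=1):
    """Gibbs sampler for y_ij = x_ij' beta + u_j + e_ij, with group coded 0..J-1."""
    rng = np.random.default_rng(seed)
    y, X, group = np.asarray(y, float), np.asarray(X, float), np.asarray(group)
    J, N, p = group.max() + 1, len(y), X.shape[1]
    beta, u = np.zeros(p), np.zeros(J)
    sig2_u, sig2_e = 1.0, 1.0
    Sp_inv = np.eye(p) * 1e-6                       # very diffuse N(0, 10^6 I) prior for beta
    a_u = b_u = a_e = b_e = 0.001                   # weak inverse-Gamma priors for the variances
    draws = {"beta": [], "sig2_u": [], "sig2_e": []}
    for it in range(n_iter):
        # 1. beta | rest ~ N(Dhat m, Dhat)
        Dhat = np.linalg.inv(X.T @ X / sig2_e + Sp_inv)
        m = Dhat @ (X.T @ (y - u[group]) / sig2_e)
        beta = rng.multivariate_normal(m, Dhat)
        # 2. u_j | rest ~ N(dhat_j * sum_i r_ij / sig2_e, dhat_j)
        r = y - X @ beta
        for j in range(J):
            idx = group == j
            dhat = 1.0 / (idx.sum() / sig2_e + 1.0 / sig2_u)
            u[j] = rng.normal(dhat * r[idx].sum() / sig2_e, np.sqrt(dhat))
        # 3. sig2_u | u ~ inverse-Gamma(a_u + J/2, b_u + sum_j u_j^2 / 2)
        sig2_u = 1.0 / rng.gamma(a_u + J / 2.0, 1.0 / (b_u + 0.5 * np.sum(u ** 2)))
        # 4. sig2_e | rest ~ inverse-Gamma(a_e + N/2, b_e + sum_ij e_ij^2 / 2)
        e = y - X @ beta - u[group]
        sig2_e = 1.0 / rng.gamma(a_e + N / 2.0, 1.0 / (b_e + 0.5 * np.sum(e ** 2)))
        if it >= burn_in:
            draws["beta"].append(beta.copy())
            draws["sig2_u"].append(sig2_u)
            draws["sig2_e"].append(sig2_e)
    return {k: np.asarray(v) for k, v in draws.items()}

Posterior means, modes and quantiles are then computed from the stored chains.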
Note that in this algorithm we have used generic prior distributions. This allows the incorporation of informative prior information, but generally we will not have such information and so will use so-called 'diffuse' prior distributions that reflect our lack of knowledge.
Table III. Height modelled as a fourth-degree polynomial on age using MCMC sampling.

Fixed effects    MCMC estimate    Standard error
Intercept        149.2            1.58
t                6.21             0.38
t²               1.14             0.36
t³               0.45             0.17
t⁴               −0.38            0.30

Random: level-2 (individual) correlation matrix, variances on diagonal, estimates are mean (mode)
             Intercept       t               t²
Intercept    74.4 (68.6)
t            0.61            3.34 (3.06)
t²           0.22            0.66            0.67 (0.70)

Random: level-1 variance = 0.23 (0.22).
Since there are only 26 level-2 units from which we are estimating a 3 × 3 covariance matrix, the exact choice of prior is important. We here use the following set of priors for the child growth example considered in Section 6:

p(β0) ∝ 1,  p(β1) ∝ 1,  p(β2) ∝ 1,  p(β3) ∝ 1,  p(β4) ∝ 1,  p(β5) ∝ 1

p(Ω_u) ∼ inverse Wishart_3 [3, 3 × S_u],   S_u = ( 64.0
                                                   8.32  2.86
                                                   1.42  0.92  0.67 )

p(σ_e0²) ∼ inverse Gamma(0.001, 0.001)

The inverse Wishart prior matrix is based upon the REML estimates, chosen to be 'minimally informative', with degrees of freedom equal to the order of the covariance matrix. The results of running model (5) for 50,000 iterations after a burn-in of 500 can be seen in Table III. Here we see that the fixed effect estimates are very similar to the estimates obtained by the maximum likelihood method. The variance (chain mean) estimates are, however, inflated owing to the skewness of the variance parameters. Modal estimates of the variance parameters, apart from that for the quadratic coefficient, are closer, as is to be expected. If we had used the uniform prior p(Ω_u) ∝ 1 for the covariance matrix, the estimates of the fixed coefficients are little changed but the covariance matrix estimates are noticeably different. For example, the variance associated with the intercept is now 93.0 and those for the linear and quadratic coefficients become 4.2 and 1.1, respectively. Figure 6 shows the diagnostic screen produced by MLwiN following this MCMC run. The top left-hand box shows the trace for this parameter, the level-2 intercept variance. This looks satisfactory, and this is confirmed by the estimated autocorrelation function (ACF) and partial autocorrelation function (PACF) below. A kernel density estimate is given at the top right and
Figure 6. MCMC summary screen.
the bottom left box is a plot of the Monte Carlo standard error against the number of iterations in the chain. The summary statistics give quantiles, mean and mode together with accuracy diagnostics that indicate the required chain length. The MCMC methods are particularly useful in models like the cross-classified and multiple membership models discussed in Section 8. This is because, whereas the maximum likelihood methods involve constructing large constrained variance matrices for these models, the MCMC methods simulate from conditional distributions in turn and so do not have to adjust to the structure of the model. For model fitting, one strategy (but not the only one) is to use the maximum or quasi-likelihood methods for model exploration and selection, because of their speed. MCMC methods could then be used for inference on the final model to obtain accurate interval estimates.
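As a rough illustration of the kind of diagnostics summarized in Figure 6, the short sketch below computes lag-k autocorrelations and a batch-means Monte Carlo standard error for a single stored parameter chain; the chain here is simulated, and both the batch size and the toy autocorrelation structure are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(3)
# a toy serially correlated chain standing in for, say, the level-2 intercept variance
chain = np.empty(6000)
chain[0] = 0.0
for i in range(1, 6000):
    chain[i] = 0.9 * chain[i - 1] + rng.normal(0.0, 0.1)
chain = chain[500:]                                  # discard the burn-in

def acf(x, k):
    x = x - x.mean()
    return float(np.dot(x[:-k], x[k:]) / np.dot(x, x))

def batch_means_mcse(x, n_batches=50):
    b = len(x) // n_batches
    means = x[: b * n_batches].reshape(n_batches, b).mean(axis=1)
    return float(means.std(ddof=1) / np.sqrt(n_batches))

print([round(acf(chain, k), 2) for k in (1, 5, 10)], round(batch_means_mcse(chain), 4))

Large autocorrelations, or a Monte Carlo standard error that is still falling appreciably with chain length, both indicate that a longer run is needed.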
13. BOOTSTRAPPING

Like MCMC, the bootstrap allows inferences to be based upon (independent) chains of values and can be used to provide exact inferences and corrections for bias. Two forms of bootstrapping have been studied to date: parametric bootstrapping, especially for correcting biases in generalized linear models [45], and non-parametric bootstrapping based upon estimated residuals [46]. The fully parametric bootstrap for a multilevel model works as for a single-level model, with simulated values generated from the estimated covariance matrices at each level. Thus, for example, for model (2) each bootstrap sample is created using a set of u_j and e_ij sampled from N(0, σ_u²) and N(0, σ_e²), respectively.
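A minimal sketch of this parametric bootstrap for a two-level random-intercept model is given below: new sets of u_j and e_ij are drawn from the estimated variances and a bootstrap response is built from the fitted fixed part. The variance estimates and variable names are assumptions, and the refitting of the model to each replicate (from which bias corrections are obtained) is left as a placeholder.

import numpy as np

def parametric_bootstrap(X, group, beta_hat, sig2_u_hat, sig2_e_hat, n_boot=200, seed=11):
    rng = np.random.default_rng(seed)
    J = group.max() + 1
    replicates = []
    for _ in range(n_boot):
        u_star = rng.normal(0.0, np.sqrt(sig2_u_hat), J)             # level-2 draws
        e_star = rng.normal(0.0, np.sqrt(sig2_e_hat), len(group))    # level-1 draws
        y_star = X @ beta_hat + u_star[group] + e_star               # bootstrap response
        replicates.append(y_star)                                    # refit the model to y_star here
    return replicates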
In the non-parametric case, full 'unit resampling' is generally only possible by resampling units at the highest level. For generalized linear models, however, we can resample posterior residuals once they have been adjusted to have the correct (estimated) covariance structures, and this can be shown to possess particular advantages over a fully parametric bootstrap where asymmetric distributions are involved [46].

14. IN CONCLUSION

The models that we have described in this paper represent a powerful set of tools available to the data analyst for exploring complex data structures. They are being used in many areas, including health, with great success in providing insights that are unavailable with more conventional methods. There is a growing literature extending these models, for example to multilevel structural equation models, and especially to the application of the multiple membership models in areas such as population demography [31]. An active e-mail discussion group exists which welcomes new members (www.jiscmail.ac.uk/multilevel). The data set used for illustration in this paper is available on the Centre for Multilevel Modelling website (multilevel.ioe.ac.uk/download/macros.html) as an MLwiN worksheet.

REFERENCES
1. Snijders T, Bosker R. Multilevel Analysis. Sage: London, 1999.
2. Leyland AH, Goldstein H (eds). Multilevel Modelling of Health Statistics. Wiley: Chichester, 2001.
3. Sullivan LM, Dukes KA, Losina E. Tutorial in biostatistics: an introduction to hierarchical linear modelling. Statistics in Medicine 1999; 18:855–888.
4. Zeger SL, Liang K, Albert P. Models for longitudinal data: a generalised estimating equation approach. Biometrics 1988; 44:1049–1060.
5. Liang K-Y, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B 1992; 54:3–40.
6. Heagerty PJ, Zeger SL. Marginalized multilevel models and likelihood inference (with discussion). Statistical Science 2000; 15:1–26.
7. Lindsey JK, Lambert P. On the appropriateness of marginal models for repeated measurements in clinical trials. Statistics in Medicine 1998; 17:447–469.
8. Goldstein H, Rasbash J. Efficient computational procedures for the estimation of parameters in multilevel models based on iterative generalised least squares. Computational Statistics and Data Analysis 1992; 13:63–71.
9. Longford NT. A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested effects. Biometrika 1987; 74:812–827.
10. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics 1982; 38:963–974.
11. Bryk AS, Raudenbush SW. Hierarchical Linear Models. Sage: Newbury Park, California, 1992.
12. Rasbash J, Browne W, Goldstein H, Yang M, Plewis I, Healy M, Woodhouse G, Draper D, Langford I, Lewis T. A User's Guide to MLwiN (2nd edn). Institute of Education: London, 2000.
13. Goldstein H. Multilevel mixed linear model analysis using iterative generalised least squares. Biometrika 1986; 73:43–56.
14. Zhou X, Perkins AJ, Hui SL. Comparisons of software packages for generalized linear multilevel models. American Statistician 1999; 53:282–290.
15. Spiegelhalter DJ, Thomas A, Best NG. WinBUGS Version 1.3: User Manual. MRC Biostatistics Research Unit: Cambridge, 2000.
16. Goldstein H, Healy MJR, Rasbash J. Multilevel time series models with applications to repeated measures data. Statistics in Medicine 1994; 13:1643–1655.
17. Kenward MG, Roger JH.
Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 1997; 53:983–997.
18. Goldstein H. Multilevel Statistical Models. Arnold: London, 1995.
19. Grizzle JC, Allen DM. An analysis of growth and dose response curves. Biometrics 1969; 25:357–361.
20. Albert PS. Longitudinal data analysis (repeated measures) in clinical trials. Statistics in Medicine 1993; 18:1707–1732.
21. Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika 1999; 86:677–690.
22. Pourahmadi M. Maximum likelihood estimation of generalised linear models for multivariate Normal covariance matrix. Biometrika 2000; 87:425–436.
23. Diggle PJ. An approach to the analysis of repeated measurements. Biometrics 1988; 44:959–971.
24. Goldstein H. Flexible models for the analysis of growth data with an application to height prediction. Revue Epidemiologie et Sante Publique 1989; 37:477–484.
25. Goldstein H, Yang M, Omar R, Turner R, Thompson S. Meta analysis using multilevel models with an application to the study of class size effects. Journal of the Royal Statistical Society, Series C 2000; 49:399–412.
26. Raudenbush SW. A crossed random effects model for unbalanced data with applications in cross sectional and longitudinal research. Journal of Educational Statistics 1993; 18:321–349.
27. Rasbash J, Goldstein H. Efficient analysis of mixed hierarchical and cross classified random structures using a multilevel model. Journal of Educational and Behavioural Statistics 1994; 19:337–350.
28. Rasbash J, Browne W. Non-hierarchical multilevel models. In Multilevel Modelling of Health Statistics, Leyland A, Goldstein H (eds). Wiley: Chichester, 2001.
29. Hill PW, Goldstein H. Multilevel modelling of educational data with cross-classification and missing identification of units. Journal of Educational and Behavioural Statistics 1998; 23:117–128.
30. Langford I, Leyland AH, Rasbash J, Goldstein H. Multilevel modelling of the geographical distributions of diseases. Journal of the Royal Statistical Society, Series C 1999; 48:253–268.
31. Goldstein H, Rasbash J, Browne W, Woodhouse G, Poulain M. Multilevel models in the study of dynamic household structures. European Journal of Population 2000; 16:373–387.
32. Hedges LV, Olkin IO. Statistical Methods for Meta Analysis. Academic Press: Orlando, Florida, 1985.
33. Raudenbush S, Bryk AS. Empirical Bayes meta-analysis. Journal of Educational Statistics 1985; 10:75–98.
34. Hedeker D, Gibbons R. A random effects ordinal regression model for multilevel analysis. Biometrics 1994; 50:933–944.
35. Raudenbush SW, Yang M, Yosef M. Maximum likelihood for generalised linear models with nested random effects via high-order multivariate Laplace approximation. Journal of Computational and Graphical Statistics 2000; 9:141–157.
36. Barbosa MF, Goldstein H. Discrete response multilevel models for repeated measures: an application to voting intentions data. Quality and Quantity 2000; 34:323–330.
37. Olsen MK, Schafer JL. A two-part random effects model for semi-continuous longitudinal data. Journal of the American Statistical Association 2001; 96:730–745.
38. Clayton DG. A Monte Carlo method for Bayesian inference in frailty models. Biometrics 1991; 47:467–485.
39. McCullagh P, Nelder J. Generalised Linear Models. Chapman and Hall: London, 1989.
40. Browne W, Draper D. Implementation and performance issues in the Bayesian and likelihood fitting of multilevel models. Computational Statistics 2000; 15:391–420.
41. Geman S, Geman D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984; 45:721–741.
42. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equations of state calculations by fast computing machines. Journal of Chemical Physics 1953; 21:1087–1092.
43. Hastings WK.
Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970; 57:97–109.
44. Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C 1992; 41:337–348.
45. Goldstein H. Consistent estimators for multilevel generalised linear models using an iterated bootstrap. Multilevel Modelling Newsletter 1996; 8(1):3–6.
46. Carpenter J, Goldstein H, Rasbash J. A nonparametric bootstrap for multilevel models. Multilevel Modelling Newsletter 1999; 11(1):2–5.
TUTORIAL IN BIOSTATISTICS

Hierarchical linear models for the development of growth curves: an example with body mass index in overweight/obese adults

Moonseong Heo1, Myles S. Faith2, John W. Mott3, Bernard S. Gorman4, David T. Redden5 and David B. Allison5,6,*,†

1 Department of Psychiatry, Weill Medical School of Cornell University, White Plains, NY, U.S.A.
2 Obesity Research Center, St. Luke's-Roosevelt Hospital Center, Columbia University College of Physicians and Surgeons, New York, NY, U.S.A.
3 New York State Psychiatry Institute, New York, NY, U.S.A.
4 Nassau Community College, New York, NY, U.S.A.
5 Department of Biostatistics, School of Public Health, The University of Alabama at Birmingham, Birmingham, Alabama, U.S.A.
6 Clinical Nutrition Research Center, School of Public Health, The University of Alabama at Birmingham, Birmingham, Alabama, U.S.A.
SUMMARY
When data are available on multiple individuals measured at multiple time points that may vary in number or inter-measurement interval, hierarchical linear models (HLM) may be an ideal option. The present paper offers an applied tutorial on the use of HLM for developing growth curves depicting natural changes over time. We illustrate these methods with an example of body mass index (BMI; kg/m²) among overweight and obese adults. We modelled among-person variation in BMI growth curves as a function of subjects' baseline characteristics. Specifically, growth curves were modelled with two-level observations, where the first level was each time point of measurement within each individual and the second level was each individual. Four longitudinal databases with measured weight and height met the inclusion criteria and were pooled for analysis: the Framingham Heart Study (FHS); the Multiple Risk Factor Intervention Trial (MRFIT); the National Health and Nutritional Examination Survey I (NHANES-I) and its follow-up study; and the Tecumseh Mortality Follow-up Study (TMFS). Results indicated that significant quadratic patterns of the BMI growth trajectory depend primarily upon a combination of age and baseline BMI. Specifically, BMI tends to increase with time for younger people with relatively moderate obesity (25 ≤ BMI < 30) but decrease for older people regardless of degree of obesity. The gradients of these changes are inversely related to baseline BMI and do not substantially depend on gender. Copyright © 2003 John Wiley & Sons, Ltd.
KEY WORDS:
obesity; body mass index; growth curves; hierarchical linear model; pooling
* Correspondence to: David B. Allison, Department of Biostatistics, Clinical Nutrition Research Center, School of Public Health, The University of Alabama at Birmingham, 327M Ryals Public Health Building, 1665 University Boulevard, Birmingham, AL 35294-0022, U.S.A.
† E-mail: [email protected]
Contract/grant sponsor: National Institute of Health; contract/grant numbers: R01DK51716, P01DK42618, P30DK26687, K08MH01530.
1. INTRODUCTION

It is useful in many applied settings to estimate the expected pattern of within-individual change in some continuously distributed variable over time. This may involve developing depictions of growth or maturation in development studies (for example, reference [1]), 'course' in a chronic disorder (for example, reference [2]), short-term patterns of change in response to a stimulus such as an oral glucose load (for example, reference [3]), or change in a clinical trial [4]. There are many possible approaches to estimating such patterns of change. However, many have serious limitations when longitudinal changes in data vary with individual characteristics, not all individuals are measured the same number of times, and not all individuals are measured at the same intervals. These are common occurrences in observational studies. Hierarchical linear modelling (HLM) [5], an extension of the mixed model – also referred to as random coefficient regression modelling [6], multilevel linear modelling [7], nested modelling, or seemingly unrelated regression – is a very flexible approach to such situations [8–11]. Among other reviews, Sullivan et al. [12] recently presented a tutorial on HLM in the spirit of analysis of covariance (ANCOVA) with random coefficients. The purpose of the present paper is to illustrate how HLM can be extended to develop growth curves with an easily understandable and clinically important example, body mass index (BMI; kg/m²). In Section 2, we present a pedagogical exposition with hypothetical data and introduce available statistical software for conducting HLM. In Section 3, the clinical implication of derived BMI growth curves is discussed. In Section 4, procedures for selecting specific databases for developing BMI growth curves are presented. In Section 5, specific statistical approaches for conducting HLM are introduced. In Section 6, results are presented and the nuances of the growth curves discussed. In Section 7, methodological issues related to HLM and clinical application of the developed curves are reviewed.
2. PEDAGOGICAL EXPOSITION OF HLM

2.1. Example

It has been said that a statistician is someone who can drown in a river that has an average depth of one foot. Obviously, one has to consider variability of depths around average levels. We also have to realize that when we pool individual time series and growth curves, these curves may vary in their mean levels, their trends, and in their variability around mean levels. In this section we will show how multilevel modelling can be used to assemble composite growth curves that account for variability attributable to different data sources, different data collection periods and different individuals. As a prelude to the analysis of body mass index (BMI; kg/m²) data, we will start with a simple example. We will first analyse the data by ordinary least squares (OLS) methods and then reanalyse the data by means of hierarchical linear modelling (HLM). Table I presents hypothetical growth curve data from 30 cases observed on ten occasions. Plots of their growth are shown in Figure 1. We shall designate the first 15 cases as 'group 1'. The next 15 cases will be designated as 'group 2'. Because we believed that some cases show both linear and curvilinear growth, we formed a quadratic equation model to predict the score of any
Table I. Hypothetical example data with 30 cases in two groups over 10 time periods.

Case  Group  Contrast code    T1     T2     T3     T4     T5     T6     T7     T8     T9     T10
  1     1        −1          6.46   8.33  11.48  13.28  15.53  17.10  20.37  22.65  26.10  29.46
  2     1        −1          7.03   8.15  10.76  12.53  14.10  15.99  18.14  20.55  21.88  25.29
  3     1        −1          7.37   9.31  11.15  13.68  17.30  19.23  22.46  25.92  30.08  34.27
  4     1        −1          6.11   9.16  10.36  14.09  16.03  18.87  21.53  25.23  28.67  32.22
  5     1        −1          6.91   8.50   9.90  12.11  13.94  16.01  17.69  17.89  20.08  21.91
  6     1        −1          5.73  10.03  10.17  11.69  12.34  14.86  16.05  16.84  17.86  19.14
  7     1        −1          7.06   9.54   9.66  12.15  15.61  17.90  21.32  24.74  27.87  30.90
  8     1        −1          6.61   8.17  10.89  12.90  14.08  16.71  18.30  21.16  24.22  27.04
  9     1        −1          7.58   8.02  10.88  13.75  16.28  18.16  20.27  24.54  26.59  29.18
 10     1        −1          5.90   9.00  10.74  12.29  15.19  17.50  19.40  22.91  25.03  28.14
 11     1        −1          6.65   7.73  10.06  11.96  13.97  15.10  17.20  18.03  19.90  21.93
 12     1        −1          6.50   8.43   9.71  11.31  13.40  14.31  15.92  18.90  19.44  21.98
 13     1        −1          6.57   8.73  10.37  12.74  14.73  15.85  17.68  20.15  22.11  23.96
 14     1        −1          6.27   9.26  10.89  13.53  16.26  18.67  20.92  25.26  28.44  31.56
 15     1        −1          6.01   8.41  10.23  15.38  16.50  19.38  23.76  27.52  30.97  35.49
 16     2         1          7.55  10.75  12.91  17.13  21.40  25.22  30.76  35.37  41.71  47.78
 17     2         1          7.81   9.95  13.36  17.07  21.36  25.16  29.13  35.11  40.54  45.87
 18     2         1          6.05  10.55  13.83  17.45  19.68  24.62  29.22  33.88  40.17  46.30
 19     2         1          6.96  10.99  13.99  17.49  21.13  26.96  31.96  37.68  42.93  50.39
 20     2         1          6.99  10.44  13.26  17.33  19.50  24.79  28.23  32.72  36.85  43.05
 21     2         1          7.91  10.14  12.71  15.56  19.18  22.69  25.70  29.76  34.41  37.04
 22     2         1          6.74  10.29  14.26  17.38  20.19  25.51  29.97  34.88  41.39  47.17
 23     2         1          7.79   9.92  13.01  16.26  19.66  24.49  29.92  33.19  38.97  45.34
 24     2         1          7.97  10.90  14.12  17.78  21.75  26.94  31.77  37.68  44.49  50.07
 25     2         1          7.88  11.06  12.62  17.31  20.01  23.87  28.34  31.86  37.10  41.16
 26     2         1          7.71   9.64  13.24  15.47  18.93  22.37  26.06  31.07  34.55  38.73
 27     2         1          8.30  10.48  12.45  16.49  19.18  22.76  26.83  30.62  35.88  39.65
 28     2         1          8.11  10.34  13.37  16.34  19.80  24.41  28.59  33.64  37.41  43.41
 29     2         1          7.33   9.89  13.52  18.25  22.13  26.33  31.87  37.63  43.01  49.58
 30     2         1          8.05  11.40  14.59  17.56  21.20  25.40  31.25  35.69  41.67  48.06
individual j, given any time point t. Our equation is as follows:

y_jt = b0 + b1 t + b2 t² + e_jt

in which b0 is the intercept, b1 is the linear term, b2 is the quadratic term, and e is the residual. Equations that characterize the prediction of the smallest data unit are called 'level-1' equations. In this example, the smallest data unit is a score from an individual at a given time point. Thus, we have 300 level-1 units. An OLS solution based on the 300 scores obtained from the 30 cases at 10 time periods produced the level-1 equation y = 5.11 + 1.95t + 0.12t². Both the linear and quadratic terms were significantly different from zero at the 0.01 level. Note that level-1 units are nested within larger groupings. In the case of time series and growth curve data, the higher level is typically the individual case. There are 30 'level-2' units in this example. If we wished to do so, we could have considered even higher-level units. For example, we could have considered the two groups of 15 cases each to be 'level-3' units.
Figure 1. Graphical representation of the growth curves for the 30 hypothetical cases over 10 time periods in data of Table I; group 1 and 2 members are represented by solid and dotted lines, respectively.
However, for our purposes, we will consider group membership to be a characteristic of each level-2 unit. Let us show how level-2 characteristics can be related to the level-1 equation. First, let us carry out the same level-1 regression equation within each group. When we select only the 15 cases in group 1, we obtain the equation (Table II) y = 4.91 + 1.74t + 0.05t². The equation for group 2 is y = 5.30 + 2.16t + 0.18t². Next, we shall carry out the equation within each one of our 30 level-2 units. Their intercepts, linear and quadratic coefficients are presented in Table II, in which the coefficients for the intercept, linear and quadratic terms differed considerably among cases and between groups. While cases in both groups had positive linear components, the linear and quadratic components for the members of group 2 are typically larger than for those in group 1. We can form equations that show how slopes and intercepts for level-2 units (cases) differ as a function of characteristics such as group membership. These equations, in which the dependent variables are regression coefficients and the independent variables are characteristics of level-2 units, such as group membership, age, health status etc., are called 'level-2' equations. In our example, we consider only one independent variable: group membership. If we create a contrast code for group membership, G, so that members of group 1 are coded as −1 and those in group 2 are coded as +1 – that is, G = −1 and G = 1 for the subjects in groups 1 and 2, respectively – then we can regress the slopes and intercepts of each case on group membership. Using this approach, often called 'slopes as outcomes', yields the OLS level-2 equations (Table III): b0 = 5.106 + 0.196G, b1 = 1.950 + 0.209G, and b2 = 0.116 + 0.065G for the intercept, linear and quadratic terms, respectively; the effects of group membership are statistically significant only for b2 at the 0.05 α-level. Inserting the group
Table II. Individual and group growth functions.

Case       Group   Contrast code      b0        b1        b2
  1          1         −1           4.9717    1.7028    0.0711
  2          1         −1           5.3630    1.5594    0.0390
  3          1         −1           5.8408    1.4252    0.1402
  4          1         −1           4.4905    1.8721    0.0893
  5          1         −1           4.7575    1.9535   −0.0261
  6          1         −1           4.8202    1.9449   −0.0531
  7          1         −1           5.5502    1.3102    0.1278
  8          1         −1           5.2838    1.4760    0.0676
  9          1         −1           4.9323    1.9040    0.0550
 10          1         −1           4.4497    1.8710    0.0485
 11          1         −1           4.5135    1.9063   −0.0193
 12          1         −1           5.2107    1.4509    0.0207
 13          1         −1           4.8520    1.8775    0.0028
 14          1         −1           4.7593    1.8126    0.0877
 15          1         −1           3.8638    2.0476    0.1101
 16          2          1           5.4038    2.0231    0.2215
 17          2          1           5.1818    2.1956    0.1891
 18          2          1           4.7237    2.1944    0.1917
 19          2          1           4.9400    2.2586    0.2256
 20          2          1           4.7427    2.4489    0.1326
 21          2          1           5.2245    2.2963    0.0949
 22          2          1           4.9218    2.2094    0.2001
 23          2          1           5.5230    1.8689    0.2092
 24          2          1           5.6982    2.0624    0.2417
 25          2          1           5.4235    2.3266    0.1273
 26          2          1           5.2567    2.1308    0.1247
 27          2          1           6.0532    1.8966    0.1501
 28          2          1           5.7103    2.0230    0.1742
 29          2          1           4.3565    2.5504    0.1966
 30          2          1           6.3695    1.8932    0.2261

Means
All cases                            5.1063    1.9497    0.1156
Group 1                              4.9106    1.7409    0.0508
Group 2                              5.3019    2.1585    0.1804
contrast codes into the level-2 equations yields the mean slopes for each group, as shown in Table II. If we used the group-specific coefficients for slopes instead of the original coefficients from an overall level-1 equation based on all 300 units to predict scores within each case, we would have increased prediction accuracy. Thus, it can be seen that level-2 equations can provide useful information for level-1 equations because we can see how level-1 coefficients are functions of level-2 characteristics. The two-step OLS procedure described above was used to illustrate the notion of multilevel modelling. However, it has several drawbacks. For one, it is extremely cumbersome to first compute separate level-1 equations within each level-2 unit and then compute level-2 equations to link slopes and intercepts to level-2 information. In the BMI study to follow, this would involve thousands of individual equations. For another, the data for each
Table III. Estimates of fixed and random coefficients.

                                    OLS                        HLM
                             Estimate     SE           Estimate     SE
Fixed coefficients
  Intercept      γ0          5.1063*      0.4918       5.1063*      0.1072
                 γ01         0.1957       0.4918       0.1957       0.1072
  Linear         γ1          1.9497*      0.2054       1.9497*      0.0449
                 γ11         0.2088       0.2054       0.2088*      0.0449
  Quadratic      γ2          0.1156*      0.0182       0.1156*      0.0092
                 γ21         0.0648*      0.0182       0.0648*      0.0092

Random coefficients
  Level-2
    var(u0)                  0.2451*      0.0633       0.0034       0.0950
    cov(u0, u1)             −0.0887*      0.0250       0.0011       0.0375
    var(u1)                  0.0446*      0.0115       0.0009       0.0166
    cov(u0, u2)              0.0045       0.0046      −0.0025       0.0057
    cov(u1, u2)             −0.0043*      0.0021      −0.0004       0.0025
    var(u2)                  0.0025*      0.0006       0.0021*      0.0007
  Level-1
    var(e)                   0.1799*      0.0148       0.2466*      0.0241

* p-value < 0.05.
level-2 unit is collapsed to a few data points, thus losing both statistical power and information about intra-individual variation. Further, to have comparable equations using OLS methods, each level-2 unit should have the same number of observations based on the same times of observation, with no missing data values. While highly desirable, this is often not feasible in many clinical studies. Finally, OLS estimation is based on the assumption that the residual from the prediction of each dependent variable observation is independent of those of other observations. This is highly unlikely to hold with longitudinal data because observations are nested within higher-level units. In the present example, while there were 300 observations for scores, these observations were nested within 30 individuals. While estimates of coefficients will not necessarily be biased, standard errors for these coefficients will almost always be biased. Hierarchical linear modelling (HLM) provides a more elegant approach to computing multilevel regression problems than does OLS. In HLM, we state both the level-1 and level-2 equations and then iteratively solve for coefficients at both levels. This is typically achieved by means of numerical methods such as full information maximum likelihood estimation (FIML). Typically, OLS solutions are used as starting values of the level-1 and level-2 coefficients, but these coefficients are then refined through iteration. We typically designate the level-1 coefficients in HLM models using b to symbolize slopes and intercepts. We label the coefficients of our level-2 equations using γ. Level-1 residuals are designated by e while level-2 residuals are designated by u. Figure 2 displays a structural model for our example.
[Figure 2 shows a path diagram in which Constant, Time and Time² predict Score through the level-1 coefficients b0, b1 and b2, with level-1 residual e; each level-1 coefficient is in turn modelled at level 2 as γ0 + γ01·Group + u0, γ1 + γ11·Group + u1 and γ2 + γ21·Group + u2, respectively.]
Figure 2. Diagram of the model fitting the growth curves in Figure 1 of the 30 hypothetical cases used for a pedagogical exposition.
As in the OLS example, b0, b1 and b2 are the respective level-1 intercept, linear and quadratic terms. Level-2 equations predict the level-1 coefficients. For example, to predict b0, the intercept, we use γ0 to predict a common intercept for all level-1 units and γ01 to account for additional effects of group membership, while u0 indicates a residual that is unique to the level-2 unit. Similarly, for b1, the linear term, we use γ1 to represent a common linear term for all level-1 units and γ11 to represent the effect of group membership, and u1 represents the unique residual for the linear term. Finally, to predict the quadratic coefficient, γ2 represents a common quadratic effect, γ21 represents the effect of group on the quadratic coefficient and u2 represents a unique residual. It is assumed that e and each of the u's are independent, but u0, u1 and u2 are correlated with each other. We could combine the level-1 and level-2 equations into one equation:

y = γ0 + γ01 G + γ1 t + γ11 Gt + γ2 t² + γ21 Gt² + t u1 + t² u2 + u0 + e

The terms with the γ coefficients are called fixed coefficients because they are specified as independent variables in the level-1 and level-2 equations. The terms involving u and e are random coefficients, as they cannot be measured beforehand. It may be useful to think of terms in which coefficients are multipliers of independent variables as moderator effects. Thus, depending on their magnitude, they will moderate the strength and, possibly, the direction of the relationships between the independent and dependent variables. The program MLA [13] (http://www.fsw.leidenuniv.nl/www/w3_ment/medewerkers/bushing/MLA.HTM), a user-friendly software program, was used to compute the HLM model for the present example. A code file for this program can be found in Appendix A1. The results of the HLM analysis and the OLS analysis are presented in Table III.
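For readers working in Python rather than with the MLA, SAS or MLwiN code given in the Appendix, the combined model above could be fitted along the following lines with the statsmodels package; the long-format file name and column names are assumptions made for this sketch.

import pandas as pd
import statsmodels.formula.api as smf

# long_df holds one row per case x time point with columns:
#   case (1-30), G (the -1/+1 contrast code), t (1-10) and y (the outcome score)
long_df = pd.read_csv("table1_long.csv")              # hypothetical file built from Table I
long_df["t2"] = long_df["t"] ** 2

model = smf.mixedlm("y ~ t + t2 + G + G:t + G:t2",    # fixed part: the gamma terms
                    data=long_df,
                    groups=long_df["case"],
                    re_formula="~ t + t2")             # random intercept, linear and quadratic terms
fit = model.fit(reml=False)                            # full maximum likelihood, as in the text
print(fit.summary())

The fixed-effect estimates correspond to γ0, γ01, γ1, γ11, γ2 and γ21, and the estimated random-effects covariance matrix to the var(u) and cov(u) terms of Table III.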
Examination of Table III shows that while the estimates of the fixed coefficients for OLS and HLM are the same, their standard errors differ. While this is not always the case, the standard errors are smaller for the HLM solution. Table III also reveals that while OLS produced the same fixed coefficients, the two methods differed for the random coefficients. This effect is due to the fact that HLM's iterative algorithms allow it to use more information. The present example used a data set in which no case had any missing data values. Because HLM revises its parameter estimates in an iterative manner, it is not absolutely necessary to have complete data sets. Of course, the more complete the data, the better the estimates. As shown in the following BMI growth curve example, this will allow us to combine fragments of growth curves and then 'splice' them together to form a larger curve. The present example used level-2 equations that modelled both common and group-specific gammas and unique residuals for the intercept, the linear and the quadratic term. However, HLM provides a great deal of flexibility for modelling effects. For example, we could specify the same slopes for all subjects but permit the intercepts to vary. If we created a model with the same slopes and intercepts for each case, then we would have a model that is identical to an OLS model that ignores clustering within subjects. At this point, one might ask how we might select the best model. As with other multivariate procedures, the best model is typically one that balances minimization of error variance with minimization of the number of parameters to be estimated. All of the programs that perform HLM provide measures of model fit, ranging from mean-squared error terms to deviance measures. The use of Akaike's information criterion (AIC) [14] and Bayesian information criterion (BIC) [15] indexes can help choose models that minimize errors while adjusting for the number of parameters to be estimated. These criteria are applied in the subsequent analyses.

2.2. Statistical software

Besides MLA, there exists a range of statistical packages that can conduct HLM. Goldstein (reference [7, Chapter 11]) described all of the following packages in terms of source, modelling, adopted methodologies and limitations: BIRAM; BMDP-5V; BUGS (http://www.mrc-bsu.cam.ac.uk/bugs); EGRET (http://www.cytel.com); GENMOD; GENSTAT (http://www.nag.co.uk/stats/TT_soft.asp); HLM (http://www.ssicentral.com); ML3/MLn (http://multilevel.ioe.ac.uk/index.html); SABRE (http://www.cas.lancs.ac.uk/software/sabre3.1/sabre.html); SAS (http://www.sas.com); SUDAAN (http://www.rti.org); and VARCL (distributed by ProGamma, http://www.gamma.rug.nl). Kreft et al. [16] reviewed five particular packages: BMDP-5V, GENMOD, HLM, ML3 and VARCL. Singer [17] demonstrated the practical use of the SAS PROC MIXED program in multilevel modelling with program codes. Sullivan et al. [12] further detailed programming with SAS PROC MIXED and HLM/2L [18]. The most recent revisions of standard SEM packages – such as LISREL (http://www.ssicentral.com), EQS (http://www.mvsoft.com/), Mplus (http://www.statmodel.com/index2.html) and AMOS (http://www.smallwaters.com/amos/) – have multilevel analysis capability. In addition, GLLAMM6 (http://www.iop.kcl.ac.uk/iop/departments/biocomp/programs/gllamm.html), MIXOR/MIXREG (http://www.uic.edu/~hedeker/mix.html) and OSWALD (http://www.maths.lancs.ac.uk/Software/Oswald) can also perform multilevel analyses.
In Appendix A, we include code and output files from the following three software packages, using the data in Table I: MLA, SAS and MLwiN. The first two used the maximum likelihood (ML) method for fitting purposes, but MLwiN used the iterative generalized least squares (IGLS)
method, which is asymptotically equivalent to the ML method when the normal distribution assumption is met [7]. Although all these software programs produced the same estimates of the fixed effects and their standard errors, the other estimates are slightly different. These differences, an issue for future investigation, may be due to program-specific maximization routines and/or the relatively small sample size of the example data in Table I. However, the significances of the estimated random effects agree among the three software packages. For the subsequent analyses and calculations, we used the MLwiN v1.0 software [19] (http://multilevel.ioe.ac.uk/index.html), the Windows version of MLn. The other software packages in the Appendix were not able to accommodate the size of the data that we analysed below.
3. CLINICAL IMPLICATION OF DEVELOPMENT OF BMI GROWTH CURVE

Obesity is increasingly common in the United States and is a serious risk factor for decreased longevity and many medical disorders such as diabetes and cardiovascular disease [20]. According to the third National Health and Nutrition Examination Survey (NHANES-III), over half the U.S. adult population is overweight or obese [21] and the prevalence continues to increase [22]. From an epidemiologic point of view, BMI alone may not capture the true relations of body composition to health outcomes [23] because BMI does not distinguish between lean mass and fat mass. Nevertheless, BMI is highly correlated with weight and fat mass, and tends to be approximately uncorrelated with height. BMI is arguably the most commonly used index of relative weight in clinical trials and epidemiological studies of the health consequences of obesity [24]. Furthermore, BMI was the index adopted in the most recent NHLBI 'Clinical guidelines for the identification, education, and treatment of overweight and obesity in adults' [21]. For these reasons, we used BMI in our analyses. In this paper, overweight/obesity is defined as a body mass index (BMI; kg/m²) greater than or equal to 25. Although the NHLBI guidelines [21] classify this as 'overweight' (25 ≤ BMI < 30) and 'obesity' (BMI ≥ 30), we combine these groups for pedagogical reasons. A growing number of clinical trials evaluating adult obesity treatments have been initiated in recent years [21]. A primary aim of these clinical trials is to assess the long-term effectiveness of the experimental treatment [25]. At present, however, there are few data to predict the expected change in weight associated with the passage of time among obese persons who do not receive treatment. For ethical and practical reasons, it is generally not possible to maintain a 'pure' control group of obese adults from whom all treatments are withheld over a number of years. As Brownell and Jeffery [26] argue, this makes it difficult to assess the long-term effectiveness of an experimental treatment since there is no control group against which it can be compared. Furthermore, the point of such comparison is a moving target. This issue is especially relevant in light of proposals that obese people receive treatment for life [27]. One possible solution to this problem is the development of 'natural growth curves' of body weight for overweight/obese individuals in the general population. In theory, such growth curves could provide ideal control group data against which experimental groups could be compared over time. That is, growth curves could serve as a 'surrogate' for the performance of control groups. A few cross-sectional studies (for example, reference [22]) have attempted to model BMI changes with age. However, this approach can yield misleading results since it assumes that the within-individual growth curve is the same as the between-individual curve. This need
not be the case, especially if the growth trajectory depends on individuals' characteristics such as sex, age and initial baseline BMI. For example, the trajectories could be linearly increasing, linearly decreasing, U-shaped or inverted U-shaped, etc. Other limitations of cross-sectional studies include susceptibility to cohort secular effects and selection bias. This necessitates modelling variation in growth curves between individuals as a function of such characteristics. As demonstrated in the following sections, HLM offers a flexible approach for developing such 'BMI growth curves'. Given these modelling flexibilities and the potential clinical utility of BMI growth curves, this study used HLM to develop an estimate of natural growth curves for overweight/obese adults in the U.S. population.
4. SELECTION OF DATABASES TO DEVELOP BMI GROWTH CURVES

Databases satisfying the following requirements were included in the present study:
1. Raw individual subject data on measured weight and height available at multiple time points. Measured height and weight were required since self-reported weights are often underestimated, especially by obese individuals [28].
2. Individuals aged ≥ 18 years at baseline. Younger individuals, if any, were excluded.
3. Included overweight/obese individuals, defined as BMI ≥ 25, at baseline. Individuals with lower BMIs were excluded.
4. Baseline information available on sex and age.

A thorough search of eligible databases was conducted via the following data sources: electronic government databases on health information published through the National Center for Health Statistics (NCHS) for the databases of the Centers for Disease Control and Prevention (CDC); Young et al. [29]; and archival data repositories at the Henry A. Murray Research Center and the Institute for Social Research. Four eligible databases were located. They are summarized below and in Table IV.
4.1. Framingham Heart Study (FHS)

This longitudinal prospective cohort study was an epidemiological investigation of cardiovascular disease. It was initiated in 1948 and followed an initial sample of 5209 residents from Framingham County of Massachusetts every other year for 30 years, yielding up to 16 examinations per subject [30]. Our inclusion criteria resulted in 2665 eligible subjects.
4.2. Multiple Risk Factor Intervention Trial (MRFIT)

This study, initiated in 1974, followed 12866 men every year for 7–8 years, yielding up to eight examinations per subject [31, 32]. The objective of this trial was to intervene on the primary risk factors for coronary heart disease. The study included an intervention and a control group. We excluded 6428 subjects assigned to the intervention group to avoid possible effects of the interventions on natural weight change. The inclusion criteria were therefore applied only to the control group subjects and resulted in 4599 subjects eligible for the present study.
Table IV. Descriptive statistics of eligible subjects by databases.

Mean (SD; min, max) over subjects.

Study                 Sample size (% male)   Number of time points* of measurements   Follow-up years from baseline
FHS                   2665 (50)              11.2 (4.75; 1, 16)                       22.0 (9.2; 0, 30)
MRFIT                 4599 (100)             6.8 (1.58; 1, 8)                         6.06 (1.4; 0, 7.9)
NHANES-I and NHEFS    4999 (48)              1.8 (0.40; 1, 2)                         7.96† (4.1; 0, 12.6)
TMFS                  2112 (54)              2.4 (0.83; 1, 3)                         5.81 (3.5; 0, 10.1)
Total                 14375 (66)             5.2 (4.23; 1, 16)                        9.63 (7.7; 0, 30)

                                                                    Baseline smoking status
Study                 Baseline age           Baseline BMI           Ever‡ (%)       Never (%)       Missing (%)    Total
FHS                   45.7 (8.5; 28, 62)     28.7 (3.42; 25, 56.7)  1383 (51.9%)    1262 (47.4%)    20 (0.7%)      2665
MRFIT                 46.4 (6.0; 35, 58)     28.8 (2.89; 25, 40.9)  2579 (56.1%)    2020 (43.9%)    0 (0.0%)       4599
NHANES-I and NHEFS    52.1 (14.7; 25, 77)    29.3 (4.08; 25, 63.3)  746 (14.9%)     692 (13.9%)     3561 (71.2%)   4999
TMFS                  44.8 (14.6; 18, 91)    29.0 (3.90; 25, 56.1)  888 (42.1%)     1223 (57.9%)    1 (0.0%)       2112
Total                 48.1 (11.8; 18, 91)    29.0 (3.60; 25, 63.3)  5596 (38.9%)    5197 (36.2%)    3852 (24.9%)   14375

* Including baseline.
† Among those who had both measurements performed (baseline and follow-up), the average follow-up time was nearly 10 years.
‡ 'Current' plus 'former' smoker.
4.3. National Health and Nutrition Examination Survey I (NHANES-I) and its National Health Epidemiologic Follow-up (NHEFS)

The NHANES-I, designed to measure the nutritional status and health of the U.S. population aged 1–74 years, was conducted between 1971 and 1975, and examined 23808 subjects. A subset was followed up in 1982–1984, 1986, 1987 and 1992 [33]. Height and weight were measured only in the follow-up between 1982 and 1984. Subsequently, height and weight were self-reported [34]. Therefore, the subjects were measured, at most, twice in this study. The inclusion criteria resulted in 4999 eligible subjects.

4.4. Tecumseh Mortality Follow-up Study (TMFS)

This study was a longitudinal prospective investigation of health and disease status among residents of Tecumseh, Southeast Michigan. Heights and weights of subjects were measured over the entire follow-up period. Heights and weights of 8800 residents of 2400 households were first measured in 1959–1960 and served as our baseline. Two more follow-ups occurred in 1962–1965 and 1967–1969 [35]. A total of 2112 subjects were eligible for the present study.

4.5. Pooled data

We excluded from the analysis measurements at any follow-up time point taken from women who were pregnant at that time point. The four databases above were pooled into a single data set, yielding a total of 75351 data units (the first-level observations) from 14375 subjects (the second-level observations). Additionally, baseline smoking status ('ever' versus 'never') was available on 10793 subjects, yielding 68830 data units. When smoking status was included in models, subjects with missing observations were excluded from analyses.
5. STATISTICAL ANALYSES

5.1. Model construction

To estimate BMI trajectories over 'time', it is important to select an appropriate variable to reflect time in our model construction. The selection of the time variable and its origin must be clear in the context of a study. For example, in a quality control study of a copier, the number of copies produced by the copier could be a better time variable than the calendar time from its year of manufacture. Similarly, in the present study there are at least two alternative time variables: age of the subjects and time-on-study, the calendar time from baseline. One can construct an HLM using age as the key 'time' variable, as discussed in the context of the Cox proportional hazards regression model [36], although the origin of the time variable age, that is 0, may not be appropriate for the present study. In the present example, time-on-study may be a more appropriate variable because its time origin is clear and it is appealing from a clinical point of view to predict future BMI at different points in time. Let us denote 'time' by T_ij, the length of the time interval in years between the baseline and the consecutive ith (i = 1, 2, ..., n_j) follow-up for the jth (j = 1, 2, ..., N) individual in the pooled data set. For notational convenience, i = 1 for the baseline, which implies T_1j = 0,
for all j. First, for the first level of time points for each individual, the BMI, say Y_ij, was modelled in terms of T_ij as follows:

Y_ij = β_0j + β_1j T_ij + β_2j T_ij² + e_ij = T_ij R_j + e_ij
(1)
where Tij = (1; Tij ; Tij2 ) is a vector and Rj = (0j ; 1j ; 2j ) is a random vector and independent from eij , the rst-level residual. We assume that E(eij ) = 0 and cov(eij ; efg ) = I {i = f & j = g} 2 , where I {:} is an indicator function, that is, I {i = f & j = g} = 1, if i = f and j = g, and = 0, otherwise; variance of the rst-level error e is assumed to be homogeneous and independent among subjects as well as among dierent time points within subjects. This assumption implies that the rst-level residuals are uncorrelated after adjusting for trend, and var(eij ) = 2 for all i and j. However, this assumption does not necessarily imply that the rst-level observations within individuals are uncorrelated – see (4). The random coecients ’s are further modelled (as functions of covariates Z) for the second-level individuals by decomposing into xed and random components as follows: 0j = 00 + 01 Z1j + 02 Z2j + · · · + 0p0 Zp0 j + u 0j 1j = 10 + 11 Z1j + 12 Z2j + · · · + 1p1 Zp1 j + u1j and 2j = 20 + 21 Z1j + 22 Z2j + · · · + 2p2 Zp2 j + u2j where u 0j , u1j and u2j represent residuals for intercepts, slope and quadratic terms, respectively. It follows that for k = 0; 1 and 2 kj = Zkj k + ukj
(2)
where Zkj ’s are vectors with dierent lengths depending on k and their rst component is 1, Zkj = (1; Z1j ; Z2j ; : : : ; Zpk j ) and k = (k0 ; k1 ; k2 ; : : : ; kpk ) . The other components of Zkj ’s are the covariate values for the jth individual’s baseline characteristic. The vector S’s are xed parameter column vectors, and E(uj ) = 0 = (0; 0; 0) ; uj = (u 0j ; u1j ; u2j ) and var(u 0j ) var(u1j ) cov(uj ) = u = cov(u 0j ; u1j ) cov(u 0j ; u2j ) cov(u1j ; u2j ) var(u2j ) for all j. This covariance is modelled for the second-level variance of the growth curve. In sum, by introducing the second-level model (2) into the rst-level model (1), we have Yij = (Z0j S0 ; Z1j S1 ; Z2j S2 )Tij + Tij uj + eij the rst term in the right hand side (RHS) representing the xed eects and the other terms in the RHS the random eects (or parameters). Thus, the hierarchical linear model can be
represented in terms of expectation and covariance in general as

E(Y) = Xγ   and   cov(Y) = Σ    (3)

where Y = (Y1′, ..., Yj′, ..., YN′)′ and Yj = (Y1j, Y2j, ..., Ynjj)′. The expectation Xγ and the covariance matrix Σ represent fixed and random effects, respectively. Specifically, for the fixed effects, the row of the design matrix X corresponding to Yij is (Z0j, Tij Z1j, Tij² Z2j), and γ = (γ0′, γ1′, γ2′)′ is the fixed coefficient vector. On the other hand, regarding elements of Σ,

cov(Yij, Yfg) = I{i = f & j = g} σ² + I{j = g} Tij Σu Tfg′    (4)
In other words, the variance of a first-level observation is var(Yij) = σ² + Tij Σu Tij′, which is the sum of the first- and second-level variances, and the covariance between observations at two different time points i and f within an individual is Tij Σu Tfj′. This implies that the covariance matrix Σ is block-diagonal of size (Σ_{j=1}^N nj) × (Σ_{j=1}^N nj), that is, Σ = ⊕_{j=1}^N (σ² Inj + Vj), where ⊕ denotes the direct sum of matrices, Inj is the nj × nj identity matrix, and Vj is the nj × nj matrix with (i, f)th element Tij Σu Tfj′ for the jth individual. In this case, the ith time point is not necessarily different from the fth one. This covariance structure is crucial to the estimation procedure described in the following section, because the second-level covariance must be taken into consideration, which traditional approaches such as OLS estimation do not accommodate; OLS assumes that Vj is a zero matrix for all j.

5.2. Estimation
Under model (3) with unknown Σ, which is the general case, estimation of both fixed and random effects requires iterative generalized least squares (IGLS; for example, reference [37]). Briefly, the generalized least squares estimate of γ is γ̂ = (X′Σ̃⁻¹X)⁻¹X′Σ̃⁻¹Y for some initial estimate Σ̃. With the newly estimated γ̂ and the residuals calculated from it, the estimate Σ̃ can be updated. With the updated Σ̃, estimates γ̂ will again be obtained. Such iteration continues until some convergence criteria are met. For further details of the estimation procedures and properties of the resulting estimates, see Goldstein (reference [7, Chapter 2]). With a final Σ̃, the variance at the first level, the variance at the second level at time Tij = t, and the covariance of γ̂ are estimated as σ̃², (1, t, t²) Σ̃u (1, t, t²)′ and (X′Σ̃⁻¹X)⁻¹, respectively. The IGLS estimates are shown to be equivalent to maximum likelihood estimates [37] when the error terms are normally distributed, that is, eij ∼ N(0, σ²) and uj ∼ MN(0, Σu), which in turn implies

Y ∼ MN(Xγ, Σ)    (5)
However, even when the data are not multivariate normal, IGLS still yields consistent estimates [7].

5.3. Goodness of fit
For comparisons of several competing models with different first- and second-level variables X (especially with different Z), relative goodness of fit can be measured by the resulting log-likelihood. That is, under normality assumption (5), apart from constants,

log L(Y; Xγ, Σ) = −tr[Σ⁻¹(Y − Xγ)(Y − Xγ)′] − log|Σ|
where tr is the trace, that is, the sum of the diagonal terms of a square matrix, and |Σ| is the determinant of the matrix Σ. Because of the iterative nature of the fitting, and thus of the calculation of the likelihood, even individuals with only a single time point measurement contribute not only to the estimation of the intercept of the BMI time course but also to the estimation of the slopes, that is, the longitudinal changes in BMI. Alternatively, for the comparisons, we used the correlation r between the predicted values from the estimated fixed effects and the observations, that is,

r = corr(Y, Xγ̂)

and the mean squared error (MSE), that is,

MSE(Xγ̂) = (Y − Xγ̂)′(Y − Xγ̂)/(l(Y) − l(γ̂))

where l(Y) is the length of the vector Y. Significance testing of the joint effects of additional covariates was performed by comparing the so-called 'deviance', that is, minus twice the log-likelihood difference, which is asymptotically distributed as χ² with the number of additional covariates as the degrees of freedom. To see the relative increment in goodness of fit from those additional covariates, we calculated the pseudo multiple R² improvement described by Everitt (reference [38], p. 267), that is, the per cent increment in log-likelihood due to the addition of more independent variables in a model:

%R²(X : W)↑ = [1 − log L(Y; Xγ, Σ) / log L(Y; Wδ, Σ)] × 100

where W and δ are subsets of X and γ, respectively. In other words, the model with W is nested in the model with X. For comparisons among non-nested models (that is, when W and δ are not necessarily subsets of X and γ, respectively), both Akaike's information criterion (AIC) [14] and the Bayesian information criterion (BIC) [15] can be used. The AIC is simply the gain in log-likelihood minus the increase in the number of parameters; a positive AIC indicates an improvement in fit (reference [39], p. 272). Specifically, AIC can be defined as

AIC(X : W) = log L(Y; Xγ, Σ) − log L(Y; Wδ, Σ) − [l(γ) − l(δ)]

Similarly, BIC can be defined as
BIC(X : W) = log L(Y; Xγ, Σ) − log L(Y; Wδ, Σ) − [l(γ) − l(δ)] log(Σ_{j=1}^N nj)

Positive BIC also indicates an improvement in fit.
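For concreteness, a worked check of these criteria (using the −2 log L values reported in Table V and taking 260,932 for model M1 and 260,904 for model M2): the deviance difference for M2 versus M1 is 260,932 − 260,904 = 28 on 3 degrees of freedom (p < 0.0001); AIC(M2 : M1) = 14 − 3 = 11 > 0, whereas BIC(M2 : M1) = 14 − 3 log(75,351) ≈ −20 < 0; and %R²(M2 : M1) = [1 − (−130,452)/(−130,466)] × 100 ≈ 0.011 per cent. These values agree with the corresponding entries and signs reported in Table V.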
5.4. Models considered
We considered five models, namely M1 to M5. The associated coefficients and parameters can be found in Table V. These five models have the same form as in (1).
γ00: Intercept  γ01: Age  γ02: Sex†  γ03: BMI  γ04: FHS‡  γ05: MRFIT‡  γ06: NHANES‡
γ10: Intercept  γ11: Age  γ12: Sex  γ13: BMI  γ14: FHS  γ15: MRFIT  γ16: NHANES
γ20: Intercept  γ21: Age  γ22: Sex  γ23: BMI  γ24: FHS  γ25: MRFIT  γ26: NHANES
β1j
β2j
−3:1 × 10−3 (2:1 × 10−3 ) −1:2 × 10−4∗ (2:0 × 10−5 ) 7:0 × 10−5 (4:2 × 10−4 ) 1:6 × 10−4∗ (6:0 × 10−5 )
— — —
−3:7 × 10−3 (2:1 × 10−3 ) −1:1 × 10−4∗ (2:0 × 10−5 ) 3:3 × 10−4 (4:2 × 10−4 ) 1:6 × 10−4∗ (6:0 × 10−5 )
— — —
— — — — — —
−3:8 × 10−3∗ (2:1 × 10−4 )
— — —
— — —
0:751∗ (0:041)
−4:9 × 10−3∗ (4:0 × 10−4 ) −0:023∗ (0:009) −0:016∗ (1:2 × 10−3 )
— — —
0:759∗ (0:041)
−5:1 × 10−3∗ (4:0 × 10−4 ) −0:026∗ (0:009) −0:016∗ (1:2 × 10−4 )
0:753∗ (0:023)
0:298∗ (0:077) 2:4 × 10−4 (7:2 × 10−4 ) 0:051∗ (0:019) 0:987∗ (0:002) −0:070∗ (0:025) −0:143∗ (0:024) 0:075∗ (0:026)
−6:7 × 10−3∗ (2:2 × 10−4 ) −0:019∗ (4:7 × 10−4 ) −0:013∗ (6:7 × 10−4 )
0:266∗ (0:074) 1:4 × 10−3∗ (7:0 × 10−4 ) −0:036∗ (0:017) 0:986∗ (0:002) — — —
0:265∗ (0:072) 0:002∗ (6:8 × 10−4 ) −0:040∗ (0:016) 0:985∗ (0:002) — — —
Model 4 (M4)
— — —
−2:6 × 10−3 (2:1 × 10−3 ) −1:1 × 10−4∗ (2:0 × 10−5 ) 0:001∗ (4:6 × 10−4 ) 1:8 × 10−4∗ (6:0 × 10−5 )
0:037∗ (0:010) 0:010(9:9 × 10−3 )
0:786∗ (0:042)
−5:3 × 10−3∗ (4:1 × 10−4 ) −0:051∗ (0:010) −0:016∗ (1:2 × 10−3 ) −0:051∗ (0:010)
0:268∗ (0:077) 6:5 × 10−4 (7:2 × 10−4 ) 0:071∗ (0:020) 0:987∗ (0:002) −0:038(0:026) −0:163∗ (0:026) 0:051(0:027)
Model 3 (M3)
Model 2 (M2)
covariate (Z)
Model 1 (M1)
Estimated fixed coefficients (SE) for
Baseline
β0j
0:011∗ (2:9 × 10−3 ) −1:2 × 10−4∗ (2:0 × 10−5 ) −7:0 × 10−5 (4:6 × 10−4 ) 1:5 × 10−4∗ (6:0 × 10−5 ) −0:012∗ (0:002) 9:5 × 10−3∗ (2:5 × 10−3 ) −0:020∗ (0:004)
0:189∗ (0:041)
−0:016(0:023)
0:071∗ (0:021)
0:657∗ (0:045)
−5:1 × 10−3∗ (4:1 × 10−4 ) −0:031∗ (0:010) −0:016∗ (1:2 × 10−3 )
0:337∗ (0:077) 4:3 × 10−4 (7:2 × 10−4 ) 0:059∗ (0:020) 0:987∗ (0:002) −0:101∗ (0:028) −0:090∗ (0:028) 0:002∗ (0:028)
Model 5 (M5)
Table V. Results of hierarchical linear model fitting with 75 351 data units from 14 257 subjects; the TMFS subjects are the referent.
0.824 4.26 0.058% (+; +) 0.048% (+,+) —
0.824 4.26 0.011% (+,−) —
0.824 4.27 —
<0.0001 <0.0001 —
<0.0001 —
—
260,780
1:161∗ (6:7 × 10−3 ) 0.000(0.000) 0:149∗ (2:3 × 10−3 ) 1:9 × 10−4∗ (0:000) 0.000(0.000) 0.000(0.000) −0:005∗ (1:0 × 10−4 )
260,904
1:163∗ (6:7 × 10−3 ) 0.000(0.000) 0:149∗ (2:3 × 10−3 ) 1:9 × 10−4∗ (0:000) 0.000(0.000) 0.000(0.000) −0:005∗ (1:0 × 10−4 )
260,932
1:163∗ (6:7 × 10−3 ) 0.000(0.000) 0:150∗ (2:3 × 10−3 ) 2:0 × 10−4∗ (0:000) 0.000(0.000) 0.000(0.000) −0:005∗ (1:0 × 10−4 )
∗ p-value < 0.05.
† Sex = 1 for male, 0 for female.
‡ FHS = 1 if subjects belong to this study, 0 otherwise; MRFIT and NHANES are similarly defined.
§ Signs of AIC and BIC: the positive sign indicates improvement and the negative sign no improvement.
Goodness of fit: −2 log L; p-value versus model 1, versus model 2, versus model 3, versus model 4; r; MSE; (AIC, BIC)§; %R²↑ versus model 1, versus model 2, versus model 3, versus model 4
Estimated random effects (SE): σ̃²; Σ̃u: var(u0j), var(u1j), var(u2j), cov(u0j, u1j), cov(u0j, u2j), cov(u1j, u2j)
0.101% (+,+) 0.090% (+,+) 0.043% (+,+) —
<0.0001 <0.0001 <0.0001 — 0.824 4.26
260,668
1:159∗ (6:7 × 10−3 ) 0.000(0.000) 0:150∗ (2:3 × 10−3 ) 2:0 × 10−4∗ (0:000) 0.000(0.000) 0.000(0.000) −0:005∗ (1:0 × 10−4 )
0.194% 0.183% 0.136% 0.093%
(+,+) (+,+) (+,+) (+,+)
<0.0001 <0.0001 <0.0001 <0.0001 0.825 4.25
260,426
1:155∗ (6:6 × 10−3 ) 0.000(0.000) 0:149∗ (2:3 × 10−3 ) 2:0 × 10−4∗ (0:000) 0.000(0.000) 0.000(0.000) −0:005∗ (1:0 × 10−4 )
Figure 3. Diagram of model M2 (second-level covariates Age, Sex and baseline BMI acting on the intercept β0j, slope β1j and quadratic term β2j through the fixed parameters γ00–γ23, with second-level residuals u0j, u1j, u2j and first-level error eij). Thick objects represent the first-level relationships. Thin objects represent the second-level relationships. Arrows with an open head are associated with fixed parameters. Arrows with a closed head are associated with random parameters. Solid lines are associated with coefficients. Dotted lines are associated with errors. The circle with shadow represents the dependent variable.
However, they differ in modelling the random coefficients – see (2). For example, the first two models, M1 and M2, are

M1:
β0j = γ00 + γ01 Agej + γ02 Sexj + γ03 BMIj + u0j
β1j = γ10 + γ11 Agej + γ12 Sexj + γ13 BMIj + u1j
β2j = γ20 + u2j

M2:
β0j = γ00 + γ01 Agej + γ02 Sexj + γ03 BMIj + u0j
β1j = γ10 + γ11 Agej + γ12 Sexj + γ13 BMIj + u1j
β2j = γ20 + γ21 Agej + γ22 Sexj + γ23 BMIj + u2j
where Sex = 1 for males and 0 for females, and BMI represents the baseline BMI. These first two models differ only in the modelling of the random coefficient for the quadratic term for time and do not control for the contributions of the different studies, so that predictions of BMI can be based on these models. Figure 3 shows a diagram depicting the paths of effects of the variables in model M2. On the other hand, the other three models, M3 to M5, are concerned with possible heterogeneity of the growth curves over the different studies, using the study indicators as second-level covariates and taking the TMFS study subjects as the referent. They can be written similarly, based on the coefficients specified in Table V. Specifically, the first-level models of M1 and M2 allowed BMI to be a linear and a quadratic function of time, respectively. In both models M1 and M2, the second-level models for the intercept, the linear time term and, for model M2, the quadratic time term included age, sex and initial BMI. In models M3, M4 and M5, the first-level model allowed BMI to be a quadratic function of time. In model M3, the second-level model allowed the intercept to be a function of study site in addition to age, sex and initial BMI. Model M4 allowed the intercept and the linear time term to be functions of study site in addition to age, sex and initial BMI. Model M5 allowed the intercept and the linear and quadratic time terms to be functions of study site in addition to age, sex and initial BMI. In short, models M3–M5 can be used for testing the significance of effect modification (or, equivalently, heterogeneity) of the BMI growth trajectories by study. That is, these models test whether the trajectories are the same across studies.
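For readers who want to see how such a specification is coded, the following PROC MIXED sketch shows how a model of the M2 form could be set up; it mirrors the SAS template given in the Appendix. The data set and variable names (bmidata, id, bmi, age, sex, bmi0, time, timesq) are hypothetical, and timesq is assumed to have been created as time squared in a prior data step; this is a minimal sketch rather than the exact code used to produce Table V.

proc mixed data=bmidata method=ml covtest;
  class id;
  /* fixed part: baseline covariates for the intercept, the linear slope and the quadratic term */
  model bmi = age sex bmi0
              time time*age time*sex time*bmi0
              timesq timesq*age timesq*sex timesq*bmi0 / solution;
  /* random part: subject-specific intercept, slope and quadratic coefficient with unstructured covariance */
  random intercept time timesq / subject=id type=un;
run;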
6. RESULTS
6.1. Main results
Table V shows the results of fitting models M1 to M5 for the BMI growth curve over time, using all 14 375 available second-level subjects with 75 351 first-level data units. M2 fits significantly better (p < 0.0001) than M1, with a positive AIC (though a negative BIC), although the difference is trivial in terms of the estimated coefficients, %R²↑ and the correlation r between predicted and observed values. This implies that the joint effect of age (γ21), sex (γ22) and initial baseline BMI (γ23) on the quadratic term β2j is small in magnitude but significant (the small effect size, however, may not be biologically important). The predicted BMI (Ŷ) at time t (that is, t years from baseline) based on model M2 can be written

Ŷ(t) = 0.266 + 0.0014 Age − 0.036 Sex + 0.985 BMI
      + (0.759 − 0.0051 Age − 0.026 Sex − 0.016 BMI) t
      + (−0.0037 − 0.00011 Age + 0.00033 Sex + 0.00016 BMI) t²
(6)
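As a rough illustration of how (6) is used (a back-of-the-envelope calculation with the rounded coefficients displayed above): for a 25-year-old man with baseline BMI 25 at t = 10 years, the level, slope and quadratic contributions are approximately 24.89, 0.206 × 10 ≈ 2.06 and −0.0021 × 100 ≈ −0.21, giving a predicted BMI of roughly 26.7, close to the value of 26.9 quoted below and in Table VI (which presumably reflects the unrounded estimates).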
For this model, all the estimated coefficients are significant at the 0.05 alpha level, with the exception of the estimated intercept −0.0037 for the quadratic time term (Table V). These results indicate that the BMI growth trajectory depends upon age, sex and initial baseline BMI (see Figure 4 and Table VI). For example, a 25-year-old male with current BMI 25 is expected to have BMI 26.9 in 10 years, while a 45-year-old female with current BMI 25 is expected to have BMI 25.9 in 10 years. For the graphical representation, we chose 27.5 and 32.5 as initial BMI values, representing the midpoints of the overweight and class I obesity ranges [21].
Figure 4. Growth curves for different baseline ages (30, 45 and 60 years) estimated from model M2 in Table V, with BMI plotted against time T (0 to 20 years from baseline); (a) for men with baseline BMI 27.5; (b) for men with baseline BMI 32.5; (c) for women with baseline BMI 27.5; (d) for women with baseline BMI 32.5.
BMI tends to increase over time for younger people with relatively moderate obesity but to decrease for older people regardless of the degree of obesity. The gradients of these changes are inversely related to the baseline BMI, but do not substantially depend on gender, although women tend to gain more and lose less weight than do men. Heterogeneity of these patterns over the different studies was tested in terms of interactions of the study effect with the intercept [40], the slope and the quadratic terms: γk4, γk5 and γk6, for k = 0, 1, 2. The joint effects of these interactions were significant (Table V), implying statistically significant heterogeneity among studies. However, the correlations, r, from models M3 to M5 were the same as those from models M1 and M2. Moreover, the R² improvements are minimal, even relative to model M1, with a maximum improvement of 0.194 per cent (Table V), suggesting that any potential effect modification by individual studies is not substantial. For example, based on M4 (Table V), if the 25-year-old male with current BMI 25 had come from FHS, MRFIT, NHANES-I or TMFS, we would expect him to have BMI 26.4, 27.2, 27.1 or 27, respectively, in 10 years. However, this non-substantial effect modification due to the different studies may not be biologically important. Across models M1 to M5 in Table V, the estimated random effects and their standard errors were almost invariant, yielding a high negative correlation between the linear slope and the quadratic terms, that is, corr(β1j, β2j) = cov(u1j, u2j)/√{var(u1j) var(u2j)} ≈ −0.005/√(0.15 × 0.0002) = −0.913 for all the models.
Table VI. Prediction of BMI over time based on model M2 by gender, baseline BMI and age.

                                        Time
Sex      Baseline BMI   Age      5      10      15      20
Men      25             25     25.9    26.9    27.7    28.4
                         35     25.7    26.2    26.7    26.9
                         45     25.4    25.6    25.7    25.5
                         55     25.1    25.0    24.7    24.0
         30             25     30.5    31.1    31.6    32.1
                         35     30.2    30.5    30.6    30.6
                         45     30.0    29.9    29.6    29.2
                         55     29.7    29.3    28.6    27.7
         35             25     35.1    35.3    35.6    35.8
                         35     34.8    34.7    34.6    34.3
                         45     34.5    34.1    33.5    32.9
                         55     34.2    33.5    32.5    31.4
Women    25             25     26.1    27.1    28.0    28.8
                         35     25.8    26.5    27.0    27.4
                         45     25.6    25.9    26.0    25.9
                         55     25.3    25.3    25.0    24.5
         30             25     30.7    31.4    32.0    32.5
                         35     30.4    30.7    31.0    31.1
                         45     30.1    30.1    30.0    29.6
                         55     29.8    29.5    29.0    28.1
         35             25     35.2    35.6    35.9    36.2
                         35     34.9    35.0    34.9    34.7
                         45     34.7    34.4    33.9    33.3
                         55     34.4    33.8    32.9    31.8
Prediction error∗              1.95    2.71    3.29    3.63

∗ Square root of the total variance σ̃² + (1, t, t²) Σ̃u (1, t, t²)′.
Figure 5 shows a pictorial description of the relationship between the two terms u1j and u2j, estimated by the empirical Bayesian method (for example, reference [41]) based on model M2. It indicates that the strong compensation between these two terms prevents a sudden surge or sudden drop in the prediction of future BMI. The precision of prediction was measured by the total estimated variance, that is,

σ̃² + (1, t, t²) Σ̃u (1, t, t²)′    (7)

which depends on the time t; the square root of this quantity is the prediction error. Figure 6 shows that the magnitude of the total estimated variance from model M2 increases over time.
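As a rough numerical check of (7) (a hand calculation using the rounded random-effect estimates from Table V, namely σ̃² ≈ 1.16, var(u1j) ≈ 0.149, var(u2j) ≈ 0.0002 and cov(u1j, u2j) ≈ −0.005, with the remaining elements essentially zero): at t = 5 the total variance is approximately 1.16 + 25 × 0.149 + 625 × 0.0002 + 2 × 125 × (−0.005) ≈ 3.76, so the prediction error is √3.76 ≈ 1.94, which agrees with the value of 1.95 in Table VI up to rounding.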
Figure 5. Scatter plot of the estimated u1j (slope residuals) and u2j (quadratic term residuals) obtained by the empirical Bayesian method based on model M2 in Table V.
Figure 6. Estimated total variance of prediction over time (T) based on model M2.
6.2. Analysis with smoking data
Using the 10 793 subjects for whom smoking status was available (Table IV), we tested the significance of smoking effects on the growth curve. The results are presented in Table VII. Models SMK2 and SMK4 indicate that smoking status has significant effects on the intercept β0j, which implies that 'ever' smokers tend to be heavier at baseline. However, the effects of smoking status on the slope β1j and the quadratic term β2j in these two models were not significant when tested marginally (see Table VII) or jointly (p = 0.607 in SMK2 and p = 0.333 in SMK4). Thus, smoking status did not significantly affect the estimated growth curve trajectories. The models SMK1 and SMK3 served as a kind of sensitivity analysis with respect to models M1 and M2, respectively, because the former fit the subset of subjects that the latter fitted, with the same sets of variables. There is a high degree of similarity in the coefficients between SMK1 and M1 and between SMK3 and M2 (see Tables V and VII). The signs, or directions, of the estimated coefficients are the same. More importantly, for this subset of subjects with non-missing smoking status, the correlations r between observed values and predicted values from M1 and M2 were both 0.815. This shows that models M1 and M2 fit this subset essentially as well as SMK1 and SMK2, respectively, both of which produced a correlation of 0.816 (Table VII). Therefore, models M1 and M2 seem well supported for this subset. This is particularly so because SMK1 and SMK2 fit the subset better than M1 and M2, respectively, as they should, but the difference in fit is negligible.
7. DISCUSSION
This paper uses HLM to estimate the expected changes in BMI over time in overweight persons. As mentioned before, this approach models different patterns, or shapes, of growth trajectories among individuals in terms of their characteristics, which results in modelling both within- and between-individual variation. In this sense, estimation through HLM has an advantage over the OLS approach, because the latter does not model the between-individual variation of growth trajectories. The OLS estimation approach may therefore yield inappropriately small standard errors for the estimates of the coefficients γ00, γ10 and γ20, especially when the number of second-level units is large [7]. Indeed, (inappropriately) smaller standard errors of the estimates of the coefficients γ10 and γ20 resulted from OLS approaches to models M1 to M5 and SMK1 to SMK4 in the present study (data not shown) – this illustrates the importance of using HLM when 'between-individual variations' are present. In addition, using OLS to estimate the coefficients from the pooled data would be inappropriate because OLS assumes that the errors are independent with mean zero and common variance, whereas the errors from the model derived from the level-1 and level-2 analyses are not independent and their variance is not necessarily homogeneous. Furthermore, the deviance, that is, minus twice the log-likelihood difference, of the OLS fits from the HLM fits was no less than 56 000 (that is, no less than a 20 per cent improvement in log-likelihood) over all models M1 to M5 and SMK1 to SMK4, strongly supporting the randomness of the coefficients β0j, β1j and β2j. Further advantages of HLM over OLS, when the former is appropriate, are discussed by Pollack [42], and general aspects of the application of HLM are extensively discussed by Kreft [43].
Baseline covariate (Z)
γ00: Intercept  γ01: Age  γ02: Sex†  γ03: BMI  γ04: Smoking‡
γ10: Intercept  γ11: Age  γ12: Sex  γ13: BMI  γ14: Smoking
γ20: Intercept  γ21: Age  γ22: Sex  γ23: BMI  γ24: Smoking
β0j
β1j
β2j
0:640∗ (0:027)
— −3:6 × 10
(2:2 × 10 — — — —
−3∗
−4
)
−4
(3:1 × 10 ) — — — −4:0 × 10−5∗ (4:2 × 10−4 )
−3:6 × 10
−3∗
−6:3 × 10−3∗ (2:8 × 10−4 ) −0:010(5:6 × 10−3 ) −0:010∗ (7:6 × 10−4 ) −3:9 × 10−3 (9:2 × 10−3 )
0:635∗ (0:026)
−6:3 × 10−3∗ (2:8 × 10−4 ) −0:011∗ (5:3 × 10−3 ) −0:010∗ (7:6 × 10−4 )
0:312 (0:086) 2:6 × 10−3∗ (9:0 × 10−4 ) −0:019(0:020) 0:981∗ (2:5 × 10−3 ) 0:046∗ (0:017)
∗
Model 2 (SMK2)
0:359 (0:084) 2:1 × 10−3∗ (8:9 × 10−4 ) −7:7 × 10−3 (0:019) 0:980∗ (2:5 × 10−3 ) —
∗
Model 1 (SMK1)
—
3:4 × 10 (2:2 × 10−3 ) −1:4 × 10−4∗ (2:0 × 10−5 ) −1:8 × 10−4 (4:5 × 10−4 ) −2:0 × 10−5 (7:0 × 10−5 )
−3
—
0:520∗ (0:048)
−3:9 × 10−3∗ (5:0 × 10−4 ) −0:011(0:010) −0:010∗ (1:4 × 10−3 )
0:417 (0:088) 7:1 × 10−4 (9:3 × 10−3 ) −6:4 × 10−3 (0:020) 0:980∗ (2:6 × 10−3 ) —
∗
Model 3 (SMK3)
Estimated xed coecients (SE) for
3:8 × 10−3 (2:3 × 10−3 ) −1:5 × 10−4∗ (2:0 × 10−5 ) 2:0 × 10−4∗ (4:8 × 10−4 ) −2:0 × 10−5 (7:0 × 10−5 ) −5:0 × 10−4 (4:5 × 10−4 )
0:516∗ (0:049)
−3:9 × 10−3∗ (5:1 × 10−4 ) −0:012(0:011) −0:010∗ (1:4 × 10−3 ) −4:1 × 10−3 (9:5 × 10−3 )
0:379∗ (0:090) 9:7 × 10−4 (9:4 × 10−4 ) −0:016(0:020) 0:981∗ (2:6 × 10−3 ) 0:041∗ (0:017)
Model 4 (SMK4)
Table VII. Results of hierarchical linear model fitting with 68 830 data units from 10 793 subjects with smoking status.
(AIC,BIC)§ versus model 1 versus model 2 versus model 3
versus model 1 versus model 2 versus model 3
0.014% (+; −) 0.010% (+; +) —
0.816 4.26 0.003% (+; −) —
0.816 4.26 —
‡
†
238,690
1:237∗ (7:3 × 10−3 ) 0.000 (0.000) 1:139∗ (2:4 × 10−4 ) 1:8 × 10−4∗ (0:000) 0.000 (0.000) 0.000 (0.000) −4:7 × 10−3∗ (1:0 × 10−4 )
<0.0001 — — 0.816 4.26
238,715
1:237∗ (7:3 × 10−3 ) 0.000 (0.000) 1:139∗ (2:4 × 10−4 ) 1:8 × 10−4∗ (0:000) 0.000 (0.000) 0.000 (0.000) −4:7 × 10−3∗ (1:0 × 10−4 )
0.046 —
—
238,723
1:237∗ (7:3 × 10−3 ) 0.000 (0.000) 1:139∗ (2:4 × 10−4 ) 1:8 × 10−4∗ (0:000) 0.000 (0.000) 0.000 (0.000) −4:7 × 10−3∗ (1:0 × 10−4 )
∗ p-value < 0.05.
† Sex = 1 for male, 0 for female.
‡ Smoking = 1 for 'ever' smoker, 0 for 'never' smoker.
§ Signs of AIC and BIC: the positive sign indicates improvement and the negative sign no improvement.
r; MSE; %R²↑
Goodness of fit: −2 log L; p-value
Estimated random effects (SE): σ̃²; Σ̃u: var(u0j), var(u1j), var(u2j), cov(u0j, u1j), cov(u0j, u2j), cov(u1j, u2j)
0.018% (+; −) 0.015% (+; −) 0.004% (+; −)
<0.0001 <0.0001 0.019 0.815 4.28
238,680
1:237∗ (7:3 × 10−3 ) 0.000 (0.000) 1:139∗ (2:4 × 10−4 ) 1:8 × 10−4∗ (0:000) 0.000 (0.000) 0.000 (0.000) −4:7 × 10−3∗ (1:0 × 10−4 )
Goldstein (reference [7, Chapter 2]) briefly and concisely discussed other estimation procedures applicable to multilevel-structured data, such as generalized estimating equations (GEE) [44] and Bayesian approaches such as Markov chain Monte Carlo (MCMC) [45]. The methods used to develop the growth curves herein have several strengths that should make the results of practical utility. They were derived from four large-sample longitudinal studies. Such pooling of large databases increased the precision of the parameter estimates. These studies employed measured rather than self-reported height and weight, thus eliminating possible reporting biases. An advantage of the pooling in this particular analysis is that it allowed us to test the among-study variation in within-subject trajectories after accounting for covariates. Although there was some heterogeneity of effect among the studies, the effect was minor and did not meaningfully affect the estimated growth curves (Table V). This implies that the performance of prediction by models M1 and M2, without knowledge of the source of particular observations, is almost the same as that by models M3 to M5 with such knowledge. This is practically important because it suggests a good degree of generalizability of our findings. Nevertheless, the recent secular trend of increasing obesity [22] may introduce a confounding effect on the estimated BMI growth trajectories. That is, the estimated trajectories might reflect the additive combination of a natural trajectory plus secular trends. On the other hand, there were arguably minimal secular trends during the period over which these data were collected and, as previously mentioned, the HLM results did not substantially differ by study. For these reasons, confounding due to possible secular trends is unlikely to be substantial in the present analysis. A desirable feature of the present analytic approach is that the decline of BMI in the older subjects (baseline age = 60, Figure 4) cannot be attributed to a simple 'survivor effect'. A survivor effect could occur if older leaner people tended to live longer than older heavier people. If individuals' weights remained relatively constant, this might eventually lead to lower mean BMIs in a cohort of older people as they aged. If the analysis had been carried out in a cross-sectional manner, which is essentially the same as the 'curve of averages' (see reference [3]), such reasoning might be valid. However, because the HLM approach adopted herein is essentially the 'average of growth curves' of individuals over the years, it bypasses the potential survivor effect. Figure 7 illustrates this point with three hypothetical subjects. The growth curve with the dashed line, based on a cross-sectional analysis, shows a decline in growth because subjects 2 and 3 died before the ages of 70 and 60 years, respectively, the result of which is a survivor effect on growth. On the other hand, the growth curve with the solid line does not show such a decline, because it is based on the average of the individual growth curves. Thus, our results cannot be due to artefacts from survivor effects. Hierarchical linear modelling is especially well suited to meta-analysis (see references [46, 47] for tutorials). In typical practice, study effects have either been ignored (a practice that can yield severely biased estimates and standard errors [48]) or considered to be fixed.
However, study effects should be treated as random whenever generalization of the meta-analytic results is of concern, given the potential for methodological variability among studies (for example, in sample size, demographic variables, predictor variables, experimental designs and measurement methods). The application and extension of HLM to meta-analysis is straightforward and can be performed with any of the software introduced in Section 2.2. The main findings from this study can be summarized as follows: BMI tends to increase with time for younger people with relatively moderate obesity, while it tends to decrease for older people regardless of the degree of obesity. These results are consistent with previous findings [49].
Figure 7. Hypothetical growth curves for three subjects plotted against time/age (40–70 years). The dashed line with '♦' markers represents the growth curve based on a cross-sectional analysis ('curve of averages'). The solid line with '×' markers represents the growth curve based on a longitudinal analysis with HLM ('average curve').
This implies that it might be beneficial to try to prevent obesity in younger adults, because younger overweight/obese persons have a tendency to gain weight over time regardless of their 'current' BMI, as shown in Figure 4. The rates of these changes are inversely related to the baseline BMI, and the patterns do not substantially depend on gender. In addition, smoking status does not have a significant effect on the patterns (Table VII). A practical implication of the growth curves developed herein is that they provide control data against which the long-term efficacy of obesity interventions could be compared. For example, several studies have conducted follow-up assessments as much as 5 or 10 years after treatment [50, 51]. Although these studies successfully followed patients' weight changes over time, there was no control group against which their progress could be compared. The growth curves generated herein fill this void. These growth curves could also be used to gauge an individual patient's progress. For example, suppose that a 25-year-old male subject with a BMI of 30 (for example, 203 pounds at a height of 5′9″) lost 21 pounds through an intervention and maintained that loss for five years. This corresponds to a decrease in his BMI to 27. According to prediction equation (6), without intervention the subject would have been expected to gain 0.8 BMI units (5.6 pounds) in five years, with a prediction error of ±1.95 BMI units calculated from equation (7). Therefore, his weight management could be considered quite successful, because his loss of 3 BMI units in five years is approximately 2 SD below the predicted gain of 0.8 BMI units.
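A rough check of the arithmetic in this example, using the standard conversion BMI = 703 × weight (lb)/height (in)² (the conversion itself is an assumption, as it is not stated in the text): 703 × 203/69² ≈ 30.0 and 703 × 182/69² ≈ 26.9, so the 21-pound loss does correspond to roughly 3 BMI units; and the gap between the observed change (−3 BMI units) and the predicted change (+0.8 BMI units) is 3.8 units, or 3.8/1.95 ≈ 1.9 prediction-error units, consistent with the 'approximately 2 SD' statement above.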
In sum, this paper demonstrates the utility of HLM for developing longitudinal growth curves. This was accomplished by applying random coefficients regression to estimate natural changes in relative body weight over the lifespan, which provides prediction equations for long-term changes in BMI that can be used as pseudo-controls. It is hoped that this will be useful to other investigators.
APPENDIX: COMPARISON AMONG SOFTWARE CODES AND THEIR OUTPUTS

A.1. MLA code and output

MLA code

/TITLE
 Mla Growth Curve Example
/DATA
 file= example.dat
 vars= 6       % Six Data Fields in the File
 id2= 1        % The level-2 Variable is the First Field
/MODEL
 % We Model the First-Level and Second-Level Effects
 % g's are the gamma coefficients for the Level-2 equations
 % u's are the Level-2 residuals
 % e's are the Level-1 residuals
 % v's are the ordinal positions of variables in a file. In our example:
 % v3 is a contrast code for Group Members: -1 for Group 1 and 1 for Group 2
 % v4 is t(ime); the linear term
 % v5 is time-squared; the quadratic term
 % v6 is the dependent variable
 b0 = g00 + g01*v3 + u0        % Level-2 equation for Regression constant
 b1 = g10 + g11*v3 + u1        % Level-2 equation for Linear Effect
 b2 = g20 + g21*v3 + u2        % Level-2 equation for Quadratic Effect
 v6 = b0 + b1*v4 + b2*v5 + e1  % Level-1 Equation
/PRINT
 post= all
 rand= all
 olsq= yes
 res= all
/END

MLA output

Full information maximum likelihood estimates (BFGS)

Fixed parameters
Label    Estimate    SE          T        Prob(T)
G00      5.106271    0.107180    47.64    0.0000
G01      0.195673    0.107180     1.83    0.0679
G10      1.949746    0.044864    43.46    0.0000
G11      0.208803    0.044864     4.65    0.0000
G20      0.115567    0.009241    12.51    0.0000
G21      0.064794    0.009241     7.01    0.0000
Random parameters
Label    Estimate     SE          T        Prob(T)
U0*U0    0.003430     0.095007     0.03    0.9712
U1*U0    0.001080     0.037456     0.02    0.9770
U1*U1    0.000873     0.016638     0.05    0.9581
U2*U0   -0.002525     0.005695    -0.44    0.6575
U2*U1   -0.000441     0.002538    -0.17    0.8620
U2*U2    0.002095     0.000663     3.16    0.0015
E        0.246646     0.024070    10.25    0.0000

Warning(1): possible nearly-singular covariance matrix
Determinant covariance matrix = 1.9995e-19
Conditional intra-class correlation = 0.00/(0.25+0.00) = 0.0137
# iterations = 48
-2*Log(L) = 594.184819
A.2. SAS code and output

SAS code

proc mixed data=growth.example covtest method=ml;
  class id;
  model outcome = contrast time time_sq t1_gp t2_gp / solution;
  random intercept time time_sq / subject=id type=un;
run;
SAS output

Covariance Parameter Estimates
Cov Parm    Subject    Estimate    Standard Error    Z        Pr Z
UN(1,1)     ID         2.24E-18    .                 .        .
UN(2,1)     ID         0.003075    0.004370           0.70    0.4817
UN(2,2)     ID         4.55E-70    .                 .        .
UN(3,1)     ID         -0.00336    0.005426          -0.62    0.5362
UN(3,2)     ID         -0.00013    0.002215          -0.06    0.9535
UN(3,3)     ID         0.002057    0.000651           3.16    0.0008
Residual               0.2466      0.02239           11.01    <.0001
Fit Statistics
-2 Log Likelihood            594.1
AIC (smaller is better)      616.1
AICC (smaller is better)     617.0
BIC (smaller is better)      631.5
Null Model Likelihood Ratio Test
DF    Chi-Square    Pr > ChiSq
 4        748.50        <.0001
Solution for Fixed Effects
Effect       Estimate    Standard Error    DF     t Value    Pr > |t|
Intercept    5.1063      0.1066             28    47.89      <.0001
CONTRAST     0.1957      0.1066            210     1.84      0.0679
TIME         1.9497      0.04453            28    43.78      <.0001
TIME_SQ      0.1156      0.009173           28    12.60      <.0001
T1_GP        0.2088      0.04453           210     4.69      <.0001
T2_GP        0.06479     0.009173          210     7.06      <.0001
Type 3 Tests of Fixed Effects
Effect      Num DF    Den DF    F Value    Pr > F
CONTRAST         1       210       3.37    0.0679
TIME             1        28    1917.01    <.0001
TIME_SQ          1        28     158.72    <.0001
T1_GP            1       210      21.99    <.0001
T2_GP            1       210      49.89    <.0001
A.3. MLwiN output

outcomeij ∼ N(XB, Ω)
outcomeij = β0ij intercept + β1j tmij + β2j tm2ij + 0.1956726(0.1079461) contrastj
            + 0.2088034(0.0450828) t1_gpij + 0.0647941(0.0089599) t2_gpij
β0ij = 5.1062620(0.1079461) + u0j + e0ij
β1j = 1.9497500(0.0450828) + u1j
β2j = 0.1155664(0.0089599) + u2j

[u0j, u1j, u2j]′ ∼ N(0, Ωu):
Ωu = [ 0.0000000(0.0000000)                                              ]
     [ 0.0000000(0.0000000)  0.0000000(0.0000000)                        ]
     [ 0.0000000(0.0000000)  0.0000000(0.0000000)  0.0019297(0.0005008)  ]

[e0ij] ∼ N(0, Ωe): Ωe = [0.2527032(0.0217494)]

-2*log(like) = 596.8070000

(Note: the numbers in parentheses are the standard errors of the preceding estimates; tm = time; tm2 = time × time; t1_gp = time × contrast; t2_gp = time² × contrast; one = 1.)
ACKNOWLEDGEMENTS
Supported in part by National Institute of Health grants R01DK51716, P01DK42618, P30DK26687 and K08MH01530. We thank Drs Sander Greenland, David Rindskopf, Gary Gadbury, Brent J. Shelton, Stanley Heshka and David Williamson for advice on this study and Ms Adwoa Dadzie for her help. REFERENCES 1. Aber MS, McArdle JJ. Latent growth curve approach to modelling the development of competence. In Criteria for Competence: Controversies in the Conceptualization and Assessment of Children’s Abilities, Chandler M, Chapman M (eds). Lawrence Erlbaum: Hillsdale NJ, 1991; 231–258. 2. Klein DN, Norden KA, Ferro T, Leader JB, Kasch KL, Klein LM, Schwartz JE, Aronson TA. Thirty-month naturalistic follow-up study of early-onset dysthymic disorder: course, diagnostic stability, and prediction of outcome. Journal of Abnormal Psychology 1998; 107:338–348. 3. Allison DB, Paultre F, Maggio C, Mezzitis N, Pi-Sunyer FX. The use of areas under curves in diabetes research. Diabetes Care 1995; 18:245–250. 4. McArdle JJ, Allison DB. Regression change models with incomplete repeated measures data in obesity research. In Obesity Treatment: Establishing Goals, Improving Outcomes, and Reviewing the Research Agenda, Allison DB, Pi-Sunyer FX (eds). NATO ASI series, vol. 278, Plenum Press: New York, 1995; 53– 64. 5. Byrk AS, Raudenbush SW. Hierarchical Linear Models: Applications and Data Analysis Methods. SAGE Publications: Newbury Park, CA, 1992. 6. Longford NT. Random Coecient Models. Oxford University Press: New York, 1993. 7. Goldstein H. Multilevel Statistical Models. 2nd edn. Wiley: New York, 1995. 8. Hox JJ. Applied Multilevel Analysis. TT-Publikaties: Amsterdam, 1995. 9. Little TD, Schnabel KU, Baumert J (eds). Modeling Longitudinal and Multilevel Data: Practical Issues, Applied Approaches, and Specic Examples. Erlbaum: Mahwah, NJ, 2000. 10. Snijders TAB, Bosker RJ. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. SAGE: London, 1999. 11. Kreft IGG, de Leeuw J. Introducing Multilevel Modeling. SAGE: Newbury Park, CA, 1998. 12. Sullivan LM, Dukes KA, Losina E. Tutorial in biostatistics: An introduction to hierarchical linear modelling. Statistics in Medicine 1999; 18:855–888. 13. Bushing F, Meijer E, van der Leeden R. MLA J Multilevel Analysis for Two Level Data, Version 3.2. Leiden University, Faculty of Social and Behavioural Sciences: Leiden, 1997. 14. Akaike H. Information theory and an extension of the entropy maximization principle. In Proceedings of the Second International Symposium on Information Theory, Petrov B, Cask R (eds). Akademica: Kiado, 1973. 15. Schwartz G. Estimating the dimension of a model. Annals of Statistics 1978; 6:461– 464. 16. Kreft IGG, de Leeuw J, van der Leen R. Review of ve multilevel analysis programs: BMDP-5V, GENMOD, HLM, ML3, and VARCL. American Statistician 1994; 48:324 –335. 17. Singer JD. Using SAS PROC MIXED to t multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics 1998; 23:323–355. 18. Bryk AS, Raudenbush SW, Congdon RT. HLM Hierarchical Linear and Nonlinear Modeling with the HLM=2L and HLM=3L Programs. Scientic Software International Inc.: Chicago, IL, 1996. 19. Goldstein H, Rasbash J, Plewis I, Draper D, Browne W, Yang M, Woodhouse G, Healy M. A User’s Guide to MLwin; version 1.0. Multilevel Models Projects, Institute of Education: London, 1998. 20. Pi-Sunyer FX. Medical hazards of obesity. Annals of Internal Medicine 1993; 119:655– 660. 21. 
National Heart, Lung, and Blood Institute. Clinical guidelines on the identication, evaluation, and treatment of overweight and obesity in adults: the evidence report. National Heart, Lung, and Blood Institute: Bethesda, MD, 1998. 22. Kuczmarski RJ, Flegal KM, Campbell SM, Johnson CL. Increasing prevalence of overweight among US adults. The National Health and Examination Surveys. Journal of the American Medical Association 1994; 272: 205–211. 23. Michels KB, Greenland S, Rosner BA. Does body mass index adequately capture the relation of body composition and body size to health outcomes? American Journal of Epidemiology 1998; 147:167–172. 24. Calle EE, Thun MJ, Petrelli JM, Rodriguez C, Heath CW Jr. Body mass index and mortality in a prospective cohort of US adults. New England Journal of Medicine 1999; 341:1097–1105. 25. Bartlett SJ, Faith MS, Fontaine KR, Cheskin LJ, Allison DB. Is the prevalence of successful weight loss and maintenance higher in the general community than the research clinic? Obesity Research 1999; 7:407– 413. 26. Brownell KD, Jeerey RW. Improving long-term weight loss: pushing the limits of treatment. Behavior Therapy 1987; 18:353–374. 27. Poston WS, Foreyt JP, Borrell L, Haddock CK. Challenges in obesity management. Southern Medical Journal 1998; 91:710 –720.
28. Plankey MW, Stevens J, Flegal KM. Prediction equations do not eliminate systematic error in self-reported body mass index. Obesity Research 1997; 5:308–314. 29. Young CH, Savola KL, Phelps E. Inventory of Longitudinal Studies in the Social Sciences. SAGE Publications: Newbury Park, CA, 1991. 30. Dawber TR. The Framingham Study. Harvard University Press: Cambridge, 1980. 31. Benfari RC. The multiple risk factor intervention trial (MRFIT): III. The model for intervention. Preventive Medicine 1981; 10:426 – 442. 32. MRFIT Research Group. Multiple Risk Factor Intervention Trial, risk factor changes and mortality results. Journal of the American Medical Association 1981; 248:1465–1477. 33. National Center for Health Statistic. Health and Nutrition Examination Survey I, 1971–1975 (computer le), 2nd ICPSR edn. Ann Arbor, MI: Inter-University Consortium for Political and Social Research (producer and distributor), 1983. 34. National Center for Health Statistic. Health and Nutrition Examination Survey I, Epidemiologic Follow-up Study 1982–1984 (computer le), 2nd release. Hyattsville, MD: Department of Health and Human Services, National Center for Health Statistic (producer), 1992. Ann Arbor, MI: Inter-University Consortium for Political and Social Research (producer and distributor), 1992. 35. Tecumseh Management Committee. Tecumseh Mortality Follow-up Study, 1978 (computer le), Ann Arbor, MI: Tecumseh Management Committee (producer), 1992. Ann Arbor, MI: Inter-University Consortium for Political and social research (distributor), 1993. 36. Korn EL, Graubard BI, Midthune D. Time-to-event analysis of longitudinal follow-up of a survey: choice of the time-scale. American Journal of Epidemiology 1997; 145:72–80. 37. Goldstein H. Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika 1986; 73:43–56. 38. Everitt BS. The Cambridge Dictionary of Statistics. Cambridge Press: Cambridge, U.K, 1998. 39. Clayton D, Hills M. Statistical Models in Epidemiology. Oxford University Press: New York, 1993. 40. Miyazaki Y, Raudenbush SW. Tests for linkage of multiple cohorts in an accelerated longitudinal design. Psychological Methods 2000; 5:44 – 63. 41. Braun HI. Empirical Bayes method: a tool for exploratory analysis. In Multilevel Analysis of Educational Data, Bock RD (ed.). Academic Press: New York, 1989; 19–55. 42. Pollack BN. Hierarchical linear modelling and the ‘unit of analysis’ problem: a solution for analyzing responses of intact group members. Group Dynamics: Theory, Research, and Practice 1998; 2;299–312. 43. Kreft IGG (ed.). Hierarchical linear models: Problems and prospects. Journal of Educational and Behavioral Statistics 1995; 20(2) (special issue). 44. Liang K, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73:45–51. 45. Zeger SL, Karim MR. Generalized linear models with random eects; a Gibbs sampling approach. Journal of the American Statistical Association 1991; 86:79–86. 46. Normand SL. Tutorial in Biostatistics. Meta-analysis: Formulating, evaluating, combining, and reporting. Statistics in Medicine 1999; 18;321–359. 47. Cooper HM, Hedges LV (eds). The Handbook of Research Synthesis. Russell Sage Foundation: New York, 1994. 48. St. Pierre NR. Integrating quantitative ndings from multiple studies using mixed model methodology. Journal of Dairy Science 2001; 84:741–745. 49. Williamson DF, Kahn HS, Remington PL, Anda RF. The 10-year incidence of overweight and major weight gain in US adults. 
Archives of Internal Medicine 1990; 150:665 – 672. 50. Bjorvell H, Rossner S. A ten year follow-up of weight change in severely obese subjects treated in a behavioral modication program. International Journal of Obesity 1990; 14 (Suppl. 2):88. 51. Brownell KD. Relapse and the treatment of obesity, In Treatment of the Obese Patient, Wadden TA, VanItallie TB (eds). Guilford Press: New York, 1992; 437– 455.
1.3 Mixed Models
Using the General Linear Mixed Model. A. Cnaan, N. M. Laird and P. Slasor.
TUTORIAL IN BIOSTATISTICS
Modelling covariance structure in the analysis of repeated measures data
Ramon C. Littell¹,∗,†, Jane Pendergast² and Ranjini Natarajan³
¹ Department of Statistics, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, Florida 32611, U.S.A.
² Department of Biostatistics, University of Iowa, Iowa City, Iowa 52242, U.S.A.
³ Department of Statistics, Division of Biostatistics, University of Florida, Gainesville, Florida 32611, U.S.A.
SUMMARY
The term 'repeated measures' refers to data with multiple observations on the same sampling unit. In most cases the multiple observations are taken over time, but they could be over space. It is usually plausible to assume that observations on the same unit are correlated. Hence, statistical analysis of repeated measures data must address the issue of covariation between measures on the same unit. Until recently, analysis techniques available in computer software only offered the user limited and inadequate choices. One choice was to ignore covariance structure and make invalid assumptions. Another was to avoid the covariance structure issue by analysing transformed data or making adjustments to otherwise inadequate analyses. Ignoring covariance structure may result in erroneous inference, and avoiding it may result in inefficient inference. Recently available mixed model methodology permits the covariance structure to be incorporated into the statistical model. The MIXED procedure of the SAS® System provides a rich selection of covariance structures through the RANDOM and REPEATED statements. Modelling the covariance structure is a major hurdle in the use of PROC MIXED. However, once the covariance structure is modelled, inference about fixed effects proceeds essentially as when using PROC GLM. An example from the pharmaceutical industry is used to illustrate how to choose a covariance structure. The example also illustrates the effects of the choice of covariance structure on tests and estimates of fixed effects. In many situations, estimates of linear combinations are invariant with respect to covariance structure, yet standard errors of the estimates may still depend on the covariance structure. Copyright © 2000 John Wiley & Sons, Ltd.
1. INTRODUCTION
Statistical linear mixed models state that observed data consist of two parts, fixed effects and random effects. Fixed effects define the expected values of the observations, and random effects define the variances and covariances of the observations.
∗ Correspondence to: Ramon C. Littell, Department of Statistics, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, Florida 32611, U.S.A.
† E-mail: [email protected]
In typical comparative experiments with repeated measures, subjects are randomly assigned to treatment groups, and observations are made at multiple time points on each subject. Basically, there are two fixed effect factors, treatment and time. Random effects result from variation between subjects and from variation within subjects. Measures on the same subject at different times almost always are correlated, with measures taken close together in time being more highly correlated than measures taken far apart in time. Observations on different subjects are often assumed independent, although the validity of this assumption depends on the study design. Mixed linear models are used with repeated measures data to accommodate the fixed effects of treatment and time and the covariation between observations on the same subject at different times. Cnaan et al. [1] extensively discussed the use of the general linear mixed model for analysis of repeated measures and longitudinal data. They presented two example analyses, one using BMDP 5V [2] and the other using PROC MIXED of the SAS System [3]. Although Cnaan et al. discussed statistical analyses in the context of unbalanced data sets, their description of modelling covariance structure also applies to balanced data sets. The objectives of repeated measures studies usually are to make inferences about the expected values of the observations, that is, about the means of the populations from which subjects are sampled. This is done in terms of treatment and time effects in the model. For example, it might be of interest to test or estimate differences between treatment means at particular times, or differences between means at different times for the same treatment. These are inferences about the fixed effects in the model. Implementation of mixed models ordinarily occurs in stages. Different data analysts may use different sequences of stages. Ideally, different data sets would be used to choose model form and to estimate parameters, but this is usually not possible in practice. Here we present the more realistic situation of choosing model form using the data to be analysed. We prefer a four-stage approach, which is similar to the recommendations of others, such as Diggle [4] and Wolfinger [5]. The first stage is to model the mean structure in sufficient generality to ensure unbiasedness of the fixed effect estimates. This usually entails a saturated parameter specification for fixed effects, often in the form of effects for treatment, time, treatment-by-time interaction, and other relevant covariables. The second stage is to specify a model for the covariance structure of the data. This involves modelling variation between subjects, and also covariation between measures at different times on the same subject. In the third stage, generalized least squares methods are used to fit the mean portion of the model. In the fourth stage the fixed effects portion may be made more parsimonious, such as by fitting polynomial curves over time. Then, statistical inferences are drawn based on fitting this final model. In the present paper we illustrate the four-stage process, but the major focus is on the second stage, modelling the covariance structure. If the true underlying covariance structure were known, the generalized least squares fixed effects estimates would be the best linear unbiased estimates (BLUE). When it is unknown, our goal is to estimate it as closely as possible, thus providing more efficient estimates of the fixed effects parameters. The MIXED procedure in the SAS® System [3] provides a rich selection of covariance structures from which to choose.
In addition to selecting a covariance structure, we examine the effects of the choice of covariance structure on tests of fixed effects, on estimates of differences between treatment means, and on the standard errors of the differences between means.

2. EXAMPLE DATA SET
Figure 1. FEV1 repeated measures for each patient.

Table I. REML covariance and correlation estimates for FEV1 repeated measures data (unstructured).

          Time 1   Time 2   Time 3   Time 4   Time 5   Time 6   Time 7   Time 8
Time 1    0.226    0.216    0.211    0.204    0.175    0.163    0.128    0.168
Time 2    0.893    0.259    0.233    0.243    0.220    0.181    0.156    0.195
Time 3    0.880    0.908    0.254    0.252    0.219    0.191    0.168    0.204
Time 4    0.784    0.892    0.915    0.299    0.240    0.204    0.190    0.226
Time 5    0.688    0.807    0.813    0.822    0.286    0.232    0.204    0.247
Time 6    0.675    0.698    0.745    0.735    0.855    0.258    0.214    0.245
Time 7    0.516    0.590    0.643    0.670    0.733    0.812    0.270    0.233
Time 8    0.642    0.701    0.742    0.755    0.845    0.882    0.820    0.299

Variances on diagonal, covariances above diagonal, correlations below diagonal.
A pharmaceutical example experiment will be used to illustrate the methodology. The objectives of the study were to compare the effects of two drugs (A and B) and a placebo (P) on a measure of respiratory ability called FEV1. Twenty-four patients were assigned to each of the three treatment groups, and FEV1 was measured at baseline (immediately prior to administration of the drugs) and at hourly intervals thereafter for eight hours. Data were analysed using PROC MIXED of the SAS System, using baseline FEV1 as a covariable. An SAS data set, named FEV1UN1, contained the data with variables DRUG, PATIENT, HR (hour), BASEFEV1 and FEV1. Data for individual patients are plotted versus HR in Figure 1 for the three treatment groups. The drug curves appear to follow a classic pharmacokinetic pattern and thus might be analysed using a non-linear mean model. However, we restrict our attention to models of the mean function which are linear in the parameters.
Figure 2. FEV1 repeated measures means for each drug.
Estimates of the between-patient variances within each drug group at each hour are printed on the diagonal of the matrix in Table I. It appears from these plots and variance estimates that the variances between patients within drug groups are approximately equal across times. Therefore, an assumption of equal variances seems reasonable. Treatment means are plotted versus HR in Figure 2. The graph shows that the means for the three treatment groups are essentially the same at HR = 0 (baseline). At HR = 1 the mean for drug B is larger than the mean for drug A, and both of the drug means are much larger than the placebo mean. Means for drugs A and B continue to be larger than the placebo means at subsequent hours, but the magnitudes of the differences decrease sharply with time. It is of interest to estimate differences between the treatment group means at various times, and to estimate differences between means for the same treatment at different times. Covariances and correlations are printed above and below the diagonal, respectively, of the matrix in Table I. The correlations between FEV1 at HR = 1 and later times are in the first column of the matrix. Correlations generally decrease from 0.893 between FEV1 at HR = 1 and HR = 2 down to 0.642 between FEV1 at HR = 1 and HR = 8. Similar decreases are found between FEV1 at HR = 2 and later times, between FEV1 at HR = 3 and later times, and so on. In short, correlations between pairs of FEV1 measurements decrease with the number of hours between the times at which the measurements were obtained. This is a common phenomenon with repeated measures data. Moreover, the magnitudes of the correlations between FEV1 repeated measures are similar for pairs of hours with the same interval between them. Scatter plots of FEV1 for each hour versus FEV1 at each other hour are presented in Figure 3. These are similar to the 'draftsman's' plots described by Dawson et al. [6]. The trend of decreasing correlation with increasing interval between measurement times is apparent in the plots. That is, points are more tightly packed in plots for two measures close in time than for measures far apart in time. As a consequence of the patterns of correlations, a standard analysis of variance as prescribed in Milliken and Johnson [7] is likely not appropriate for this data set. Thus, another type of analysis must be used.
Figure 3. Scatter plots of FEV1 repeated measures at each hour versus each other hour.
3. LINEAR MIXED MODEL FOR REPEATED MEASURES
In this section we develop the general linear mixed model to a minimally sufficient level that will allow the reader to begin using PROC MIXED of the SAS System effectively. The development here is consistent with, and somewhat overlaps, that of Cnaan et al. [2], but is needed for completeness. We assume a completely randomized design for patients in g treatment groups, with ni subjects assigned to group i. Thus, we assume data on different subjects are independent. For simplicity, we assume there are t measurements at the same equally spaced times on each subject. We choose to work in this nicely balanced situation so that we can illustrate the basic issues of modelling covariance structure without the complications introduced by unbalanced data. Let Yijk denote the value of the response measured at time k on subject j in group i, i = 1, ..., g, j = 1, ..., ni, and k = 1, ..., t. Throughout this paper we assume all random effects are normally distributed. The fixed effect portion of the general linear mixed model specifies the expected value of Yijk to be E(Yijk) = μijk. The expected value, μijk, usually is modelled as a function of treatment, time and other fixed effects covariates. The random effect portion of the model specifies the covariance structure of the observations. We assume that observations on different subjects are independent, which is legitimate as a result of the completely randomized design. Thus, cov(Yijk, Yi′j′l) = 0 if i ≠ i′ or j ≠ j′. Also, we assume that the variances and covariances of measures on a single subject are the same within each of the groups. However, we allow for the possibility
that variances are not homogeneous at all times, and that covariances between observations at different times on the same subject are not the same at all pairs of times. A general covariance structure is denoted as cov(Yijk, Yijl) = σ_{k,l}, where σ_{k,l} is the covariance between measures at times k and l on the same subject, and σ_{k,k} = σ²_k denotes the variance at time k. This is sometimes called 'unstructured' covariance, because there are no mathematical structural conditions on the variances and covariances. Let Yij = (Yij1, Yij2, ..., Yijt)′ denote the vector of data at times 1, 2, ..., t on subject j in group i. Then, in matrix notation, the model can be written Yij = μij + εij, where μij = (μij1, μij2, ..., μijt)′ is the vector of means and εij = (εij1, εij2, ..., εijt)′ is the vector of errors, respectively, for subject j in group i. Matrix representations of the expectation and variance of Yij are E(Yij) = μij and V(Yij) = Vij, where Vij is the t × t matrix with σ_{k,l} in row k, column l. We assume that Vij is the same for all subjects (that is, for all i and j), but we continue to use the subscripts ij to emphasize that we are referring to the covariance matrix for a single subject. We represent the vector of data for all subjects as Y = (Y′11, ..., Y′1n1, Y′21, ..., Y′2n2, ..., Y′g1, ..., Y′gng)′, and similarly for the vectors of expected values and errors to get E(Y) = μ = (μ′11, ..., μ′1n1, μ′21, ..., μ′2n2, ..., μ′g1, ..., μ′gng)′ and ε = (ε′11, ..., ε′1n1, ε′21, ..., ε′2n2, ..., ε′g1, ..., ε′gng)′. Then we have the model Y = μ + ε
(1)
and V(Y) = V = diag{Vij}, where diag{Vij} refers to a block-diagonal matrix with Vij in each block. A univariate linear mixed model for the FEV1 repeated measures data is

Yijk = μ + β xij + αi + dij + τk + (ατ)ik + eijk
(2)
where μ is a constant common to all observations, β is a fixed coefficient on the covariate xij = BASEFEV1 for patient j in drug group i; αi is a parameter corresponding to drug i; τk is a parameter corresponding to hour k, and (ατ)ik is an interaction parameter corresponding to drug i and hour k; dij is a normally distributed random variable with mean zero and variance σ²_d corresponding to patient j in drug group i, and eijk is a normally distributed random variable with mean zero and variance σ²_e, independent of dij, corresponding to patient j in drug group i at hour k. Then

E(Yijk) = μijk = μ + β xij + αi + τk + (ατ)ik
V(Yijk) = σ²_d + σ²_e
(3)
and cov(Yijk, Yijl) = σ²_d + cov(eijk, eijl). The model (2), written in matrix notation, is

Y = Xβ + ZU + e
(4)
where X is a matrix of known coefficients of the fixed effect parameters μ, β, αi, τk and (ατ)ik; β is the vector of fixed effect parameters; Z is a matrix of coefficients (zeros and ones) of the random patient effects dij; U is the vector of random effects dij, and e is the vector of the errors eijk. In relation to model (1), μ = Xβ and ε = ZU + e. Model (4) for the FEV1 data is a special case of the general linear mixed model

Y = Xβ + ZU + e
(5)
in which no restrictions are necessarily imposed on the structures of G = V(U) and R = V(e). We assume only that U and e are independent, and obtain

V(Y) = ZGZ′ + R
(6)
Equation (6) expresses the structure of V(Y) as a function of G and R. In many repeated measures applications, ZGZ′ represents the between-patient portion of the covariance structure, and R represents the within-patient portion. By way of notation, sub-matrices of X, Z, R and e corresponding to subject j in drug group i will be denoted by Xij, Zij, Rij and eij, respectively. More details on implementation of the model for statistical inference are presented in the Appendix. In order to apply the general linear mixed model (5) using PROC MIXED in the SAS System, the user must specify the three parts of the model: Xβ, ZU and e. Specifying Xβ is done in the same manner as with PROC GLM, and presents no new challenges to PROC MIXED users who are familiar with GLM. However, specifying ZU and e entails defining covariance structures, which may be less familiar concepts. Several covariance structures are discussed in Section 4.
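A practical point before turning to specific structures: PROC MIXED expects the repeated measurements in a 'univariate' layout, one record per patient per hour, which is the form of the data set fev1uni used in Section 5. A minimal sketch of the conversion (not from the paper; the wide data set fev1mult and the variable names fev1h1-fev1h8 are hypothetical):

data fev1uni;
   set fev1mult;                   /* hypothetical wide data set: one record per patient   */
   array f{8} fev1h1-fev1h8;       /* hypothetical names for the eight hourly FEV1 values  */
   do hr = 1 to 8;
      fev1 = f{hr};                /* one output record per patient per hour               */
      output;
   end;
   drop fev1h1-fev1h8;
run;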
4. COVARIANCE STRUCTURES FOR REPEATED MEASURES

Modelling covariance structure refers to representing V(Y) in (6) as a function of a relatively small number of parameters. Functional specification of the covariance structure for the mixed model is done through G and R of (5), often only in terms of Rij. We present six covariance structures that will be fitted to the FEV1 data. Since observations on different patients are assumed independent, the structure refers to the covariance pattern of measurements on the same subject. For most of these structures, the covariance between two observations on the same subject depends only on the length of the time interval between measurements (called the lag), and the variance is constant over time. We assume the repeated measurements are equally spaced so we may define the lag for the observations Yijk and Yijl to be the absolute value of k − l, that is |k − l|. For these structures, the covariance can be characterized in terms of the variance and the correlations expressed as a function of the lag. We generically denote the correlation function corrXXX(lag), where XXX is an abbreviation for the name of a covariance structure.

4.1. Simple (SIM)

cov(Yijk, Yijl) = 0 if k ≠ l;  V(Yijk) = σ²_SIM

Simple structure specifies that the observations are independent, even on the same patient, and have homogeneous variance V(Yijk) = σ²_SIM. The correlation function is corrSIM(lag) = 0. Simple structure is not realistic for most repeated measures data because it specifies that observations on
the same patient are independent. In terms of model (5), G = 0 and Rij = σ²_SIM I, where I is an identity matrix. For the model (3), simple structure would be obtained with dij = 0 (equivalently, σ²_d = 0), cov(eijk, eijl) = 0 for k ≠ l, and V(eijk) = σ²_SIM.
4.2. Compound Symmetric (CS)

cov(Yijk, Yijl) = σ²_CS,b if k ≠ l;  V(Yijk) = σ²_CS,b + σ²_CS,w

Compound symmetric structure specifies that observations on the same patient have homogeneous covariance σ²_CS,b, and homogeneous variance V(Yijk) = σ²_CS,b + σ²_CS,w. The correlation function is

corrCS(lag) = σ²_CS,b / (σ²_CS,b + σ²_CS,w)

Notice that the correlation does not depend on the value of lag, in the sense that the correlations between two observations are equal for all pairs of observations on the same subject. Compound symmetric structure is sometimes called 'variance components' structure, because the two parameters σ²_CS,b and σ²_CS,w represent between-subjects and within-subjects variances, respectively. This mix of between- and within-subject variances logically motivates the form of V(Yij) in many situations and implies a non-negative correlation between pairs of within-subject observations. It can be specified in one of two ways through G and R in (5). One way is to define G = σ²_CS,b I and R = σ²_CS,w I. In terms of the univariate model (3), we would have σ²_d = σ²_CS,b, cov(eijk, eijl) = 0 for k ≠ l, and V(eijk) = σ²_CS,w. The other way to specify compound symmetric structure is to define G = 0, and define Rij to be compound symmetric; for example, Rij = σ²_CS,w I + σ²_CS,b J, where J is a matrix of ones. In terms of the univariate model (3), we would have σ²_d = 0, cov(eijk, eijl) = σ²_CS,b for k ≠ l, and V(eijk) = σ²_CS,b + σ²_CS,w. The second formulation using only the R matrix is more general, since it can be defined with negative within-subject correlation as well.

4.3. Autoregressive, order 1 (AR(1))

cov(Yijk, Yijl) = σ²_AR(1) ρ_AR(1)^|k−l|

Autoregressive (order 1) covariance structure specifies homogeneous variance V(Yijk) = σ²_AR(1). It also specifies that covariances between observations on the same patient are not equal, but decrease toward zero with increasing lag. The correlation between the measurements at times k and l is given by the exponential function

corrAR(1)(lag) = ρ_AR(1)^lag

Thus, observations on the same patient far apart in time would be essentially independent, which may not be realistic. Autoregressive structure is defined in model (5) entirely in terms of R, with G = 0. The element in row k, column l of Rij is defined to be σ²_AR(1) ρ_AR(1)^|k−l|. In terms of the univariate model (3), we would have σ²_d = 0, and cov(eijk, eijl) = σ²_AR(1) ρ_AR(1)^|k−l|.

4.4. Autoregressive with random effect for patient (AR(1)+RE)

cov(Yijk, Yijl) = σ²_AR(1)+RE,b + σ²_AR(1)+RE,w ρ_AR(1)+RE^|k−l|
Autoregressive with random effect for patient covariance structure specifies homogeneous variance V(Yijk) = σ²_AR(1)+RE,b + σ²_AR(1)+RE,w. The correlation function is

corrAR(1)+RE(lag) = (σ²_AR(1)+RE,b + σ²_AR(1)+RE,w ρ_AR(1)+RE^lag) / (σ²_AR(1)+RE,b + σ²_AR(1)+RE,w)

Autoregressive plus random effects structure specifies that covariance between observations on the same patient comes from two sources. First, any two observations share a common contribution simply because they are on the same subject. This is the σ²_AR(1)+RE,b portion of the covariance, and results from defining a random effect for patients. Second, the covariance between observations decreases exponentially with lag, but decreases only to σ²_AR(1)+RE,b. This is the autoregressive contribution to the covariance, σ²_AR(1)+RE,w ρ_AR(1)+RE^|k−l|. In terms of model (5), AR(1)+RE is represented with G = σ²_AR(1)+RE,b I and autoregressive Rij. In terms of the univariate model (3), we would have σ²_d = σ²_AR(1)+RE,b, and cov(eijk, eijl) = σ²_AR(1)+RE,w ρ_AR(1)+RE^|k−l|. The AR(1)+RE covariance structure actually results from a special case of the model proposed by Diggle [4].

4.5. Toeplitz (TOEP)

cov(Yijk, Yijl) = σ_TOEP,|k−l|;  V(Yijk) = σ²_TOEP

Toeplitz structure, sometimes called 'banded', specifies that covariance depends only on lag, but not as a mathematical function with a smaller number of parameters. The correlation function is corrTOEP(lag) = σ_TOEP,|lag| / σ²_TOEP. In terms of model (5), TOEP structure is given with G = 0. The elements of the main diagonal of R are σ²_TOEP. All elements in a sub-diagonal |k − l| = lag are σ_TOEP,|k−l|, where k is the row number and l is the column number.

4.6. Unstructured (UN)

cov(Yijk, Yijl) = σ_UN,kl

The 'unstructured' structure specifies no patterns in the covariance matrix, and is completely general, but the generality brings the disadvantage of having a very large number of parameters. In terms of model (5), it is given with G = 0 and a completely general Rij.
5. USING THE MIXED PROCEDURE TO FIT LINEAR MIXED MODELS

We now turn to PROC MIXED for analyses of the FEV1 data which fit the mean model (3) and accommodate structures defined on the covariance matrix. We assume the reader has some familiarity with the SAS System, and knows how to construct SAS data sets and call SAS procedures. The general linear mixed model (5) may be fit by using the MODEL, CLASS, RANDOM and REPEATED statements in the MIXED procedure. The MODEL statement consists of an equation which specifies the response variable on the left side of the equal sign and terms on the right side to specify the fixed effects portion of the model, Xβ. Readers familiar with the GLM procedure
in SAS will recognize the RANDOM and REPEATED statements as being available in GLM, but their purposes are quite different in MIXED. The RANDOM statement in MIXED is used to specify the random effects portion, ZU, including the structure of V(U) = G. The REPEATED statement in MIXED is used to specify the structure of V(e) = R. Also, the MODEL statement in MIXED contains only fixed effects, but in GLM it contains both fixed and random effects. The CLASS statement, however, has a similar purpose in MIXED as in GLM, which is to specify classification variables, that is, variables for which indicator variables are needed in either X or Z. The CLASS statement in MIXED also is used to identify grouping variables, for example, variables that delineate the sub-matrices of block diagonal G or R. In the FEV1 data, PATIENT and DRUG are clearly classification variables, and must be listed in the CLASS statement. The variable HR (hour) could be treated as either a continuous or a classification variable. In the first stage of implementing the linear mixed model, the mean structure E(Y) = Xβ usually should be fully parameterized, as emphasized by Diggle [4]. Underspecifying the mean structure can result in biased estimates of the variance and covariance parameters, and thus lead to an incorrect assessment of covariance structure. Therefore, unless there are a very large number of levels of the repeated measures factor, we usually specify the repeated measures factor as a classification variable. Thus, we include the variable HR in the CLASS statement

class drug patient hr;

On the right side of the MODEL statement, we list terms to specify the mean structure (3)

model fev1 = basefev1 drug hr drug*hr

Executing the statements

proc mixed data = fev1uni;
class drug patient hr;
(7)
model fev1 = basefev1 drug hr drug*hr;

would provide an ordinary least squares fit of the model (3). Results would be equivalent to those obtained by executing the CLASS and MODEL statements in (7) using PROC GLM. All tests of hypotheses, standard errors, and confidence intervals for estimable functions would be computed with an implicit assumption that V(Y) = σ²I, that is, that G = 0 and that R = σ²I. Specifying the MODEL statement in (7) is basically stage 1 of our four-stage process. Stage 2 is to select an appropriate covariance structure. The covariance structures described in Section 4 may be implemented in PROC MIXED by using RANDOM and/or REPEATED statements in conjunction with the statements (7). These statements cause PROC MIXED to compute REsidual Maximum Likelihood (REML, also known as restricted maximum likelihood) or Maximum Likelihood (ML) (Searle et al., reference [8], chapter 6) estimates of covariance parameters for the specified structures. Several options are available with the REPEATED and RANDOM statements, and would be specified following a slash (/). Following is a list of some of the options, and a brief description
of their functions:

TYPE = structure-type. Specifies the type of structure for G or R. Structure options are given in SAS Institute Inc. [3].
R and RCORR (REPEATED). Request printing of the R matrix in covariance or correlation form.
G and GCORR (RANDOM). Request printing of the G matrix in covariance or correlation form.
V and VCORR (RANDOM). Request printing of the V = ZGZ′ + R matrix in covariance or correlation form.
SUBJECT = variable name. Specifies variables whose levels are used to identify block diagonal structure in G or R. When used in conjunction with the R, RCORR, G, GCORR, V or VCORR options, only a sub-matrix for a single value of the variable is printed.

We now present statements to produce each of the covariance structures of Section 4. Basic output from these statements would include a table of estimates of parameters in the specified covariance structure and a table of tests of fixed effects, similar to an analysis of variance table. In each of the REPEATED statements, there is a designation 'SUBJECT = PATIENT(DRUG)'. This specifies that R is a block diagonal matrix with a sub-matrix for each patient. In this example, it is necessary to designate PATIENT(DRUG) because patients are numbered 1–24 in each drug. If patients were numbered 1–72, with no common numberings in different drugs, it would be sufficient to designate only 'PATIENT'. The options R and RCORR are used with the REPEATED statement and V and VCORR are used with the RANDOM statement to request printing of covariance and correlation matrices.

5.1. Simple

This is the default structure when no RANDOM or REPEATED statement is used, as in statements (7), or when no TYPE option is specified in a RANDOM or REPEATED statement. It can be specified explicitly with a REPEATED statement using a TYPE option:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
repeated / type = vc subject = patient(drug) r rcorr;
(8)
Note that in SAS version 6.12, the option 'simple' can replace 'vc' in the REPEATED statement.

5.2. Compound Symmetric

As noted in the previous section, compound symmetric covariance structure can be specified two different ways using G or R. Correspondingly, it can be implemented two different ways in the MIXED procedure, which would give identical results for non-negative within-subject correlation, except for labelling. The first way, setting G = σ²_CS,b I and R = σ²_CS,w I, is implemented with the RANDOM statement:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
random patient(drug);
(9)
The RANDOM statement defines G = σ²_CS,b I and the absence of a REPEATED statement (by default) defines R = σ²_CS,w I. The second way, setting G = 0 and Rij = σ²_CS,w I + σ²_CS,b J, is implemented with a REPEATED statement using SUBJECT and TYPE options. The following statements would specify compound symmetric structure for each individual patient, and print the Rij sub-matrix for one patient in both covariance and correlation forms:
proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
repeated / type = cs subject = patient(drug) r rcorr;
(10)
The PROC MIXED output from statements (10) is shown in Figure 4, so that the reader can relate it to the parts we summarize in tables.

5.3. Autoregressive, order 1

This covariance structure would be specified for each patient using a REPEATED statement:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
repeated / type = ar(1) subject = patient(drug) r rcorr;
(11)
5.4. Autoregressive with random effect for patient

This covariance structure involves both G and R, and therefore requires both a RANDOM and a REPEATED statement:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
random patient(drug);
repeated / type = ar(1) subject = patient(drug);
(12)
The RANDOM statement defines G = σ²_AR(1)+RE,b I, and the REPEATED statement defines Rij to be autoregressive, with parameters σ²_AR(1)+RE,w and ρ_AR(1)+RE. Notice that we have no R and RCORR options in the REPEATED statement in (12). Covariance and correlation estimates that would be printed by R and RCORR options in (12) would not be directly comparable with the other covariances and correlations for other structures that are defined by REPEATED statements without a RANDOM statement. Covariance and correlation estimates that would be printed by R and RCORR options in the REPEATED statement in (12) would pertain only to the R matrix. Estimates for AR(1)+RE structure which are comparable to covariances and correlations for other structures must be based on covariances of the observation vector Y, that is, on V(Y) = ZGZ′ + R. This could be printed by using V and VCORR options in the RANDOM statement in (12). However, the entire ZGZ′ + R matrix, of dimension 576 × 576, would be printed. Alternatively, the statements (13) could be used, which are the same as (12) except for the RANDOM statement, but would print only Zij G Z′ij + Rij, of dimension 8 × 8.
proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
random int / subject = patient(drug) v vcorr;
repeated / type = ar(1) subject = patient(drug);
(13)
Figure 4. Basic PROC MIXED output for compound symmetric covariance structure.
Executing statements (13) results in the covariance and corresponding correlation estimates for AR(1)+RE structure shown in Table II. The RANDOM statement in (13) defines ZU in (5) equivalent to the RANDOM statement in (12), but from an 'individual subject' perspective rather than a 'sample of subjects' perspective. The RANDOM statement in (12) basically defines columns of Z as indicator variables for different patients. The RANDOM statement in (13), with the 'int / sub = patient(drug)' designation, defines a set of ones as 'intercept' coefficients for each patient.
Table II. REML variance, covariance and correlation estimates for five covariance structures for FEV1 repeated measures.

                              Time 1  Time 2  Time 3  Time 4  Time 5  Time 6  Time 7  Time 8
1. Simple                      0.267   0.0     0.0     0.0     0.0     0.0     0.0     0.0
                               1.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2. Compound Symmetric          0.269   0.206   0.206   0.206   0.206   0.206   0.206   0.206
                               1.0     0.766   0.766   0.766   0.766   0.766   0.766   0.766
3. Autoregressive (1)          0.266   0.228   0.195   0.167   0.143   0.123   0.105   0.090
                               1.0     0.856   0.733   0.629   0.538   0.461   0.394   0.338
4. Autoregressive (1) with
   random effect for patient   0.268   0.230   0.209   0.198   0.192   0.189   0.187   0.186
                               1.0     0.858   0.780   0.739   0.716   0.705   0.698   0.694
5. Toeplitz (banded)           0.266   0.228   0.216   0.207   0.191   0.183   0.169   0.158
                               1.0     0.858   0.811   0.777   0.716   0.686   0.635   0.593

Variances and covariances in top line; correlations in bottom line.
5.5. Toeplitz

This structure can be specified in terms of R with G = 0, and therefore requires only a REPEATED statement:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
(14)
repeated / type = toep subject = patient(drug) r rcorr;
5.6. Unstructured

This structure can be specified in terms of R with G = 0, and therefore requires only a REPEATED statement:

proc mixed data = fev1uni;
class drug patient hr;
model fev1 = basefev1 drug hr drug*hr;
(15)
repeated / type = un subject = patient(drug) r rcorr;

Parameter estimates in the covariance and correlation matrices for the various structures (excepting 'unstructured') are:

SIM:      σ̂²_SIM = 0.267
CS:       σ̂²_CS,b = 0.206, σ̂²_CS,w = 0.063
AR(1):    σ̂²_AR(1) = 0.266, ρ̂_AR(1) = 0.856
AR(1)+RE: σ̂²_AR(1)+RE,b = 0.185, σ̂²_AR(1)+RE,w = 0.083, ρ̂_AR(1)+RE = 0.540
TOEP:     σ̂²_TOEP = 0.266, σ̂_TOEP,1 = 0.228, σ̂_TOEP,2 = 0.216, σ̂_TOEP,3 = 0.207, σ̂_TOEP,4 = 0.191, σ̂_TOEP,5 = 0.183, σ̂_TOEP,6 = 0.169, σ̂_TOEP,7 = 0.158
UN:       (parameter estimates shown in Table I)
The covariance and correlation matrices resulting from statements (8), (10), (11), (13) and (14) are summarized in Table II. Rather than printing the entire matrices, covariances and correlations are displayed as a function of lag for SIM, CS, AR(1), AR(1)+RE and TOEP structures. Covariances and correlations resulting from (15) are printed in Table I.
6. COMPARISON OF FITS OF COVARIANCE STRUCTURES

We discuss covariance and correlation estimates in Table II for the structured covariances in comparison with those in Table I for the unstructured covariances. First, simple and compound symmetric estimates in Table II clearly do not reflect the trends in Table I. Autoregressive estimates in Table II show the general trend of correlations decreasing with length of time interval, but the values of the correlations in the autoregressive structure are too small, especially for long intervals. Thus, none of SIM, CS or AR(1) structures appears to adequately model the correlation pattern of the data. The AR(1)+RE correlations in Table II show good agreement with TOEP estimates in Table II and UN estimates in Table I. Generally, we prefer a covariance model which provides a good fit to the UN estimates, and has a small number of parameters. On this principle, AR(1)+RE is preferable. The correlogram (Cressie, reference [9], p. 67) is a graphical device for assessing correlation structure. It is basically a plot of the correlation function. Correlation plots are shown in Figure 5 based on estimates assuming UN, CS, AR(1), AR(1)+RE and TOEP structures. Plots for CS, AR(1), AR(1)+RE and TOEP may be considered correlogram estimates assuming these structures. Of these correlations which are a function only of lag, the TOEP structure is the most general, and thus is used as the reference type in Figure 5. These plots clearly show that the fit of the AR(1)+RE structure agrees with TOEP and is superior to the fits of CS and AR(1).
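The model-based correlograms plotted in Figure 5 can be reproduced directly from the correlation functions of Section 4 and the REML estimates listed above. A minimal sketch (not part of the original analysis; the data set and variable names are ours):

data correlogram;
   do lag = 1 to 7;
      corr_cs   = 0.206 / (0.206 + 0.063);                          /* compound symmetric            */
      corr_ar1  = 0.856**lag;                                       /* AR(1)                         */
      corr_arre = (0.185 + 0.083*0.540**lag) / (0.185 + 0.083);     /* AR(1) + random patient effect */
      output;
   end;
run;

proc print data = correlogram; run;

At lag 1 the AR(1)+RE function gives (0.185 + 0.083(0.540))/0.268, approximately 0.858, matching Table II.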
Figure 5. Plots of correlation estimates and correlograms.
Akaike's information criterion (AIC) [10] and Schwarz's Bayesian criterion (SBC) [11] are indices of relative goodness-of-fit and may be used to compare models with the same fixed effects but different covariance structures. Both of these criteria apply rather generally for purposes of model selection and hypothesis testing. For instance, Kass and Wasserman [12] have shown that the SBC provides an approximate Bayes factor in large samples. Formulae for their computation are

AIC = L(θ̂) − q
SBC = L(θ̂) − (q/2) log(N*)
Table III. Akaike's information criterion (AIC) and Schwarz's Bayesian criterion (SBC) for six covariance structures.

Structure name                                          AIC       SBC
1. Simple                                             −459.5    −461.6
2. Compound symmetric                                 −175.6    −179.9
3. Autoregressive (1)                                 −139.5    −143.8
4. Autoregressive (1) with random effect for patient  −126.5    −132.9
5. Toeplitz (banded)                                  −121.9    −139.2
6. Unstructured                                       −110.1    −187.7
where L(θ̂) is the maximized log-likelihood or restricted log-likelihood (REML), q is the number of parameters in the covariance matrix, p is the number of fixed effect parameters and N* is the total number of 'observations' (N for ML and N − p for REML, where N is the number of subjects). Models with large AIC or SBC values indicate a better fit. However, it is important to note that the SBC criterion penalizes models more severely for the number of estimated parameters than does AIC. Hence the two criteria will not always agree on the choice of 'best' model. Since our objective is parsimonious modelling of the covariance structure, we will rely more on the SBC than the AIC criterion. AIC and SBC values for the six covariance structures are shown in Table III. 'Unstructured' has the largest AIC, but AR(1)+RE, 'autoregressive with random effect for patient', has the largest SBC. Toeplitz ranks second in both AIC and SBC. The discrepancy between AIC and SBC for the UN structure reflects the penalty for the large number of parameters in the UN covariance matrix. Based on inspection of the correlation estimates in Tables I and II, the graphs of Figure 5, and the relative values of SBC, we conclude that AR(1)+RE, 'autoregressive with random effect for patient', is the best choice of covariance structure.
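PROC MIXED prints these criteria itself; the following sketch (ours, with assumed input values purely for illustration) simply shows how the formulae above turn a fitted model's restricted log-likelihood into AIC and SBC:

data fit_criteria;
   loglik = -60.0;                 /* assumed REML log-likelihood, for illustration only */
   q      = 3;                     /* covariance parameters, e.g. 3 for AR(1)+RE         */
   nstar  = 541;                   /* N* = N - p for REML; value assumed                 */
   aic = loglik - q;
   sbc = loglik - (q/2)*log(nstar);
run;

Note that later SAS releases report 'smaller-is-better' versions of these criteria based on −2L(θ̂); the sign convention here follows the formulae quoted above.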
7. EFFECTS OF COVARIANCE STRUCTURE ON TESTS OF FIXED EFFECTS, ESTIMATES OF FIXED EFFECTS AND STANDARD ERRORS OF ESTIMATES

In Section 6 we compared the correlation and covariance matrices produced by five choices of covariance structure. In this section we examine the effects of choices of covariance structure on tests and estimates of fixed effects. First, we examine the table of tests for fixed effects specified in the MODEL statements. Then we select a set of 15 comparisons among means and use the ESTIMATE statement to illustrate effects of covariance structure on estimates of linear combinations of fixed effects. Table IV contains values of F tests for fixed effects that are computed by the MIXED procedure for each of the covariance structures specified in (8), (10), (11), (13), (14) and (15). The F values differ substantially for SIM, CS and AR(1) structures. These are the structures that did not provide good fits in Section 6. Failure of SIM to recognize between-patient variation results in the excessively large F values for BASEFEV1 and DRUG, which are between-patient effects. Using CS structure produces essentially the same results that would be obtained by using a univariate split-plot type analysis of variance (Milliken and Johnson, reference [7], chapter 26). It results in excessively large F values for HR and DRUG*HR. This is a well-known phenomenon of
Table IV. Values of F tests for fixed effects for six covariance structures.

Structure name                                        BASEFEV1   DRUG     HR    DRUG*HR
1. Simple                                              490.76    46.50    9.20    1.69
2. Compound symmetric                                   76.42     7.24   38.86    7.11
3. Autoregressive (1)                                   90.39     8.40    7.39    2.46
4. Autoregressive (1) with random effect for patient    75.93     7.28   17.10    3.94
5. Toeplitz (banded)                                    76.31     7.30   13.75    3.82
6. Unstructured                                         92.58     7.25   13.72    4.06
performing univariate analysis of variance when CS (actually, Huynh–Feldt [13]) assumptions are not met. It is basically the reason for making the so-called Huynh–Feldt [13] and Greenhouse–Geisser [14] adjustments to ANOVA p-values as done by the REPEATED statement in PROC GLM [15]. F values for tests of HR and DRUG*HR using AR(1) structure are excessively small due to the fact that AR(1) underestimates covariances between observations far apart in time, and thereby overestimates variances of differences between these observations. Results of F tests based on AR(1)+RE, TOEP and UN covariance are similar for all fixed effects. All of these structures are adequate for modelling the covariance, and therefore produce valid estimates of error. Now, we investigate effects of covariance structure on 15 linear combinations of fixed effects, which are comparisons of means. The first seven comparisons are differences between hour 1 and subsequent hours in drug A; these are within-subject comparisons. In terms of the univariate model (2), they are estimates of

μ_A·1 − μ_A·k = τ1 − τk + (ατ)A1 − (ατ)Ak
(16)
for k = 2, ..., 8. The next eight comparisons are differences between drugs A and B at hours 1 to 8; these are between-subject comparisons at particular times. In terms of the univariate model (3), they are estimates of

μ_A·k − μ_B·k = (x̄A − x̄B)β + αA − αB + (ατ)Ak − (ατ)Bk
(17)
for k = 1, ..., 8. The ESTIMATE statement in the MIXED procedure can be used to compute estimates of linear combinations of fixed effect parameters. It is used for this purpose in essentially the same manner as with the GLM procedure. With MIXED, the ESTIMATE statement can be used for the more general purpose of computing estimates of linear combinations of fixed and random effects, known as Best Linear Unbiased Predictors (BLUPs) [16]. The following ESTIMATE statements in (18) can be run in conjunction with the PROC MIXED statements (7)–(15) to obtain estimates of the differences (16). Coefficients following 'hr' in (18) specify coefficients of the τk parameters in (16), and coefficients following 'drug*hr' in (18) specify coefficients of the (ατ)ik parameters in (16):

estimate 'hr1-hr2 drgA' hr 1 -1 0 0 0 0 0 0 drug*hr 1 -1 0 0 0 0 0 0;
estimate 'hr1-hr3 drgA' hr 1 0 -1 0 0 0 0 0 drug*hr 1 0 -1 0 0 0 0 0;
Table V. Estimates and standard errors for six covariance structures: within-subject comparisons across time.

                                              Standard errors
Parameter        Estimate*   Simple     CS      AR(1)   AR(1)+RE  Toeplitz (banded)  Unstructured
hr1-hr2 drug A    0.0767     0.1491   0.0725   0.0564    0.0564        0.0562           0.0470
hr1-hr3 drug A    0.2896     0.1491   0.0725   0.0769    0.0700        0.0647           0.0492
hr1-hr4 drug A    0.4271     0.1491   0.0725   0.0908    0.0764        0.0704           0.0698
hr1-hr5 drug A    0.4200     0.1491   0.0725   0.1012    0.0796        0.0794           0.0822
hr1-hr6 drug A    0.4942     0.1491   0.0725   0.1093    0.0813        0.0836           0.0811
hr1-hr7 drug A    0.6050     0.1491   0.0725   0.1158    0.0822        0.0900           0.1002
hr1-hr8 drug A    0.6154     0.1491   0.0725   0.1211    0.0827        0.0951           0.0888

*Parameter estimates are the same regardless of variance structure for these contrasts.
estimate 'hr1-hr4 drgA' hr 1 0 0 -1 0 0 0 0 drug*hr 1 0 0 -1 0 0 0 0;
estimate 'hr1-hr5 drgA' hr 1 0 0 0 -1 0 0 0 drug*hr 1 0 0 0 -1 0 0 0;
estimate 'hr1-hr6 drgA' hr 1 0 0 0 0 -1 0 0 drug*hr 1 0 0 0 0 -1 0 0;
estimate 'hr1-hr7 drgA' hr 1 0 0 0 0 0 -1 0 drug*hr 1 0 0 0 0 0 -1 0;
estimate 'hr1-hr8 drgA' hr 1 0 0 0 0 0 0 -1 drug*hr 1 0 0 0 0 0 0 -1;
(18)
Results from running these ESTIMATE statements with each of the six covariance structures in (18) appear in Table V. The estimates obtained from (18) are simply differences between the two drug A means for each pair of hours, that is, the estimate labelled 'hr1-hrk drgA' is Ȳ_A·1 − Ȳ_A·k, or in terms of the model (2), τ1 − τk + (ατ)A1 − (ατ)Ak + ē_A·1 − ē_A·k, for k = 2, ..., 8. Because the covariable BASEFEV1 is a subject-level covariate, it cancels in this comparison. Consequently, the estimates are all the same for any covariance structure due to the equivalence of generalized least squares (GLS) and ordinary least squares (OLS) in this setting. This will not happen in all cases, such as when the data are unbalanced, when the covariate is time-varying, or when polynomial trends are used to model time effects. In this example, the data are balanced and hour is treated as a discrete factor. See Puntanen and Styan [17] for general conditions when GLS estimates are equal to OLS estimates. Even though all estimates of differences from statements (18) are equal, each of the six covariance structures results in a different standard error estimate (Table V). Note that the 'simple' standard error estimates are always larger than those from the mixed model. The general expression for the variance of the estimated difference is

V(Ȳ_A·1 − Ȳ_A·k) = [σ_{1,1} + σ_{k,k} − 2σ_{1,k}]/24
(19)
where σ_{k,l} = cov(Yijk, Yijl). For structured covariances, σ_{k,l} will be a function of k, l, and a small number of parameters. Standard error estimates printed by PROC MIXED are square roots of (19), with the σ_{k,l} expressions replaced by their respective estimates, assuming a particular covariance structure. We now discuss effects of the assumed covariance structure on the standard error estimates. Structure number 1, 'simple', treats the data as if all observations are independent with the same variance. This results in equal standard error estimates of 0.14909825 = (2σ̂²_SIM/24)^(1/2) = (2(0.267/24))^(1/2)
for all differences between time means in the same drug. These are incorrect because SIM structure clearly is inappropriate for two reasons. First, the SIM structure does not accommodate between-patient variation, and second, it does not recognize that measures close together in time are more highly correlated than measures far apart in time. Structure number 2, 'compound symmetric', acknowledges variation as coming from two sources, between- and within-patient. This results in standard error estimates of 0.07252978 = (2σ̂²_CS,w/24)^(1/2) = (2(0.063/24))^(1/2) being functions only of the within-patient variance component estimate. However, compound symmetry does not accommodate different standard errors of differences between times as being dependent on the length of the time interval. Consequently, the standard error estimates based on the compound symmetric structure also are invalid. Structure number 3, 'autoregressive', results in standard errors of estimates of differences between times which depend on the length of the time interval. For example, the standard error estimate for the difference between hours 1 and 8 (lag = 7) is 0.121 = (2σ̂²_AR(1)(1 − ρ̂_AR(1)^7)/24)^(1/2) = (2(0.266(1 − 0.856^7))/24)^(1/2) and similarly for other lags. The standard error estimates are 0.121 for the difference between hours 1 and 8 etc., down to 0.056 for the difference between hours 1 and 2. If the autoregressive structure were correct, then these estimates of standard errors should be in good agreement with those produced by TOEP covariance. The TOEP standard error estimates range from 0.095 for the difference between hours 1 and 8 down to 0.056 for the difference between hours 1 and 2. Thus the autoregressive estimates are too large by approximately 30 per cent for long time intervals (for example, hours 1 to 8). This is because the autoregressive structure underestimates the correlation between observations far apart in time by forcing the correlation to decrease exponentially toward zero. Next, we examine the standard errors provided by structure 4, 'autoregressive with random effect for patient'. The standard error estimate for the difference between hours 1 and 8 (lag = 7) is 0.083 = (2(σ̂²_AR(1)+RE,w(1 − ρ̂_AR(1)+RE^7))/24)^(1/2) = (2(0.083(1 − 0.540^7))/24)^(1/2) and similarly for other lags. We see that these standard error estimates generally provide good agreement with the TOEP and UN standard error estimates. These three structures (TOEP, UN and AR(1)+RE) are all potential candidates, because they accommodate between-subject variance and decreasing correlation as the lag increases. The intuitive advantage of the AR(1)+RE estimates over the TOEP and UN estimates in this setting is that the standard errors of the AR(1)+RE estimates follow a smooth trend as a function of lag, whereas the TOEP and UN standard error estimates are more erratic, particularly so for the UN estimates. In all three structures, the standard errors for the larger time lags are larger than those for the smaller lags, reflecting the pattern seen in the data. The following ESTIMATE statements can be run in conjunction with PROC MIXED statements (7)–(15) to obtain estimates of the differences between drugs A and B at each hour, defined
Table VI. Estimates and standard errors for six covariance structures: between-subject comparisons.

Parameter               Simple      CS      AR(1)   AR(1)+RE  Toeplitz (banded)  Unstructured
Estimates
drg B – drg A hr1       0.2184    0.2184   0.2179    0.2182       0.2180           0.2188
drg B – drg A hr2       0.2305    0.2305   0.2300    0.2303       0.2301           0.2308
drg B – drg A hr3       0.3943    0.3943   0.3938    0.3941       0.3939           0.3946
drg B – drg A hr4       0.3980    0.3981   0.3975    0.3978       0.3976           0.3984
drg B – drg A hr5       0.1968    0.1968   0.1963    0.1966       0.1964           0.1971
drg B – drg A hr6       0.1068    0.1068   0.1063    0.1066       0.1064           0.1071
drg B – drg A hr7       0.1093    0.1093   0.1088    0.1091       0.1088           0.1096
drg B – drg A hr8       0.1530    0.1530   0.1525    0.1528       0.1526           0.1534
Standard errors
drg B – drg A hr1       0.1491    0.1499   0.1489    0.1494       0.1490           0.1374
drg B – drg A hr2       0.1491    0.1499   0.1489    0.1494       0.1490           0.1471
drg B – drg A hr3       0.1491    0.1499   0.1489    0.1494       0.1490           0.1454
drg B – drg A hr4       0.1491    0.1499   0.1489    0.1494       0.1490           0.1578
drg B – drg A hr5       0.1491    0.1499   0.1489    0.1494       0.1490           0.1544
drg B – drg A hr6       0.1491    0.1499   0.1489    0.1494       0.1490           0.1465
drg B – drg A hr7       0.1491    0.1499   0.1489    0.1494       0.1490           0.1501
drg B – drg A hr8       0.1491    0.1499   0.1489    0.1494       0.1490           0.1579
in (17):

estimate 'drgA-drgB hr1' drug 1 -1 0 drug*hr 1 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0;
estimate 'drgA-drgB hr2' drug 1 -1 0 drug*hr 0 1 0 0 0 0 0 0 0 -1 0 0 0 0 0 0;
estimate 'drgA-drgB hr3' drug 1 -1 0 drug*hr 0 0 1 0 0 0 0 0 0 0 -1 0 0 0 0 0;
estimate 'drgA-drgB hr4' drug 1 -1 0 drug*hr 0 0 0 1 0 0 0 0 0 0 0 -1 0 0 0 0;
estimate 'drgA-drgB hr5' drug 1 -1 0 drug*hr 0 0 0 0 1 0 0 0 0 0 0 0 -1 0 0 0;
estimate 'drgA-drgB hr6' drug 1 -1 0 drug*hr 0 0 0 0 0 1 0 0 0 0 0 0 0 -1 0 0;
(20)
estimate 'drgA-drgB hr7' drug 1 -1 0 drug*hr 0 0 0 0 0 0 1 0 0 0 0 0 0 0 -1 0;
estimate 'drgA-drgB hr8' drug 1 -1 0 drug*hr 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 -1;

Results appear in Table VI. These estimates are the same for structures SIM and CS. They are simply differences between ordinary least squares means, adjusted for the covariable BASEFEV1. However, a simple expression for the variance of the estimates is not easily available. The standard errors differ for the two covariance structures, because simple structure does not recognize between-patient variation. Estimates of drug differences for the four covariance structures other than 'Simple' and 'Compound symmetric' are all numerically different, though similar. Also, standard errors of the drug differences are not the same for covariance structures AR(1), AR(1)+RE, TOEP and UN, but the standard errors for the AR(1)+RE and TOEP structures are constant over the hours. This is desirable, because the data variances are homogeneous over hours, and the adjustment for the covariable BASEFEV1 would be the same at each hour. However, the standard errors of drug differences for UN covariance vacillate between 0.137 and 0.158, a range of approximately 16 per cent. The standard errors are not constant because UN does not assume homogeneous variances. In the present
example, it is reasonable to assume homogeneous variances, and this should be exploited. Not doing so results in variable and inefficient standard error estimates. The purpose of this section was to illustrate the practical effects of choosing a covariance structure. The results show that SIM, CS and AR(1) covariance structures are inadequate for the example data. These structure models basically provide ill-fitting estimates of the true covariance matrix of the data. In turn, the ill-fitting estimates of data covariance result in poor estimates of standard errors of certain differences between means, even if estimates of differences between means are equal across covariance structures. The structures AR(1)+RE, TOEP and UN are adequate, in the sense that they provide good fits to the data covariance. (This is always true of UN because there are no constraints to impose lack of fit.) These adequate structures incorporate the two essential features of the data covariance. One, observations on the same patient are correlated, and two, observations on the same patient taken close in time are more highly correlated than observations taken far apart in time. As a result, standard error estimates based on assumptions of AR(1)+RE, TOEP or UN covariance structures are valid, but because UN imposes no constraints or patterns, the standard error estimates are somewhat unstable.
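As a closing numerical check (ours, not part of the original analysis), the within-subject standard errors in Table V can be reproduced from expression (19) and the REML covariance parameter estimates reported in Section 5:

data se_check;
   n = 24;                                           /* patients per drug group           */
   do lag = 1 to 7;
      se_sim  = sqrt(2*0.267/n);                     /* simple                            */
      se_cs   = sqrt(2*0.063/n);                     /* compound symmetric (within part)  */
      se_ar1  = sqrt(2*0.266*(1 - 0.856**lag)/n);    /* AR(1)                             */
      se_arre = sqrt(2*0.083*(1 - 0.540**lag)/n);    /* AR(1) + random effect for patient */
      output;
   end;
run;

proc print data = se_check; run;

For lag = 7 this gives approximately 0.149, 0.073, 0.121 and 0.083, agreeing with the 'hr1-hr8' row of Table V.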
8. MODELLING POLYNOMIAL TRENDS OVER TIME

Previous analyses have treated hour as a classification variable and not modelled FEV1 trends as a continuous function of hour. In Section 6, we fitted six covariance structures to the FEV1 data, and determined that AR(1)+RE provided the best fit. In Section 7, we examined effects of covariance structure on estimates of fixed effect parameters and standard errors. In this section, we treat hour as a continuous variable and model hour effects in polynomials to refine the fixed effects portion of the model. Then we use the polynomial model to compute estimates of differences analogous to those in Section 7. Statements (21) fit the general linear mixed model using AR(1)+RE covariance structure to model random effects and third degree polynomials to model fixed effects of drug and hour. A previous analysis (not shown) that fitted fourth degree polynomials using PROC MIXED showed no significant evidence of fourth degree terms.

proc mixed data = fev1uni;
class drug patient;
model fev1 = basefev1 drug drug*hr drug*hr*hr drug*hr*hr*hr / htype = 1 3 solution noint;
random patient(drug);
repeated / type = ar(1) sub = patient(drug);
(21)
The MODEL statement in (21) is specified so that parameter estimates obtained from the SOLUTION option directly provide the coefficients of the third degree polynomials for each drug. The fitted polynomial equations, after inserting the overall average value of 2.6493 for BASEFEV1, are

A: FEV1 = 3.6187 − 0.1475 HR + 0.0034 HR² + 0.0004 HR³
B: FEV1 = 3.5793 + 0.1806 HR − 0.0802 HR² + 0.0061 HR³
P: FEV1 = 2.7355 + 0.1214 HR − 0.0289 HR² + 0.0017 HR³
Figure 6. Plots of polynomial trends over hours for each drug.
The polynomial curves for the drugs are plotted in Figure 6. Estimates of differences between hour 1 and subsequent hours in drug A based on the fitted polynomials may be obtained from the ESTIMATE statements (22). The coefficients are (1 − k) for drug*hr, (1 − k²) for drug*hr*hr and (1 − k³) for drug*hr*hr*hr; coefficients for drugs B and P are omitted and default to zero:

estimate 'hr1-hr2 drga' drug*hr -1 drug*hr*hr  -3 drug*hr*hr*hr   -7;
estimate 'hr1-hr3 drga' drug*hr -2 drug*hr*hr  -8 drug*hr*hr*hr  -26;
estimate 'hr1-hr4 drga' drug*hr -3 drug*hr*hr -15 drug*hr*hr*hr  -63;
estimate 'hr1-hr5 drga' drug*hr -4 drug*hr*hr -24 drug*hr*hr*hr -124;
estimate 'hr1-hr6 drga' drug*hr -5 drug*hr*hr -35 drug*hr*hr*hr -215;
estimate 'hr1-hr7 drga' drug*hr -6 drug*hr*hr -48 drug*hr*hr*hr -342;
estimate 'hr1-hr8 drga' drug*hr -7 drug*hr*hr -63 drug*hr*hr*hr -511;
(22)
Results from statements (22) appear in Table VII. We see that standard errors of differences between hour 1 and subsequent hours in drug A using AR(1)+RE covariance and polynomial trends for hour are smaller than corresponding standard errors in Table V using AR(1)+RE covariance and hour as a classification variable. This is due to the use of the polynomial model which exploits the continuous trend over hours. If the polynomial model yields very different results, one would conclude it does not adequately represent the trend over time. Estimates of differences between drugs A and B at each hour may be obtained from the ESTIMATE statements (23):

estimate 'drga-drgb hr1' drug 1 -1 0 drug*hr 1 -1 0 drug*hr*hr 1 -1 0 drug*hr*hr*hr 1 -1 0;
estimate 'drga-drgb hr2' drug 1 -1 0 drug*hr 2 -2 0 drug*hr*hr 4 -4 0 drug*hr*hr*hr 8 -8 0;
Table VII. Estimates and standard errors for AR(1)+RE covariance structure and third degree polynomial model for hour.

Parameter                        Estimate   Standard error
Within-subject comparisons
hr1-hr2 drug A                    0.1346       0.0453
hr1-hr3 drug A                    0.2577       0.0634
hr1-hr4 drug A                    0.3669       0.0686
hr1-hr5 drug A                    0.4599       0.0720
hr1-hr6 drug A                    0.5344       0.0754
hr1-hr7 drug A                    0.5880       0.0753
hr1-hr8 drug A                    0.6183       0.0828
Between-subject comparisons
drg B–drg A hr1                   0.2108       0.1494
drg B–drg A hr2                   0.3280       0.1429
drg B–drg A hr3                   0.3463       0.1434
drg B–drg A hr4                   0.2998       0.1408
drg B–drg A hr5                   0.2228       0.1408
drg B–drg A hr6                   0.1492       0.1434
drg B–drg A hr7                   0.1132       0.1429
drg B–drg A hr8                   0.1489       0.1494
estimate 'drga-drgb hr3' drug 1 -1 0 drug*hr 3 -3 0 drug*hr*hr 9 -9 0 drug*hr*hr*hr 27 -27 0;
estimate 'drga-drgb hr4' drug 1 -1 0 drug*hr 4 -4 0 drug*hr*hr 16 -16 0 drug*hr*hr*hr 64 -64 0;
estimate 'drga-drgb hr5' drug 1 -1 0 drug*hr 5 -5 0 drug*hr*hr 25 -25 0 drug*hr*hr*hr 125 -125 0;
estimate 'drga-drgb hr6' drug 1 -1 0
(23)
drug*hr 6 -6 0 drug*hr*hr 36 -36 0 drug*hr*hr*hr 216 -216 0;
estimate 'drga-drgb hr7' drug 1 -1 0 drug*hr 7 -7 0 drug*hr*hr 49 -49 0 drug*hr*hr*hr 343 -343 0;
estimate 'drga-drgb hr8' drug 1 -1 0 drug*hr 8 -8 0 drug*hr*hr 64 -64 0 drug*hr*hr*hr 512 -512 0;

Results from statements (23) appear in Table VII. Standard errors for differences between drug A and drug B at hours 1 and 8 using the polynomial model are similar to standard errors for these differences using the model with hour as a classification variable. The standard errors of differences between drugs A and B at intermediate hours are less than the standard errors for respective differences using hour as a classification variable. Again, this is a phenomenon related to using regression models, and has very little to do with the covariance structure. It demonstrates that there is considerable advantage to refining the fixed effects portion of the model. We believe, however, that refining the fixed effects portion of the model should be done after arriving at a satisfactory covariance structure using a saturated fixed effects model.
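The between-drug estimates in Table VII can also be recovered simply by evaluating the fitted cubic equations given above at each hour. A minimal sketch (ours; the data set name is arbitrary):

data poly_diff;
   do hr = 1 to 8;
      fev_a = 3.6187 - 0.1475*hr + 0.0034*hr**2 + 0.0004*hr**3;   /* drug A fitted curve */
      fev_b = 3.5793 + 0.1806*hr - 0.0802*hr**2 + 0.0061*hr**3;   /* drug B fitted curve */
      diff_ba = fev_b - fev_a;                                    /* drug B minus drug A */
      output;
   end;
run;

proc print data = poly_diff; run;

At hr = 1 this gives diff_ba of approximately 0.2108, matching the first between-subject entry of Table VII; standard errors, of course, require the ESTIMATE statements.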
9. SUMMARY AND CONCLUSIONS

One of the primary distinguishing features of analysis of repeated measures data is the need to accommodate the covariation of the measures on the same sampling unit. Modern statistical software enables the user to incorporate the covariance structure into the statistical model. This should be done at a stage prior to the inferential stage of the analysis. Choice of covariance structure can utilize graphical techniques, numerical comparisons of covariance estimates, and indices of goodness-of-fit. After covariance is satisfactorily modelled, the estimated covariance matrix is used to compute generalized least squares estimates of fixed effects of treatments and time. In most repeated measures settings there are two aspects to the covariance structure. First is the covariance structure induced by the subject experimental design, that is, the manner in which subjects are assigned to treatment groups. The design typically induces covariance due to contribution of random effects. In the example of this paper, the design was completely randomized, which results in covariance of observations on the same subject due to between-subject variation. If the design were randomized blocks, then there would be additional covariance due to block variation. When using SAS PROC MIXED, the covariance structure induced by the subject experimental design is usually specified in the RANDOM statement. Second is the covariance structure induced by the phenomenon that measures close in time are more highly correlated than measures far apart in time. In many cases this can be described by a mathematical function of time lag between measures. This aspect of covariance structure must be modelled using the REPEATED statement in PROC MIXED. Estimates of fixed effects, such as differences between treatment means, may be the same for different covariance structures, but standard errors of these estimates can still be substantially different. Thus, it is important to model the covariance structure even in conditions when estimates of fixed effects do not depend on the covariance structure. Likewise, tests of significance may depend on covariance structure even when estimates of fixed effects do not. The example in the present paper has equal numbers of subjects per treatment and no missing data for any subject. Having equal numbers of subjects per treatment is not particularly important as far as implementation of data analysis is concerned using mixed model technology. However, missing data within subjects can present serious problems depending on the amount, cause and pattern of missing data. In some cases, missing data can cause non-estimability of fixed effect parameters. This would occur in the extreme situation of all subjects in a particular treatment having missing data at the same time point. Missing data can also result in unstable estimates of variance and covariance parameters, though non-estimability is unlikely. The analyst must also address the underlying causes of missing data to assess the potential for introducing bias into the estimates. If the treatment is so toxic as to cause elimination of study subjects, ignoring that cause of missingness would lead to erroneous conclusions about the efficacy of the treatment. For more information on this topic, the reader is referred to Little and Rubin [18], who describe different severity levels of missingness and modelling approaches to address it. Unequal spacing of observation times presents no conceptual problems in data analysis, but computation may be more complex.
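For unequally spaced times, one concrete option (a sketch under assumed variable names, not taken from the paper) is PROC MIXED's spatial power structure TYPE=SP(POW), which generalizes AR(1) by replacing the integer lag with the actual distance between measurement times:

proc mixed data = fev1uni;
   class drug patient;
   model fev1 = basefev1 drug hr drug*hr;            /* hr entered as a continuous variable here   */
   random patient(drug);
   repeated / type = sp(pow)(hr) subject = patient(drug);   /* correlation decays with |hr_k - hr_l| */
run;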
In terms of PROC MIXED, the user may have to resort to the class of covariance structures for spatial data, as in the sketch above, to implement autoregressive covariance. See Littell et al. [15] for illustration. Using regression curves to model mean response as functions of time can greatly decrease standard errors of estimators of treatment means and differences between treatment means at
particular times. This is true in any modelling situation involving a continuous variable, and is not related particularly to repeated measures data. This was demonstrated in Section 8 using polynomials to model FEV1 trends over time. In an actual data analysis application, pharmacokinetic models could be used instead. Such models usually are non-linear in the parameters, and thus PROC MIXED could not be used in its usual form. However, the NLINMIX macro or the new NLMIXED procedure could be used.
APPENDIX

The general linear mixed model specifies that the data vector Y is represented by the equation

Y = Xβ + ZU + e
(24)
where E(U) = 0, E(e) = 0, V(U) = G and V(e) = R. Thus

E(Y) = Xβ
(25)
We assume that U and e are independent, and obtain

V(Y) = ZGZ′ + R
(26)
Thus, the general linear mixed model specifies that the data vector Y has a multivariate normal distribution with mean vector μ = Xβ and covariance matrix V = ZGZ′ + R. Generalized least squares theory (Graybill, reference [19], chapter 6) states that the best linear unbiased estimate of β is given by

b = (X′V⁻¹X)⁻¹X′V⁻¹Y
(27)
and the covariance matrix of the sampling distribution of b is

V(b) = (X′V⁻¹X)⁻¹
(28)
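A toy illustration (ours, with made-up matrices) of how (27) and (28) are computed once V is known; PROC IML uses the backquote for transposition:

proc iml;
   X = {1 0, 1 1, 1 2};                           /* assumed design matrix        */
   y = {1.2, 1.9, 3.1};                           /* assumed data vector          */
   V = {1.0 0.3 0.0, 0.3 1.0 0.3, 0.0 0.3 1.0};   /* assumed covariance matrix V  */
   Vinv = inv(V);
   b  = inv(X`*Vinv*X) * X`*Vinv*y;               /* GLS estimate, equation (27)  */
   Vb = inv(X`*Vinv*X);                           /* its covariance, equation (28)*/
   print b Vb;
quit;

In practice PROC MIXED replaces V by its REML or ML estimate, as described below.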
The BLUE of a linear combination a′β is a′b, and its variance is a′(X′V⁻¹X)⁻¹a. More generally, the BLUE of a set of linear combinations Aβ is Ab, and its sampling distribution covariance matrix is A(X′V⁻¹X)⁻¹A′. Thus, the sampling distribution of Ab is multivariate normal with mean vector E(Ab) = Aβ and covariance matrix A(X′V⁻¹X)⁻¹A′. Inference procedures for the general linear mixed model are based on these principles. However, the estimate b = (X′V⁻¹X)⁻¹X′V⁻¹Y and its covariance matrix V(b) = (X′V⁻¹X)⁻¹ both are functions of V = ZGZ′ + R, and in most all cases V will contain unknown parameters. Thus, an estimate of V must be used in its place. Usually, elements of G will be functions of one set of parameters, and elements of R will be functions of another set. The MIXED procedure estimates the parameters of G and R, using by default the REML method, or the ML method, if requested by the user. Estimates of the parameters are then inserted into G and R in place of the true parameter values to obtain V̂. In turn, V̂ is used in place of V to compute b̂ and V̂(b̂). Standard errors of estimates of linear combinations are computed as (V̂(a′b̂))^(1/2) = (a′(V̂(b̂))a)^(1/2). Statistics for tests of fixed effects are computed as F = b̂′A′(A V̂(b̂) A′)⁻¹A b̂ / rank(A). In some cases, the distributions of F are, in fact, F distributions, and in other cases they are only approximate. Degrees of freedom for the numerator of the F statistic are given by the rank of A,
but computation of degrees of freedom for the denominator is a much more difficult problem. One possibility is a generalized Satterthwaite approximation as given by Fai and Cornelius [20]. The interested reader is also referred to McLean and Sanders [21] for further discussion on approximating degrees of freedom, and to Hulting and Harville [22] for some Bayesian and non-Bayesian perspectives on this issue. For more information on analysis of repeated measures data, see Diggle et al. [23] and Verbeke and Molenberghs [24].

REFERENCES
1. Cnaan A, Laird NM, Slasor P. Using the general linear mixed model to analyze unbalanced repeated measures and longitudinal data. Statistics in Medicine 1997; 16:2349–2380.
2. Dixon WJ (ed.). BMDP Statistical Software Manual, Volume 2. University of California Press: Berkeley, CA, 1988.
3. SAS Institute Inc. SAS/STAT Software: Changes and Enhancements through Release 6.12. SAS Institute Inc.: Cary, NC, 1997.
4. Diggle PJ. An approach to the analysis of repeated measures. Biometrics 1988; 44:959–971.
5. Wolfinger RD. Covariance structure selection in general mixed models. Communications in Statistics, Simulation and Computation 1993; 22(4):1079–1106.
6. Dawson KS, Gennings C, Carter WH. Two graphical techniques useful in detecting correlation structure in repeated measures data. American Statistician 1997; 45:275–283.
7. Milliken GA, Johnson DE. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman and Hall: New York, 1992.
8. Searle SR, Casella G, McCulloch CE. Variance Components. Wiley: New York, 1991.
9. Cressie NAC. Statistics for Spatial Data. Wiley: New York, 1991.
10. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974; AC-19:716–723.
11. Schwarz G. Estimating the dimension of a model. Annals of Statistics 1978; 6:461–464.
12. Kass RE, Wasserman L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 1995; 90:928–934.
13. Huynh H, Feldt LS. Estimation of the Box correction for degrees of freedom from sample data in the randomized block and split plot designs. Journal of Educational Statistics 1976; 1:69–82.
14. Greenhouse SS, Geisser S. On methods in the analysis of profile data. Psychometrika 1959; 32:95–112.
15. Littell RC, Milliken GA, Stroup WW, Wolfinger RD. SAS System for Mixed Models. SAS Institute Inc.: Cary, NC, 1996.
16. Robinson GK. That BLUP is a good thing: the estimation of random effects. Statistical Science 1991; 6:15–51.
17. Puntanen S, Styan GPH. On the equality of the ordinary least squares estimator and the best linear unbiased estimator. American Statistician 1989; 43:153–164.
18. Little RJ, Rubin DB. Statistical Analysis with Missing Data. Wiley: New York, 1987.
19. Graybill FA. Theory and Application of the Linear Model. Wadsworth & Brooks/Cole: Pacific Grove, CA, 1976.
20. Fai AHT, Cornelius PL. Approximate F-tests of multiple degrees of freedom hypotheses in generalized least squares analyses of unbalanced split-plot experiments. Journal of Statistical Computing and Simulation 1996; 54:363–378.
21. McLean RA, Sanders WL. Approximating degrees of freedom for standard errors in mixed linear models. In Proceedings of the Statistical Computing Section. American Statistical Association, 50–59.
22. Hulting FL, Harville DA. Some Bayesian and non-Bayesian procedures for the analysis of comparative experiments and for small area estimation: computational aspects, frequentist properties, and relationships. Journal of the American Statistical Association 1991; 86:557–568.
23. Diggle PJ, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. Oxford University Press: New York, 1994.
24. Verbeke G, Molenberghs G. Linear Mixed Models in Practice. Springer: New York, 1997.
TUTORIAL IN BIOSTATISTICS

Covariance models for nested repeated measures data: analysis of ovarian steroid secretion data‡

Taesung Park¹,∗,† and Young Jack Lee²,³

¹ Department of Statistics, Seoul National University, San 56-1 Shillim-Dong, Kwanak-Gu, Seoul, 151-742, Korea
² Biometry and Mathematical Statistics Branch, National Institute of Child Health and Human Development, Bethesda, MD 20892, U.S.A.
³ Hanyang University, Seoul, Korea
SUMMARY

We consider several covariance models for analysing repeated measures data from a study of ovarian steroid secretion in reproductive-aged women. Urinary oestradiol and serum oestrogen were repeatedly observed over three or four menstrual periods, each period separated by one year. For each menstrual period, daily first morning urine specimens were collected 8 to 18 times, and serum specimens 2 to 5 times. Thus, measurements were repeatedly observed over menstrual cycle days within menstrual periods. Owing to missing observations, the number of observations differed from subject to subject. In this study, there were two repeat factors: menstrual cycle day and menstrual period. The first repeat factor, cycle day, is nested within the second repeat factor, menstrual period. In analysing these nested repeated measures data, a correlation structure that accounts for both repeat factors should be modelled. We present several covariance models for defining appropriate covariance structures for these data. Published in 2002 by John Wiley & Sons, Ltd.

KEY WORDS:
composite covariance; covariance structures; random effect models
1. INTRODUCTION

Repeated measures data arise when an outcome variable of interest is measured repeatedly over time or under different experimental conditions on the same subject. Since the repeated measures from the same subject are correlated, multivariate methods that account for the correlation among responses are commonly used. Widely used models include the multivariate linear mixed models with random subject effects proposed by Harville [1] and Laird and Ware [2], and the structured covariance models proposed by Jennrich and Schluchter [3].
∗ Correspondence to: Taesung Park, Department of Statistics, Seoul National University, San 56-1 Shillim-Dong, Kwanak-Gu, Seoul, 151-742, Korea.
† E-mail: [email protected]
‡ This article is a U.S. Government work and is in the public domain in the U.S.A.
Contract/grant sponsor: KOSEF (through the Statistical Research Center for Complex Systems at Seoul National University)
The models studied by Laird and Ware [2] and Jennrich and Schluchter [3] usually assume a special form of covariance structure. These methods use maximum likelihood (ML) or restricted maximum likelihood (REML) estimation to obtain estimates of the model parameters, and iterative algorithms for parameter estimation are generally required. Statistical programs such as the BMDP 5V program and the SAS MIXED procedure have implemented this methodology.

The analysis of repeated measures data often aims to test whether or not there are significant treatment differences, or whether or not some covariates have significant effects on the outcome measure. Since the significance of covariate effects can depend strongly on the choice of covariance structure, it is important to choose an appropriate covariance matrix that takes intrasubject correlations into consideration. However, the choice of an appropriate covariance structure is not always obvious. In situations where the number of time points is relatively small and the data are balanced and complete, the unstructured covariance model offers the advantage of not requiring any assumption about the covariance structure. However, when the number of observations per subject is large, it is more efficient to use a structured covariance model; in fact, in such a situation we often experience convergence failures when using the unstructured covariance model.

In this paper, we analyse repeated measures data from a study of ovarian steroid secretion in 175 reproductive-age women [4]. From each subject, the first morning urine specimen was collected and its urinary oestradiol level was measured 8 to 18 times per menstrual period over four periods, where each period was separated by one year. In addition, the serum oestrogen level was obtained repeatedly from blood specimens; serum specimens were collected 2 to 5 times per menstrual period over three periods. Thus, measurements were repeatedly observed over menstrual cycle days within menstrual periods. In this study there are two repeat factors: one is the menstrual cycle day on which specimens were collected and the other is the menstrual period (or visit). We call these the inner repeat factor (IRF) and the outer repeat factor (ORF), respectively. The IRF is characterized by menstrual cycle day and represents the within-menstrual-period repeat factor. The ORF is characterized by visit or period and represents the between-menstrual-period repeat factor. Since the IRF is nested within the ORF, we call the data nested repeated measures data.

In analysing nested repeated measures data, the issue is what covariance structure to choose. The chosen structure has to account for both repeat factors; in our case, the correlation between cycle days and also the correlation between menstrual periods should be modelled. Nested repeated measures are different from doubly repeated measures. Littell et al. [5] first used the term doubly repeated measures, but the data structure to which they applied the term is different from our case. In doubly repeated measures there is only one repeat factor but a multivariate outcome is collected at each observation point; in nested repeated measures there are two repeat factors but a univariate outcome is collected at each observation point. Our focus in this paper is how to model the covariance structure for nested repeated measures data.
In Section 2, we describe the study of ovarian steroid secretion that motivated this paper. In Section 3, we introduce notation and models for nested repeated measures data. Section 4 discusses the covariance structures for these data and presents several ways of defining appropriate covariance structures. We present the analysis results in Section 5. Final remarks are given in Section 6.
2. OVARIAN STEROID SECRETION DATA

A total of 175 women were recruited for the study between 1985 and 1988 from the family planning clinics of the King County Hospital Center and the State University Hospital in Brooklyn, New York [4]. Among these, 118 women were seeking tubal ligation and 57 were fertile women using a variety of contraceptive methods and not seeking tubal ligation. Women were eligible for the study if they were between 21 and 36 years of age, had experienced at least one pregnancy, were not pregnant or trying to get pregnant, and reported having menstrual cycles at intervals ranging from 23 to 38 days and lasting for 2–7 days with moderate flow. At baseline, all subjects were interviewed regarding their menstrual history, pregnancies, past and present contraceptive use, sexual activity, and use of cigarettes and alcohol. During the baseline study cycle, all women underwent a complete physical examination including height and weight measurement.

Blood and urine specimens were collected during all study menstrual cycles. Urinary oestradiol levels and serum oestrogen levels were repeatedly observed for each subject over four and three menstrual periods, respectively. At each study menstrual period, daily first morning urine specimens were collected 8 to 18 times, and serum blood specimens were collected 2 to 5 times. The number of observations differed from subject to subject, because subjects missed visits or failed to collect urine specimens.

In Table I we show the data for the first study subject, on whom we have 52 observations. The visit column represents the study menstrual period, where 0 is the baseline visit, 1 is the year-1 visit and so forth. Cycle day 0 is the estimated day of ovulation, which was estimated by a single investigator using the basal body temperature record and the dates of onset of the preceding and subsequent menstrual flows. Each cycle day was then assigned a number of days deviating from the day of ovulation, which is conveniently called 'the ovulation day'. For example, cycle day −2 means the day that precedes the ovulation day by two days. All blood and urine specimens were assigned a cycle day corresponding to the number of days plus or minus from the estimated day of ovulation. Cycle days range from −2 to 13. The fourth and fifth columns of Table I display the urinary oestradiol and serum oestrogen levels. Note that this subject had 11 baseline urinary oestradiol measurements, 13 year-1 visit measurements, 13 year-2 visit measurements and 15 year-3 visit measurements. She had three serum oestrogen measurements each for the baseline, year-1 and year-2 visits. Note that the intervals between adjacent serum level observations vary. The last five columns of Table I show the values of the covariates: age at baseline, menage (age at menarche), smoking (smoking status), weight and treat (treatment group). Age and menage are time-independent covariates, while smoking and weight are time-dependent covariates.

Figure 1 shows the mean values of urinary oestradiol by cycle day. The four plots show the mean values for the baseline year, year 1, year 2 and year 3, respectively. These plots show similar patterns: the urinary oestradiol level rises until it reaches a peak near the day of ovulation (cycle day 0), falls until 3 or 4 days after the day of ovulation, rises again until it reaches a second peak near cycle day 10, and then falls again. Figure 2 shows the mean values of serum oestrogen over three menstrual periods.
Figure 2 is based on fewer data points and thus the pattern is not as clear as in Figure 1. These patterns of
urinary oestradiol levels and serum oestrogen levels are commonly observed in many standard textbooks [6, 7]. Figures 3 and 4 show the within-person line plots that provide information about the intraperson and interperson variability in the hormones. Each line represents the measurements from the same subject. In Figure 3, the lines become sparser from visit 0 to visit 3 due to
Table I. Urinary oestradiol levels and serum oestrogen levels for the first subject. Columns of the original table: observation number, visit, cycle day, urinary oestradiol, serum oestrogen, age, menage, smoking, weight and treat.

Visit 0 (observations 1–11; age 31, menage 11, smoking N, weight 125, treat 1):
   cycle days −2, −1, 0, 1, 2, 3, 4, 5, 6, 7, 8
   urinary oestradiol 142.5, 159.1, 81.7, 44.9, 7.4, 7.9, 44.3, 73.4, 58.3, 91.5, 51.4

Visit 1 (observations 12–24; age 31, menage 11, smoking Y, weight 121, treat 1):
   cycle days −1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
   urinary oestradiol 22.0, 109.5, 58.9, 53.0, 46.6, 62.5, 51.2, 78.3, 93.4, 78.2, 101.7, 83.5, 66.5

Visit 2 (observations 25–37; age 31, menage 11, smoking Y, weight 121, treat 1):
   cycle days 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
   urinary oestradiol 21.4, 145.2, 131.8, 56.0, 86.5, 78.8, 63.5, 92.8, 8.0, 93.2, 82.2, 88.7, 10.7

Visit 3 (observations 38–41; age 31, menage 11, smoking Y, weight 134, treat 1):
   cycle days −2, −1, 0, 1
   urinary oestradiol 122.1, 22.2, 22.4, 127.8

Serum oestrogen (nine values, three in each of visits 0–2): 323.6, 56.7, 121.5, 64.3, 112.4, 127.4, 124.4, 121.4, 76.4
Table I. Continued.

Visit 3 (observations 42–52; age 31, menage 11, smoking Y, weight 134, treat 1):
   cycle days 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
   urinary oestradiol 61.9, 77.8, 62.1, 89.3, 62.3, 81.3, 91.6, 76.7, 129.1, 86.5, 9.7
Figure 1. Plots of urinary oestradiol mean levels.
the smaller number of subjects. The same pattern is observed in Figure 4. Although these plots show wide variability in measures, they reveal some up-and-down patterns observed in Figures 1 and 2. It is a typical pattern of repeated measures data that the number of missing observations increases with time. Table II summarizes the number of subjects who had urinary oestradiol
Figure 2. Plots of serum oestrogen mean levels.
values observed at the corresponding menstrual cycle day during the mid-luteal period, say days 6 to 10. The number of observations decreases very rapidly over visits. We assume missing observations are 'missing completely at random' (MCAR) or 'missing at random' (MAR) in the sense of Rubin [8]. Under the MCAR mechanism, the process generating the missing data is independent of both the observed and the unobserved data. Under the MAR mechanism, the missing data depend on the observed data but remain independent of the unobserved data. Laird [9] indicated that under these two scenarios likelihood-based inferences may proceed with the data actually captured, and called the missing data mechanism 'ignorable'. Here we assume that the missing data are ignorable.

The primary objective of the ovarian steroid secretion study was to determine whether tubal ligation would alter the secretion of the steroids progestin and oestrogen in reproductive-aged women [4]. There were two types of tubal ligation, the cautery and the clip. The 118 women seeking voluntary tubal ligation were randomly assigned to either the clip or the cautery but, because there was no evidence of any difference between the two procedures in affecting oestrogen levels, we analyse the 118 women as one group in Section 5. We compare the 118 women to the 57 control women after adjusting for age, menage and other possible breast cancer risk factors. In analysing the effects of covariates on hormone levels, we first need to adjust for the effects of cycle day on hormone levels. We used the polynomial regression approach to model the cycle day effects on hormone levels [10], and assumed common cycle day effects for all subjects. Since the complete analysis of the ovarian steroid secretion data is not the main focus of the paper, we focus on analysing the measurements observed during the mid-luteal period, cycle days between 6 and 10. For this period, models with a lower-order polynomial may fit the data well.
Figure 3. Plots of individual urinary oestradiol levels.
Figure 4. Plots of individual serum oestrogen levels.
Table III shows some summary statistics of the covariates at baseline observed in the mid-luteal period. The mean age was 29.8 and the mean age at menarche was 12.5.
Table II. Number of women who had observations for urinary oestradiol levels between menstrual cycle days 6 to 10.

Visit   Cycle day   Frequency
0       6           162
0       7           164
0       8           164
0       9           159
0       10          158
1       6           129
1       7           120
1       8           124
1       9           117
1       10          122
2       6           92
2       7           93
2       8           92
2       9           89
2       10          84
3       6           55
3       7           57
3       8           56
3       9           51
3       10          53
Table III. Baseline characteristics of reproductive-age women in a study of tubal ligation.

Variable          N     MEAN      STD      Minimum   Maximum
Age at baseline   169   29.787    3.772    21.000    36.000
Age at menarche   169   12.515    1.666    8.000     17.000
Weight, pounds    167   150.317   38.899   90.000    270.000
The mean weight was 150.3 pounds, though it varied over the years. Table IV shows the numbers of reproductive-age women (N), the mean values (MEAN) and the standard errors (STD) of the ovarian hormones in the three treatment groups. Here 1 represents the tubal ligation group with the clip, 2 the tubal ligation group with the cautery, and 3 the control group. For the serum oestrogens, the second treatment group tended to have larger mean values than the other groups at baseline and year 1, but not at year 2. For the urinary oestradiols, the third treatment group tended to have larger mean values than the other groups at baseline, year 1 and year 2. Detailed analyses of these covariates are given in Section 5. We present analysis results based on models with the covariates age, menage, weight, smoking and treat, after adjusting for cycle day effects. We exclude other risk factors that were not significant at the 10 per cent significance level in the preliminary analyses. A fuller medical paper will present the full scope of the study and data analysis. The main focus of this paper is to show how to choose covariance models for defining appropriate covariance structures for nested repeated measures data.
Table IV. Ovarian hormone values of reproductive-age women for three treatment groups. Entries are N, MEAN, STD†.

Visit   Treatment*   Serum oestrogen (N, MEAN, STD)   Urinary oestradiol (N, MEAN, STD)
0       1            57, 120.412, 45.033              59, 51.68, 32.229
0       2            48, 140.494, 67.864              56, 51.987, 29.573
0       3            41, 120.072, 39.13               54, 54.64, 31.849
1       1            42, 120.737, 45.426              45, 48.261, 27.252
1       2            41, 122.25, 48.723               42, 54.237, 32.985
1       3            35, 111.753, 38.585              45, 58.234, 30.069
2       1            35, 122.411, 59.857              32, 53.277, 37.633
2       2            32, 124.788, 58.132              36, 49.955, 36.317
2       3            23, 125.165, 47.894              31, 59.897, 36.701
3       1            not assayed                      23, 57.802, 43.139
3       2            not assayed                      19, 53.388, 33.245
3       3            not assayed                      15, 50.588, 29.53

* 1 = tubal ligation with the clip, 2 = tubal ligation with the cautery, 3 = control.
† N = number of subjects, MEAN = mean, STD = standard error.
3. NOTATION AND MODELS

Suppose that there are n subjects in the study. Let i (= 1, ..., n) denote the ith subject and j (= 0, ..., J) the jth visit, where J = 3 for the urine specimen data and J = 2 for the serum specimen data (the year 3 serum sample was not assayed for oestrogen). For the ith subject, suppose that t_ij observations were made at visit j. Let y_ijk represent the kth (k = 1, ..., t_ij) observation obtained from the ith subject at the jth year, and let y_ij = (y_ij1, ..., y_ijt_ij)' denote
the observation vector. Then the observation vector for the ith subject, y_i, is given by

    y_i = (y_i0', y_i1', ..., y_iJ')'

where y_ij is the t_ij × 1 vector of observations at year j (j = 0, ..., J). Note that y_i is a t_i × 1 vector, where t_i = t_i0 + t_i1 + ... + t_iJ. Observations from the same subject are correlated, and observations from different subjects are uncorrelated. Let x_i be the t_i × p matrix of covariate values, where x_ij is the t_ij × p covariate matrix for the ith subject (i = 1, ..., n) at the jth year (j = 0, ..., J). Then the regression model can be written as the linear model proposed by Jennrich and Schluchter [3]:
    y_i = x_i β + e_i
(1)
where e_i is a t_i × 1 error vector having a multivariate normal distribution with an appropriate covariance matrix, and β is the parameter vector of interest. Our interest is to determine whether or not some covariates have significant effects on the outcome observation. In the ovarian steroid secretion study, a covariate distinguishing the tubal ligation treatment group from the control group is included in the model. In addition to this treatment covariate, all covariates in Table I are included in the model. In Section 5, we apply this model to the oestrogen data and present the data analysis results.
4. COVARIANCE MODELS

In many repeated measures data, the covariance parameters that incorporate intrasubject correlations into the regression model are treated as nuisance parameters, because they often do not provide relevant information regarding the study objectives. Our interest is in estimating regression parameters and their variances. Nevertheless, we need to make an appropriate covariance assumption in order to get efficient regression parameter estimates. In this section we describe covariance structures that can be applied to the nested repeated measures data. Let

    var(e_i) = [ var(e_i0)         cov(e_i0, e_i1)   ...   cov(e_i0, e_iJ)
                 cov(e_i1, e_i0)   var(e_i1)         ...   cov(e_i1, e_iJ)
                 ...               ...               ...   ...
                 cov(e_iJ, e_i0)   cov(e_iJ, e_i1)   ...   var(e_iJ)      ]
Note that var(e_ij) represents the variance-covariance matrix within the same menstrual period (visit) for j = 0, ..., J, and cov(e_ij, e_ij′) represents the covariance matrix between the jth and j′th menstrual periods (visits). We may apply the compound symmetry (CS, or exchangeable) covariance structure, the first-order autocorrelation (AR(1)) structure, or the unstructured (UN) covariance model to the IRF. We show the three covariance structures for a subject with three serum oestrogen
measurements at visit j as an example. For t_ij = 3:

    CS:     σ² [ 1  ρ  ρ
                 ρ  1  ρ
                 ρ  ρ  1 ]

    AR(1):  σ² [ 1   ρ   ρ²
                 ρ   1   ρ
                 ρ²  ρ   1 ]

    UN:        [ σ11  σ12  σ13
                 σ12  σ22  σ23
                 σ13  σ23  σ33 ]
Table I shows that the first study subject had 52 urinary oestradiol observations and nine serum oestrogen observations. For this subject, t_i = 52 for the urinary data and t_i = 9 for the serum data, so we have 52 × 52 and 9 × 9 covariance matrices for the urinary and serum data, respectively. If we restrict the analysis to the mid-luteal period, cycle days 6 to 10, we have lower-dimensional covariance matrices: 19 × 19 and 3 × 3 for the urinary and serum data, respectively. In general, however, it would not be easy to derive these high-dimensional matrices by extending the above 3 × 3 covariance matrices. The CS structure seems too restrictive to use, and the AR(1) structure would not be appropriate because the intervals between observations vary greatly. In addition, CS and AR(1) structures are too simple: CS assumes the correlation between two observations 2 years apart is the same as that between two observations in the same observational period, while AR(1) assumes, for example, that the correlation between observations 24 and 25 (one year apart) in subject 1 is the same as that between observations 23 and 24 (one day apart). We now propose three covariance structures for analysing the ovarian steroid secretion study data: (i) the spatial (SP) covariance model using a continuous AR(1) correlation; (ii) the composite (COMP) covariance model; (iii) the random effects model.
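To make the three within-period structures concrete before turning to these models, the short SAS/IML sketch below writes them out numerically for t_ij = 3. This is not part of the original analysis: the variance σ² = 4, the correlation ρ = 0.5 and the UN entries are illustrative values assumed here only to show the shapes of the matrices. (In the SAS MIXED procedure the corresponding structures are requested with TYPE=CS, TYPE=AR(1) and TYPE=UN.)

   /* Illustrative sketch only (not from the paper): the CS, AR(1) and UN   */
   /* structures written out as 3 x 3 matrices in SAS/IML. sigma2, rho and  */
   /* the UN entries are arbitrary assumed values.                          */
   proc iml;
      sigma2 = 4;                                    /* common variance          */
      rho    = 0.5;                                  /* correlation parameter    */

      cs  = sigma2 # ( (1-rho)#I(3) + rho#j(3,3,1) );  /* compound symmetry      */

      t   = {1, 2, 3};                               /* measurement indices      */
      d   = abs( t*j(1,3,1) - j(3,1,1)*t` );         /* |i - j| lag matrix       */
      ar1 = sigma2 # (rho ## d);                     /* AR(1): sigma2*rho^|i-j|  */

      un  = {4.0 1.2 0.8,                            /* unstructured: any        */
             1.2 3.0 1.5,                            /* symmetric positive-      */
             0.8 1.5 5.0};                           /* definite matrix          */

      print cs, ar1, un;
   quit;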
4.1. Spatial covariance models

SP models use the actual distance between the two menstrual days on which a pair of observations was collected. Spatial structures are widely used to analyse geostatistical data [11]. We consider two types of spatial covariance models for the ovarian secretion data. In one type, denoted by SP(EXP), we assume that the covariance between y_ijk and y_ijk′ is σ² exp(−|d_ijk − d_ijk′| / θ), where θ > 0 and where d_ijk and d_ijk′ represent the two cycle days at which y_ijk and y_ijk′ are observed, respectively.
In the other type, denoted by SP(POW), we assume that the covariance between y_ijk and y_ijk′ is σ² ρ^|d_ijk − d_ijk′|. Note that both types of correlation have the common property that the size of the correlation decreases as |d_ijk − d_ijk′| increases.

4.2. Composite covariance models

Under the COMP covariance model, separate covariance structures are specified for each of the two repeat factors. In the ovarian steroid secretion study, menstrual cycle day and period are the two repeat factors. Thus, the covariance model between cycle days within a period and the covariance model across periods must be specified. For example, in Table I the first subject has three serum oestrogen values observed at three cycle days for each of three visits: baseline, year 1 and year 2. The cycle day is the IRF, and the period is the ORF. In order to define the COMP covariance structure, we first introduce the direct product or right Kronecker product, denoted by @ (Searle, reference [12], p. 265). For the p × q matrix A and m × n matrix B, A@B is defined as

    A@B = [ a11 B  ...  a1q B
            ...         ...
            ap1 B  ...  apq B ]

Using the direct product @, the COMP covariance structure for the IRF and the ORF can be described as (Structure 1) @ (Structure 2), where (Structure 1) is the covariance structure assumed for the IRF and (Structure 2) for the ORF. If we assume the UN covariance matrix for the IRF and the AR(1) for the ORF, then the COMP covariance model is described as UN@AR(1). For the subject who had nine oestrogen measurements with three per visit, for example, her covariance matrix under UN@AR(1) is given by

    [ σ11  σ12  σ13         [ 1   ρ   ρ²
      σ21  σ22  σ23    @      ρ   1   ρ
      σ31  σ32  σ33 ]         ρ²  ρ   1 ]

If we assume the AR(1)@UN covariance matrix, namely the AR(1) for the IRF and the UN for the ORF, then the covariance matrix of the nine oestrogen measurements is written

    [ 1   ρ   ρ²         [ σ11  σ12  σ13
      ρ   1   ρ     @      σ21  σ22  σ23
      ρ²  ρ   1 ]          σ31  σ32  σ33 ]
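As a purely numerical illustration of the direct product just defined (again, not part of the paper's analysis), the SAS/IML operator @ builds the 9 × 9 composite matrices from assumed 3 × 3 blocks; the UN entries and ρ = 0.4 below are hypothetical values chosen only for the example. In the Appendix code the composite structures are requested in PROC MIXED with TYPE=UN@AR(1), the order of the repeated effects (visit day1 versus day1 visit) controlling which factor is paired with which block.

   /* Illustrative sketch only: composite (Kronecker) covariance matrices   */
   /* built with the SAS/IML direct-product operator @. All numeric values  */
   /* are assumed for the purpose of the example.                           */
   proc iml;
      un  = {4.0 1.2 0.8,
             1.2 3.0 1.5,
             0.8 1.5 5.0};                 /* unstructured 3 x 3 block       */
      rho = 0.4;
      d   = {0 1 2, 1 0 1, 2 1 0};         /* lags between the occasions     */
      ar1 = rho ## d;                      /* AR(1) correlation matrix       */

      v1  = un  @ ar1;                     /* 9 x 9: UN for one repeat       */
      v2  = ar1 @ un;                      /* factor, AR(1) for the other,   */
                                           /* and the reverse ordering       */
      print v1, v2;
   quit;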
The number of total urine samples is much larger than that of serum samples. For example, the first study subject (Table I) had 52 urinary oestradiol measurements, with 13 to 15 urine observations per visit. In such a case, assuming the UN structure for the IRF can be a problem, because there are too many parameters to be estimated. In fact, when we applied the UN@AR(1) covariance model to the urine data from the whole range of cycle days, we encountered a convergence problem. Convergence problems commonly occur in fitting the UN model to unbalanced data. For the ovarian steroid secretion study, the AR(1)@UN assumption seems more appropriate, because it allows an arbitrary covariance for observations between two different periods but assumes the AR(1) structure for observations within the same period. This type of composite covariance structure was considered by Galecki [13] to specify two or more repeat factors in longitudinal data, and software is readily available for such a covariance structure (see the SAS MIXED procedure).

4.3. Random effects models

The third model is a random effects model. The simplest random effects model is the model with one random effect, called the subject effect. Under this model, measurements from a subject are assumed to be equally correlated. This random effects model is equivalent to the model with no random effects but with the CS covariance structure. In the ovarian steroid secretion study, menstrual cycle day and period are the two repeat factors. By adding two more random effects, for cycle day and period, we can fit a model assuming that two measurements from different periods but with the same cycle day are equally correlated with correlation coefficient ρ_d, and two measurements from the same period are equally correlated with correlation coefficient ρ_p. The model with three random effects, subject, cycle day and period, is more appropriate for the ovarian steroid secretion data than the simple CS model with one random subject effect.

5. RESULTS

In this section we apply the covariance models of Section 4 to the urinary oestradiol data and the serum oestrogen data separately. Covariates in the regression model are age (age at baseline), menage (age at menarche), weight, smoking and treat (treatment group). According to some preliminary analyses, the two tubal ligation groups, the cautery group and the clip group, were not significantly different, and thus the two tubal ligation groups were pooled into one group. We include the 'treat' variable in the final regression model. The 'treat' variable is defined as

    treat = 1 for clip or cautery (118 women)
            2 for control (57 women)

Age, menage and treat are time-independent (or between-subject) covariates, and weight and smoking are time-dependent (or within-subject) covariates. We assume that the time-dependent variables do not change their values within the ORF variable, namely the period variable. For the SP model, both the SP(EXP) and SP(POW) models were applied. In applying the SP models, we encountered a convergence problem when the time interval between consecutive observations was over 50 time units; in fact, we found this number 50 by trial and error. For simplicity and practicality, we assumed that the interval between visits was 50 days. One other note: when the correlation was low, SP(EXP) and SP(POW) produced indistinguishable results.
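A minimal sketch of that 50-day convention is given below. The dataset and variable names (total5, day, visit) follow the Appendix, and the offset of 50 days per visit is simply the working assumption just described.

   /* Sketch of the 50-day spacing assumption used for the spatial models.   */
   /* day_all places successive visits 50 'days' apart, as in the Appendix.  */
   data total5;
      set total5;
      day_all = day + 50*visit;  /* visit 0: day; visit 1: day+50; ...; visit 3: day+150 */
   run;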
For the COMP model, the covariance for the IRF and the covariance for the ORF must be specified. In this data analysis, both the AR(1)@UN and UN@AR(1) models were applied: AR(1)@UN assumes the AR(1) structure for the IRF (cycle days) and the UN structure for the ORF (period), while UN@AR(1) assumes the UN structure for the IRF and the AR(1) covariance matrix for the ORF. For the random effects model, we add three random effects, subject, cycle day and period, to the model.

As shown in Figures 1 and 2, the hormone levels change greatly over the cycle days. How do we handle the effect of cycle day on hormone levels in the model? One approach is to fit a higher-order polynomial function. For simplicity, we present the results only for the mid-luteal period, which corresponds to cycle days 6 to 10. Within this mid-luteal period, the hormone level may be viewed as a second- or third-order polynomial function of cycle day. To determine the order of the polynomial, we fit models by the forward method (Draper and Smith, reference [10], Chapter 12). For the SP model, for example, Table V shows the estimates and p-values for the five covariates of interest as well as the estimates of the pth-order polynomial cycle day effects. The results are summarized for p = 1 to p = 4; since we fit the model for the mid-luteal period, the value of p is at most 4. The estimates and p-values turned out to be insensitive to the order of the polynomial. For example, estimates of the smoking coefficient range from −0.2064 to −0.2067 and its p-values are close to 0.0085. Based on these results, we chose the quadratic model for cycle days. For other covariance models, such as the COMP models, we obtained similar results. Thus, we fit the models with the five covariates and the quadratic model for cycle days.

Data analysis results for the urinary oestradiol are provided in Table VI. The SAS code for fitting all covariance models is listed in the Appendix. For each of the five covariates, smoking, age, weight, menage and treat, the estimate of the regression coefficient, its standard error (SE) and the p-value are presented. N represents the number of subjects who have at least one observation during cycle days 6 to 10. The results are quite consistent. Smoking status and age were significant covariates, but the others were not. Smoking has a negative effect on the urinary oestradiol levels; its p-values vary from 0.0001 to 0.0116 depending on the model. We can conclude that the urinary oestradiol level is lower for smokers than for non-smokers. Regardless of the underlying statistical model, the effect of age on the urinary oestradiol levels was negative; its p-values range from 0.0003 to 0.0237 depending on the model. The urinary oestradiol level is higher among the younger women. Weight, age at menarche and tubal ligation did not have significant effects on the urinary oestradiol level. Smoking status and age are the only covariates that have significant negative effects on the urinary oestradiol level.

Results of the serum oestrogen data analysis are presented in Tables VII and VIII. For the SP model, Table VII shows the estimates and p-values for the five covariates of interest and the estimates of the pth-order polynomial cycle day effects. The results are summarized for p = 1 to p = 4. As with the urinary oestradiols, the estimates and p-values were not affected by the order of the polynomial. For example, the estimates of the smoking coefficient range from 0.0899 to 0.0965 and its p-values from 0.1015 to 0.1283. Based on these results, we chose the quadratic model.
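For concreteness, a hedged sketch of one of the higher-order fits behind the sensitivity analyses of Tables V and VII is given below; it extends the quadratic SP(EXP) model of the Appendix to a cubic. The variable day3 is constructed here purely for illustration (only day1 and day2 are defined in the Appendix), and the remaining names follow the Appendix.

   /* Illustrative sketch of a third-order (p = 3) SP(EXP) fit of the kind   */
   /* summarized in Tables V and VII. day3 is created here for the example.  */
   data total5;
      set total5;
      day3 = day*day*day;                      /* cubic cycle-day term        */
   run;

   proc mixed method=ml data=total5 convf;
      class smoke studyno visit;
      model y = smoke day day2 day3 age weight menage treat / s;
      repeated / subject=studyno type=sp(exp)(day_all) local r;
      title 'SP(EXP) Covariance, cubic cycle-day trend: 5<day<11';
   run;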
The results using the quadratic model for cycle days are presented in Table VIII. The format is the same as in Table VI, and the same covariance models and covariates as in the urinary oestradiol analysis were considered. Estimated regression coefficients, their standard errors (SE) and p-values are shown. From Table VIII, we can observe that all models
Table V. Sensitivity analysis of urine oestradiol: the pth-order polynomial SP models using the SP(EXP) structure. Entries are estimates with p-values in parentheses; the columns labelled 1st–4th give the estimated first- to fourth-order cycle day terms.

Model            Smoking            Age                Weight            Menage             Treat              1st                2nd                3rd               4th
1st order model  −0.2066 (0.0085)   −0.0270 (0.0146)   0.0003 (0.7832)   −0.0013 (0.9599)   −0.0158 (0.5104)   0.0094 (0.0834)
2nd order model  −0.2064 (0.0086)   −0.0270 (0.0147)   0.0003 (0.7795)   −0.0013 (0.9613)   −0.0157 (0.5145)   0.2346 (0.0009)    −0.0141 (0.0014)
3rd order model  −0.2067 (0.0085)   −0.0269 (0.0148)   0.0003 (0.7797)   −0.0012 (0.9632)   −0.0158 (0.5120)   −1.1221 (0.1715)   0.1587 (0.1279)    −0.0072 (0.0971)
4th order model  −0.2067 (0.0085)   −0.0269 (0.0149)   0.0003 (0.7791)   −0.0012 (0.9639)   −0.0158 (0.5116)   7.3034 (0.5162)    −1.4593 (0.4987)   0.1292 (0.4770)   −0.0043 (0.4527)
Table VI. Analysis results of urine oestradiol. Entries are estimate (SE; p-value); N is the number of subjects with at least one observation during cycle days 6 to 10.

Model                  N    Smoking                   Age                       Weight                    Menage (age at menarche)   Treat (effect of tubal ligation)
SP model: SP(EXP)      167  −0.2064 (0.0784; 0.0086)  −0.0270 (0.0109; 0.0147)  0.0003 (0.0010; 0.7795)   −0.0013 (0.0262; 0.9613)   −0.0157 (0.0240; 0.5145)
SP model: SP(POW)      167  −0.2064 (0.0784; 0.0086)  −0.0270 (0.0109; 0.0147)  0.0003 (0.0010; 0.7795)   −0.0013 (0.0262; 0.9613)   −0.0157 (0.0240; 0.5145)
COMP model: AR(1)@UN   167  −0.2563 (0.0626; 0.0001)  −0.0259 (0.0076; 0.0008)  −0.0002 (0.0008; 0.7622)  −0.0083 (0.0184; 0.6536)   −0.0163 (0.0264; 0.5377)
COMP model: UN@AR(1)   167  −0.2729 (0.0673; 0.0001)  −0.0295 (0.0081; 0.0003)  −0.0005 (0.0008; 0.5141)  −0.0125 (0.0195; 0.5225)   −0.0143 (0.0287; 0.6193)
Random effects model   167  −0.1963 (0.0776; 0.0116)  −0.0252 (0.0111; 0.0237)  0.0004 (0.0010; 0.6765)   −0.0070 (0.0268; 0.7931)   −0.0106 (0.0224; 0.6360)
Table VII. Sensitivity analysis of serum oestrogen: the pth-order polynomial SP models using the SP(EXP) structure. Entries are estimates with p-values in parentheses; the columns labelled 1st–4th give the estimated first- to fourth-order cycle day terms.

Model            Smoking           Age                Weight            Menage             Treat              1st                2nd                3rd               4th
1st order model  0.0899 (0.1283)   −0.0171 (0.0298)   0.0003 (0.7309)   −0.0020 (0.9208)   −0.0214 (0.2335)   −0.0212 (0.0372)
2nd order model  0.0964 (0.1016)   −0.0171 (0.0288)   0.0002 (0.7925)   −0.0016 (0.9356)   −0.0224 (0.2085)   0.3773 (0.0148)    −0.0249 (0.0099)
3rd order model  0.0965 (0.1015)   −0.0171 (0.0287)   0.0002 (0.7973)   −0.0016 (0.9342)   −0.0224 (0.2086)   0.4700 (0.8058)    −0.0367 (0.8801)   0.0005 (0.9612)
4th order model  0.0941 (0.1112)   −0.0172 (0.0280)   0.0002 (0.8269)   −0.0018 (0.9269)   −0.0215 (0.2295)   13.3460 (0.5883)   −2.5142 (0.5956)   0.2098 (0.5997)   −0.0066 (0.6005)
Table VIII. Analysis results of serum oestrogen. Entries are estimate (SE; p-value); N is the number of subjects.

Model                  N    Smoking                  Age                       Weight                   Menage (age at menarche)   Treat (effect of tubal ligation)
SP model: SP(EXP)      146  0.0964 (0.0587; 0.1016)  −0.0171 (0.0078; 0.0288)  0.0002 (0.0008; 0.7925)  −0.0016 (0.0196; 0.9356)   −0.0224 (0.0178; 0.2085)
SP model: SP(POW)      146  0.0964 (0.0587; 0.1016)  −0.0171 (0.0078; 0.0288)  0.0002 (0.0008; 0.7925)  −0.0016 (0.0196; 0.9356)   −0.0224 (0.0178; 0.2085)
COMP model: AR(1)@UN   146  0.0743 (0.0565; 0.1897)  −0.0191 (0.0072; 0.0087)  0.0000 (0.0007; 0.9556)  −0.0108 (0.0183; 0.5569)   −0.0161 (0.0180; 0.3694)
COMP model: UN@AR(1)   146  0.0813 (0.0558; 0.1463)  −0.0181 (0.0069; 0.0093)  0.0001 (0.0007; 0.9160)  −0.0048 (0.0174; 0.7821)   −0.0113 (0.0197; 0.5671)
Random effects model   146  0.0989 (0.0588; 0.1203)  −0.0170 (0.0077; 0.0503)  0.0002 (0.0008; 0.7724)  −0.0016 (0.0196; 0.9343)   −0.0219 (0.0179; 0.2464)
produce comparable results. For the baseline age, all models except the random effects model yield p-values smaller than 0.05. Thus, we may conclude that the serum oestrogen level decreases significantly with age. No other covariates are significant.
6. REMARKS

In this paper we have considered three covariance models for modelling the covariance structure of nested repeated measures data. The SP, COMP and random effects models tend to produce comparable results. One characteristic of the ovarian steroid secretion study data was that the number of observations within a given cycle varied from subject to subject. Furthermore, the intervals between adjacent observations varied even within a subject; that is, the observation times were not equally spaced. Both the SP and COMP covariance models may be suitable for such unevenly spaced nested repeated measures data. However, there is a drawback in applying the SP model: when the time units of the ORF and the IRF are quite different, the correlation cannot be correctly estimated under the SP model. For example, the urinary oestradiol level measured on the ovulation day (cycle day 0) in year 1 may be highly correlated with the oestradiol levels observed on the corresponding ovulation days in year 2 and year 3, but the SP models put intervals of 50 and 100 days between years 1 and 2 and between years 1 and 3, respectively. Thus, these correlations are forced to be very low and are underestimated, while the correlations between measurements within the same period tend to be overestimated. COMP models are more flexible in this respect: they can assign high correlations to observations made on the same cycle day of different periods. Random effects models are more appropriate than the simple CS structure models because they allow more appropriate correlation structures. However, the random effects model also assumes that two measurements from the same cycle day are equally correlated, and likewise two measurements from the same period; that assumption might be too strong to hold in most practical situations. We think the COMP model is the most appropriate covariance model for analysing nested repeated measures data.

We have applied the SAS MIXED procedure to fit the COMP covariance model; the SAS code is listed in the Appendix. The MIXED procedure, however, has limited selections for the COMP model, namely UN@AR(1) and AR(1)@UN only, and more choices for the covariance model are needed. Some more technical comments about the covariance model are in order. When we fit the COMP UN@AR(1) model to the urinary oestradiol data for the whole interval, it failed to converge when the number of repeated data points within the IRF was large. Other models such as AR(1)@AR(1) would have been useful, but software for such a model is not yet available. In general, which covariance structure is most robust requires evaluation. When the correct covariance structure is not known, the use of a robust covariance matrix can be an alternative approach. The robust covariance matrix was considered by Liang and Zeger [14] in their generalized estimating equations (GEE) approach. We expect that an unstructured working covariance matrix suffers from the same convergence problem, so the use of a structured working covariance might be recommended. Unfortunately, there are some limitations to the practical use of this approach. The SAS MIXED procedure does not produce the robust variance estimates. Though the SAS GENMOD procedure can handle the GEE approach, it
cannot handle the COMP models or the random effects models, and supports only a very limited set of covariance structures.
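For readers who want to try the GEE route just mentioned, one possible sketch (using the Appendix's dataset and variable names, with an exchangeable working correlation chosen purely for illustration) is given below; as noted above, GENMOD cannot fit the composite structures discussed in this paper.

   /* Illustrative GEE sketch only: PROC GENMOD with an exchangeable working */
   /* correlation; empirical (robust) standard errors are reported for the   */
   /* REPEATED statement. Names follow the Appendix.                         */
   proc genmod data=total5;
      class studyno smoke;
      model y = smoke day day2 age weight menage treat / dist=normal link=identity;
      repeated subject=studyno / type=exch;
   run;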
APPENDIX: SAS MIXED PROCEDURE STATEMENTS

We first list the variables not introduced in the text:

   studyno  = subject id
   day      = menstrual cycle day
   day1     = day
   day2     = day*day
   day_all  = day         if visit = 0
              day + 50    if visit = 1
              day + 100   if visit = 2
              day + 150   if visit = 3

The SAS (V6.12) code for fitting the models to our data is listed below.
A.1. SP model

SP(EXP) model:

   proc mixed method=ml data=total5 convf;
      class smoke studyno visit;
      model y = smoke day day2 age weight menage treat basetrt / s;
      repeated / subject=studyno type=sp(exp)(day_all) local r;
      title 'SP(EXP) Covariance: 5<day<11';
   run;

SP(POW) model:

   proc mixed method=ml data=total5;
      class smoke studyno visit;
      model y = smoke day day2 age weight menage treat basetrt / s;
      repeated / subject=studyno type=sp(pow)(day_all) local r;
      title 'SP(POW) Covariance: 5<day<11';
   run;
A.2. COMP model

AR(1)@UN model:

   proc mixed method=ml data=total5;
      class smoke studyno visit day1;
      model y = smoke day day2 age weight menage treat basetrt / s;
      repeated visit day1 / subject=studyno type=un@ar(1) r;
      title 'AR(1)@UN Covariance: 5<day<11';
   run;
UN@AR(1) model:

   proc mixed method=ml data=total5 convf;
      class smoke studyno visit day1;
      model y = smoke day day2 age weight menage treat basetrt / s;
      repeated day1 visit / subject=studyno type=un@ar(1) r;
      title 'UN@AR(1) Covariance: 5<day<11';
   run;

A.3. Random effects model

   proc mixed method=ml data=total5 convf;
      class smoke studyno visit day1;
      model y = smoke day day2 age weight menage treat basetrt / s;
      random studyno studyno*visit studyno*day1;
      title 'Random Effects Model: 5<day<11';
   run;

ACKNOWLEDGEMENTS
This research was supported in part by KOSEF through the Statistical Research Center for Complex Systems at Seoul National University.

REFERENCES
1. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 1977; 72:320–340.
2. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics 1982; 38:963–974.
3. Jennrich RI, Schluchter MD. Unbalanced repeated-measures models with structured covariance matrices. Biometrics 1986; 42:805–820.
4. Westhoff C, Gentile G, Lee J, Zacur H, Helbig D. Predictors of ovarian steroid secretion in reproductive-age women. American Journal of Epidemiology 1996; 144:381–388.
5. Littell RC, Milliken GA, Stroup WW, Wolfinger RD. SAS System for Mixed Models. SAS Institute Inc.: Cary, NC, 1996.
6. Berek JS. Novak's Gynecology. 12th edn. Williams & Wilkins: Baltimore, 1996; Section II, Chapter 7, Reproductive Physiology, 149–174.
7. Carr BR, Bradshaw KD. Disorders of the ovary and female reproductive tract. In Harrison's Principles of Internal Medicine, Fauci AS, Braunwald E, Isselbacher KJ et al. (eds). McGraw-Hill: New York, 1998; 2097–2115.
8. Rubin DB. Inference and missing data. Biometrika 1976; 63:581–592.
9. Laird NM. Missing data in longitudinal studies. Statistics in Medicine 1988; 7:305–315.
10. Draper NR, Smith H. Applied Regression Analysis. Wiley: New York, 1998.
11. Cressie NAC. Statistics for Spatial Data, revised edn. Wiley: New York, 1993.
12. Searle SR. Matrix Algebra Useful for Statistics. Wiley: New York, 1982.
13. Galecki AT. General class of covariance structures for two or more repeated factors in longitudinal data analysis. Communications in Statistics – Theory and Methods 1994; 23:3105–3119.
14. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73:13–22.
1.4 Likelihood Modelling

TUTORIAL IN BIOSTATISTICS

Likelihood methods for measuring statistical evidence

Jeffrey D. Blume∗,†

Center for Statistical Sciences, Brown University, Providence, RI 02912, U.S.A.
SUMMARY

Focused on interpreting data as statistical evidence, the evidential paradigm uses likelihood ratios to measure the strength of statistical evidence. Under this paradigm, re-examination of accumulating evidence is encouraged because (i) the likelihood ratio, unlike a p-value, is unaffected by the number of examinations and (ii) the probability of observing strong misleading evidence is naturally low, even for study designs that re-examine the data with each new observation. Further, the controllable probabilities of observing misleading and weak evidence provide assurance that the study design is reliable without affecting the strength of statistical evidence in the data. This paper illustrates the ideas and methods associated with using likelihood ratios to measure statistical evidence. It contains a comprehensive introduction to the evidential paradigm, including an overview of how to quantify the probability of observing misleading evidence for various study designs. The University Group Diabetes Program (UGDP), a classic and still controversial multi-centred clinical trial, is used as an illustrative example. Some of the original UGDP results, and subsequent re-analyses, are presented for comparison purposes. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS:
Law of Likelihood; statistical evidence; misleading evidence; bump function; tepee function; University Group Diabetes Program (UGDP)
1. INTRODUCTION

Science looks to statistics for an objective measure of the strength of evidence in a given body of observations. This tutorial describes how and why likelihood ratios measure the strength of statistical evidence and discusses why that evidence is seldom misleading. To illustrate this methodology, a brief re-analysis of a well-known clinical trial is presented. At present, there are two generally accepted approaches to interpreting data – one frequentist and one Bayesian – but neither of these approaches explicitly answers the question: 'What do the data say?' [1–4]. It has been argued elsewhere that the two methodologies answer different questions, specifically 'What should I do?' and 'What should I believe?', respectively [1, 5].
∗ Correspondence to: Jeffrey D. Blume, Center for Statistical Sciences, Brown University Box G-H, Providence, RI 02912, U.S.A.
† E-mail: [email protected]
Another viewpoint is that the frequentist and Bayesian approaches both include more information in their interpretation of the data than just what the data themselves supply: Bayesian inference includes prior information/belief, and frequentist inference includes information about the sample space (for example, the number of planned looks at the data). Likelihood-based methodology, as presented here, essentially splits the difference between the frequentist and Bayesian approaches.

The evidential paradigm (dubbed Evidentialism by Vieland and Hodge [6]) provides a framework for presenting and evaluating likelihood ratios as measures of statistical evidence for one hypothesis over another. Although likelihood ratios figure prominently in both frequentist and Bayesian methodologies, they are neither the focus nor the endpoint of either methodology. The key difference is that, under the evidential paradigm, the measure of the strength of statistical evidence is decoupled from the probability of observing misleading evidence. As a result, the controllable probabilities of observing misleading or weak evidence provide quantities analogous to the type I and type II error rates of hypothesis testing, but do not themselves represent the strength of the evidence in the data.

This tutorial is intended for readers with a moderate amount of statistical training. However, readers who have only taken a semester or year-long graduate-level introductory course in statistics or biostatistics will find the majority of the material accessible. The basic concepts of the paradigm are presented and illustrated in Sections 2 and 3.1. The remaining subsections of Section 3 are more technical and detail how often likelihood ratios are misleading; these subsections may be skipped without loss of continuity. In Section 4, the University Group Diabetes Program (UGDP), a classic and still controversial multi-centred clinical trial, is re-examined with likelihood methods. The final section contains closing comments; computer code is explained in Appendix D, and some UGDP data are provided in Appendix E.
2. THE EVIDENTIAL PARADIGM

2.1. The Law of Likelihood

An experiment or observational study produces observations which, under a probability model, represent statistical evidence. The Law of Likelihood is a starting point for interpreting such evidence [1, 3, 7]. It provides a mechanism for interpreting those observations, in the context of a probability model, as statistical evidence for one hypothesis vis-a-vis another.

The Law of Likelihood. If the first hypothesis, H1, implies that the probability that a random variable X takes the value x is P1(x), while the second hypothesis, H2, implies that the probability is P2(x), then the observation X = x is evidence supporting H1 over H2 if and only if P1(x) > P2(x), and the likelihood ratio, P1(x)/P2(x), measures the strength of that evidence [1, 7].

Here P1(x) = P(x|H1) is the probability of observing x given that H1 is true and P2(x) = P(x|H2) is the probability of observing x given that H2 is true. The ratio of these conditional probabilities, P(x|H1)/P(x|H2), is the likelihood ratio. Both H1 and H2 are 'simple' hypotheses because they specify a single numeric value for the probability of observing x. A 'composite' hypothesis, such as Hc = {H1 or H2}, specifies a set of numeric values for P(x|Hc), in this
case {P(x|H1) or P(x|H2)}. Likelihood ratios involving composite hypotheses are, without additional assumptions, undefined because P(x|Hc) does not specify a single numeric value. The Law's silence on composite hypotheses is discussed further in Section 2.6.

The Law of Likelihood explains that statistical evidence for one simple hypothesis vis-a-vis another is measured by the likelihood ratio. The hypothesis that is better supported is the hypothesis that assigns a higher probability to the observed events; that is, the data better support the hypothesis that more accurately predicted the observed data. If both hypotheses place the same probability on the observed events, then the observations do not support one hypothesis over the other. This concept of statistical evidence is essentially relative in nature; the data represent evidence for one hypothesis in relation to another. The data do not represent evidence for, or against, a single hypothesis. Why would it be wrong to say that the data represent evidence against H1 if P(x|H1) is small? The reason is that P(x|H1), while small, might be the largest among the hypotheses under consideration, making H1 the hypothesis best supported by the data. A more in-depth discussion of this point is given by Royall (reference [1], Section 3.3) and Goodman and Royall [8].

An important implication of the Law of Likelihood is that statistical evidence and uncertainty have different mathematical forms. Uncertainty is measured by probabilities, which describe how often evidence of a particular type will be observed or how an experiment will perform over the so-called 'long run'. A separate mathematical quantity, the likelihood ratio, is required to measure the strength of statistical evidence. This is a critical insight: the measure of the strength of evidence and the frequency with which such evidence occurs are distinct mathematical quantities. R. A. Fisher, who was often cryptic in his writings, clearly understood this: 'In fact, as a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence' (reference [9], p. 93).

2.2. The strength of statistical evidence

The likelihood ratio for H1 versus H2 measures the strength of evidence for the first hypothesis over the second hypothesis. A likelihood ratio equal to one indicates that the evidence does not support either hypothesis over the other; the evidence for H1 vis-a-vis H2 is neutral. If the likelihood ratio is greater than one, the evidence favours H1 over H2, and if the likelihood ratio is less than one, the evidence favours H2 over H1. Likelihood ratios may take any non-negative value, from zero (indicating overwhelming evidence for H2 over H1) to infinity (indicating the reverse). For the purpose of interpreting and communicating the strength of evidence, it is useful to divide the continuous scale of the likelihood ratio into descriptive categories, such as 'weak', 'moderate' and 'strong' evidence. Such a crude categorization allows a quick and easily understood characterization of the evidence for one hypothesis over another. Benchmark values of k = 8 and 32 have been suggested to distinguish between weak, moderate and strong evidence [1, 10]. (Others have proposed similar benchmarks in the literature [3, 11, 12].)
Observations resulting in a likelihood ratio of 8 or more (or less than 1/8) represent at least moderate or fairly strong evidence, and those with a likelihood ratio of 32 or more (or less than 1/32) represent strong evidence. Note that while a likelihood ratio of 8 represents fairly strong evidence, so does a likelihood ratio of 7.5 or 10 (albeit to a lesser or greater degree).
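As a small sketch of this descriptive categorization (written in SAS purely for illustration; the cut-offs are the k = 8 and k = 32 benchmarks just described, and the value lr = 12 is an arbitrary example, not a result from the paper):

   /* Illustrative sketch: categorizing a likelihood ratio with the k = 8    */
   /* and k = 32 benchmarks. lr = 12 is an arbitrary example value.          */
   data lr_category;
      length strength $ 8;
      lr = 12;                            /* likelihood ratio for H1 versus H2    */
      if lr < 1 then lr = 1/lr;           /* express the support on the >= 1 side */
      if lr >= 32 then strength = 'strong';
      else if lr >= 8 then strength = 'moderate';
      else strength = 'weak';
      put lr= strength=;
   run;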
Table I. Properties of the diagnostic test for disease D.

                        Test result
Disease D        Positive     Negative
Present          0.94         0.06
Absent           0.02         0.98
The scale of statistical evidence is not discrete; evidence does not jump from category to category. A likelihood ratio of k-strength represents statistical evidence of the same strength in different problems. This is because a likelihood ratio supporting H1 over H2 by a factor of k represents evidence precisely strong enough to increase the prior probability ratio of H1 to H2 by a factor of k, regardless of what H1 and H2 represent and regardless of their initial probabilities [10]. This is expressed mathematically by

    P(H1|x) / P(H2|x) = k × [P(H1) / P(H2)]
(1)
where k = P(x|H1)/P(x|H2) is the likelihood ratio, P(H1)/P(H2) is the prior probability ratio, and P(H1|x)/P(H2|x) is the posterior probability ratio (that is, the probability ratio after observing the data x). Note that the prior probabilities, P(H1) and P(H2), do not depend on the data and must be predetermined (often subjectively). Sometimes the data, while representing strong statistical evidence for H1 over H2, do not represent strong enough evidence to make H1 appear more probable than H2. For example, suppose that H2 is initially 20 times more likely than H1. An experiment is conducted and the data yield a likelihood ratio of k = 10 supporting H1 over H2. The data therefore represent fairly strong evidence supporting H1 over H2, even though H2 remains more probable than H1 (but now by a factor of two, instead of 20). Equation (1) shows why the same statistical evidence may convince some that H1 is more probable than H2, but not others: everyone has different prior probabilities for the hypotheses, depending on their own experience and beliefs. Hence it is worthwhile reiterating that statistical evidence is not measured by the relative belief that one hypothesis is more likely than another, either before or after the experiment. That is, statistical evidence is not measured by the prior or posterior probability ratio. Rather, statistical evidence is measured by the data-dependent factor that uniformly modifies all beliefs, no matter what their initial magnitude. That factor is the likelihood ratio.

2.3. Example: a diagnostic test

Consider a diagnostic test for disease D that has 94 per cent sensitivity and 98 per cent specificity. The properties of this test are listed in Table I. A test result from this diagnostic test represents statistical evidence about the presence of disease D. Here, a positive test result does not prove that disease is present, but represents statistical evidence supporting HD+ (disease present) over HD− (disease absent) because 0.94 = P(T+|D+) > P(T+|D−) = 0.02. The statistical evidence may be characterized as 'strong' because the likelihood ratio is P(T+|D+)/P(T+|D−) = 0.94/0.02 = 47.
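A short sketch of the corresponding prior-to-posterior update (equation (1) applied to this test) is given below. The SAS form is purely illustrative, and the assumed prevalence of 1.5 per cent is the value used in the numerical discussion that follows.

   /* Illustrative sketch: updating the probability of disease after a       */
   /* positive test, using the likelihood ratio of 47 from Table I and an    */
   /* assumed prevalence P(D+) = 0.015.                                      */
   data bayes_update;
      sens = 0.94;  spec = 0.98;                 /* Table I                   */
      lr_pos = sens / (1 - spec);                /* likelihood ratio = 47     */
      prev = 0.015;                              /* assumed prevalence P(D+)  */
      prior_odds = prev / (1 - prev);
      post_odds  = lr_pos * prior_odds;          /* equation (1)              */
      post_prob  = post_odds / (1 + post_odds);  /* P(D+ | T+), about 0.417   */
      put lr_pos= post_prob=;
   run;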
Whether or not this positive test result presents convincing evidence that the disease is present depends on the disease prevalence in the population, P(D+). In this context, equation (1) becomes

    P(D+|T+) / P(D−|T+) = 47 × [P(D+) / P(D−)]
(2)
where P(D−) = 1 − P(D+) and P(D−|T+) = 1 − P(D+|T+). When the disease prevalence is less than 2 per cent, the prior probability ratio, P(D+)/P(D−) < 0.02/0.98 = 1/49, is very small and the posterior probability ratio, P(D+|T+)/P(D−|T+) < 47/49, is less than one. Therefore, a positive test result for this rare disease does not represent strong enough evidence to convince a physician that the disease is present, even though that positive result represents strong evidence for HD+ over HD−. For example, upon observing a positive test, a prior probability of disease P(D+) = 0.015 increases to 0.417 = P(D+|T+) and the prior probability of no disease, P(D−) = 0.985, is reduced to 0.583 = P(D−|T+). It is still more probable that the disease is absent, but nevertheless wrong to claim that this positive result is evidence that the disease is absent. Why? Because observing a positive result increases the probability that the disease is present from 0.015 to 0.417 and decreases the probability that the disease is absent from 0.985 to 0.583, or equivalently, because the likelihood ratio supports HD+ over HD− by a factor of 47.

An application of Bayes theorem shows that a positive test result always increases the physician's belief that the disease is present if the likelihood ratio supporting HD+ over HD− is greater than one. However, that same positive result will not be convincing if it fails to overwhelm the prior probability that the disease is absent. Even further, any medical action that the physician takes is based not only on his or her beliefs about the presence of disease, but also on the possible benefits and costs (monetary and otherwise) associated with the treatment. If available, a simple, non-invasive, inexpensive treatment may be prescribed as a precautionary measure, even though the physician might not believe that the disease is present. The point is that action, belief and evidence are distinct concepts. It is not enough to answer the question 'What do the data say?' by explaining what one believes or what one should do. Likewise for statistical methodology: the techniques for answering questions such as 'What should I do?' (frequentist) and 'What should I believe?' (Bayesian) cannot be used to answer the question 'What do the data say?' [1, 4, 5, 8]. When speaking of statistical 'inference', it is necessary to distinguish between these three questions, not only because the methodologies required to address them are different, but also because each question is important in its own right.

2.4. Likelihood functions

The diagnostic test example, although very instructive, is somewhat simplistic in that its probability model (given by Table I) consists of only two simple hypotheses, HD+ and HD−, and only a single likelihood ratio needs to be reported. Under more complex probability models, hypotheses often specify a particular value for a parameter of interest (such as a mean, odds ratio or probability), making the total number of simple hypotheses much greater than two. In this case many likelihood ratios should be reported, namely one for each pair of simple hypotheses, and simply listing all of them becomes quite cumbersome. A more concise
2.4. Likelihood functions

The diagnostic test example, although very instructive, is somewhat simplistic in that its probability model (given by Table I) consists of only two simple hypotheses, HD+ and HD−, and only a single likelihood ratio needs to be reported. Under more complex probability models, hypotheses often specify a particular value for a parameter of interest (such as a mean, odds ratio or probability), making the total number of simple hypotheses much greater than two. In this case, many likelihood ratios should be reported, namely one for each pair of simple hypotheses, and simply listing all of them becomes quite cumbersome. A more concise way of reporting the evidence is to report and/or display the likelihood function, which is the mathematical representation of the statistical evidence in the data [13]. In essence, a likelihood function is an expression of the conditional probabilities P(x|H1), P(x|H2), ... as a single function P(x|Hi) for i = 1, 2, .... The notation used to represent a likelihood function is L(Hi|x) or L(Hi), rather than P(x|Hi), to emphasize that the observed data are fixed and the hypothesis Hi varies [14]. A plot of the likelihood function (L(Hi) versus Hi) reveals which hypotheses are better supported by the data because these hypotheses will have a larger P(x|Hi) relative to other hypotheses. To illustrate, consider an experiment that generates evidence about the (unknown) probability of obtaining heads on a toss of a biased coin. The experiment consists of flipping the biased coin n times and recording the number of times it lands heads. Because each toss of the coin is independent, the probability model is binomial and the probability of observing x heads out of n tosses is P(x|θ) = c(n, x) θ^x (1 − θ)^(n−x), where c(n, x) = n!/[(n − x)! x!] is the binomial coefficient and θ is the unknown probability of heads. Here each simple hypothesis specifies a value for the unknown parameter (for example, H0: θ = 0.5). If the coin is tossed 50 times and 14 heads are observed, the likelihood function is given by L(θ) = c(50, 14) θ^14 (1 − θ)^36. The likelihood ratio for H1 versus H2 is given by L(H1)/L(H2) = P(x = 14|H1)/P(x = 14|H2). For example, these data support θ = 0.3 over θ = 0.5 by a factor of L(0.3)/L(0.5) = 0.3^14 (0.7)^36 / [0.5^14 (0.5)^36] = 143. The value of θ that maximizes the likelihood function is called the maximum likelihood estimator (MLE) and will be denoted by θ̂. The MLE deserves mention because it is the best supported hypothesis (L(θ̂)/L(θ) ≥ 1 for all θ). However, not too much emphasis should be placed on the MLE because other hypotheses are (essentially) equally well supported. For example, the MLE in the biased coin example is θ̂ = 14/50 = 0.28, but there is only very weak evidence to support θ = 0.28 over θ = 0.3 because L(0.28)/L(0.3) = 1.05. Likelihood functions are graphed to provide a visual impression of the evidence over the parameter space. For presentation purposes, likelihood functions are standardized by their maximum value (a constant). The scaling constant for the likelihood function can be chosen arbitrarily because only ratios of likelihood functions measure the statistical evidence. In the biased coin example, the standardized likelihood function is

\[
\frac{L(\theta)}{\max_\theta L(\theta)} = \frac{L(\theta)}{L(\hat\theta)} = \frac{\theta^{14}(1-\theta)^{36}}{0.28^{14}(1-0.28)^{36}}  \tag{3}
\]
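As a rough illustration (this is my own sketch in R, the S dialect, and not the author's bin.lik function from Appendix D1), the standardized likelihood in equation (3) can be evaluated and plotted as follows.

x <- 14; n <- 50                               # 14 heads in 50 tosses
theta <- seq(0.001, 0.999, by = 0.001)
lik <- theta^x * (1 - theta)^(n - x)           # binomial coefficient cancels in ratios
lik <- lik / max(lik)                          # standardize by the maximum value
theta[which.max(lik)]                          # MLE, approximately 14/50 = 0.28
(0.3^x * 0.7^(n - x)) / (0.5^x * 0.5^(n - x))  # L(0.3)/L(0.5), approximately 143
plot(theta, lik, type = "l", xlab = "probability of heads", ylab = "", yaxt = "n")
abline(h = c(1/8, 1/32), lty = 2)              # reference lines at the heights used for the
                                               # support intervals discussed in Section 2.5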
Figure 1. Likelihood function for the probability of heads.

Figure 1 displays the standardized likelihood function. The y-axis is not labelled to emphasize that only ratios of points on the likelihood function have evidential meaning. Note that the best supported value, the MLE θ̂ = 0.28, is at the crest of the likelihood function. The usefulness of the likelihood function is that it 'shows' all the likelihood ratios and provides a visual impression of the evidence about θ. Appendix D1 shows how to reproduce this plot in the statistical package S-plus. Likelihood functions play a prominent role in another well-known statistical principle that is often confused with the Law of Likelihood. That principle is the likelihood principle, which preceded the Law of Likelihood by several years and outlined the conditions under which two experiments produce equivalent statistical evidence [13]. The condition is a simple one: two experiments produce equivalent statistical evidence if their likelihood functions are proportional by a fixed constant, or equivalently, if all the likelihood ratios are equal for the
two experiments (see reference [14] for a review). The Law of Likelihood is concerned with how the data should be interpreted as statistical evidence for one hypothesis over another and should not be confused with the likelihood principle.

2.5. Support intervals

Drawn within the likelihood function in Figure 1 are two lines parallel to the x-axis. These lines represent likelihood support intervals, identifying all the parameter values for θ that are consistent with the data at a certain level. The values for θ that are most consistent with the data are those values under the crest of the likelihood function. Hence, a 1/k likelihood support interval (SI) is defined as the set of θs where the standardized likelihood function is greater than 1/k [1]. That is, a 1/k likelihood support interval is

\[
\left\{\text{all } \theta \text{ where } \frac{L(\theta)}{\max_\theta L(\theta)} \ge \frac{1}{k}\right\} = \left\{\text{all } \theta \text{ where } \frac{L(\hat\theta)}{L(\theta)} \le k\right\}  \tag{4}
\]

In words, any θ within the 1/k SI is supported by the data because the best supported hypothesis, θ̂, is only better supported by a factor of k or less. Thus if k = 8 there is only weak evidence supporting the MLE over any θ in the interval. Likelihood support intervals present a concise summarization of the evidence about θ without having to graph the likelihood function or report numerous likelihood ratios. Remaining consistent with the aforementioned benchmarks, k = 8 and 32 are used to construct moderate and strong support intervals. In Figure 1, the 1/8 SI for the probability of heads is 0.165 to 0.419 (the 1/8 SI is the line corresponding to a height of 1/8 on the y-axis). Hypotheses within the interval may be better supported over others within the interval, but the level of support is weak and less than a factor of 8. For those hypothesized values outside
the interval, there always exists another probability, namely θ̂ = 0.28, that is better supported by a factor of more than 8. For these reasons 1/32 SIs are often called 'stronger' support intervals over 1/8 SIs. In the current example, hypotheses suggesting the probability of heads is between 0.137 and 0.462 (the 1/32 SI is the line corresponding to a height of 1/32 on the y-axis) are considered consistent with the data at a stronger level. These support intervals indicate that the coin is biased. Support intervals and exact confidence intervals are indeed related, but this relationship is only mathematical, because the interpretation and construction of these intervals is quite different (see reference [1] for an in-depth discussion). For example, a 1/k SI for the mean of a normal distribution is x̄ ± σ√(2 ln k)/√n, while a (1 − α)100 per cent confidence interval (CI) is x̄ ± z_{α/2} σ/√n. When k = 8, √(2 ln k) = 2.039 and the 1/k SI has the form of a 95.9 per cent confidence interval (a 95 per cent CI corresponds to a 1/6.67 SI). However, the form of the confidence interval depends on more than just the type I error. If three looks at the data are planned, that same 95.9 per cent confidence interval becomes x̄ ± 2.47σ/√n because α is divided by three to maintain the overall type I error. Advantageously, likelihood support intervals depend only on the data, so they keep the same form regardless of the number of looks.

2.6. Composite hypotheses

The Law of Likelihood explains how to measure evidence between two simple hypotheses and is therefore silent with regard to composite hypotheses. This silence is a natural consequence of the Law itself; it avoids the inherent subjectivity in summarizing evidence over a composite space. Consider again the biased coin example from the previous section. How might the evidence for the composite hypothesis Hc: θ < 0.5 versus the simple hypothesis Hs: θ = 0.5 be measured? The Law provides no direction here, because the likelihood ratio, L(θ < 0.5)/L(0.5) = P(x = 14|Hc)/P(x = 14|Hs), depends on the undefined quantity L(θ < 0.5) = P(x = 14|θ < 0.5) = P(x = 14|Hc), which specifies a set of probabilities rather than just one. The problem here is that even if we were willing to specify a rule for choosing a single probability out of the set, the final measure of support will depend on that rule. For example, the support for Hc over Hs, say kc, may be defined by: (i) taking the maximum, kc = max_{θ<0.5} [L(θ)/L(0.5)] = L(0.28)/L(0.5) = 150, which indicates very strong support for Hc over Hs; (ii) taking the minimum, kc = min_{θ<0.5} [L(θ)/L(0.5)] = L(0)/L(0.5) = 0, which indicates overwhelming support for Hs over Hc; or (iii) taking a weighted average, kc = ∫ [L(θ)/L(0.5)] w(θ) dθ, where w(θ) are weights such that ∫ w(θ) dθ = 1, which indicates that the support will be between 0 and 150 depending on the weights. Thus the support for Hs over Hc changes along with the rule. Alternatively, situations involving composite hypotheses can be transformed so that the Law of Likelihood applies. The idea is Bayesian: change the probability model by specifying a prior distribution for the simple hypotheses. The prior provides a mechanism for reducing the composite hypothesis to a simple hypothesis and then the Law of Likelihood applies to the likelihood ratio under the extended probability model. (This new likelihood ratio is called a Bayes factor and Appendix A provides the details.)
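The following sketch (my own R code, not part of the paper) computes the three candidate summaries (i)–(iii) for the coin example; for rule (iii) it uses uniform weights on (0, 0.5), which is one particular Bayes factor and only one of many possible choices.

x <- 14; n <- 50
lik <- function(theta) theta^x * (1 - theta)^(n - x)
grid <- seq(0.001, 0.499, by = 0.001)
max(lik(grid)) / lik(0.5)     # rule (i): the maximum, approximately 150
min(lik(grid)) / lik(0.5)     # rule (ii): the minimum, essentially 0
# rule (iii) with uniform weights w(theta) = 2 on (0, 0.5), i.e. a flat-prior Bayes factor:
integrate(function(t) lik(t) / lik(0.5), 0, 0.5)$value / 0.5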
This Bayesian approach does not provide a measure of evidence where the Law of Likelihood cannot; it simply changes the problem so that the Law of Likelihood is applicable. The drawback of this approach is that there are many ‘Bayesian’ solutions with no objective criteria to choose between them [10].
The point here is not that choosing a method of summarization (that is, a rule) is bad, but that it is unnecessary and often arbitrary. Pairwise reasoning between two simple hypotheses is not alien to statistics and, in fact, is a major component of current statistical thinking. For example, a power curve displays the probability of rejecting a simple null hypothesis at every possible simple alternative hypothesis. (When examining the power of a test, no attempt is made to summarize the power over a composite space, although Bayesians sometimes determine the 'pre-posterior risk', which is similar in some respects [15].) Comprehensively describing the power for rejecting some fixed null hypothesis requires the presentation of the entire power curve. Analogously, the entire spectrum of evidential support is conveyed concisely in a graph of the likelihood function. No further reduction or summarization of the evidence is necessary.

3. MISLEADING EVIDENCE

3.1. Definition and implications

It is possible to observe strong evidence for H2 over H1 when, in actuality, H1 is correct. Misleading evidence is defined as strong evidence in favour of the incorrect hypothesis over the correct hypothesis. After the data have been collected, the strength of the evidence will be determined by the likelihood ratio. Whether the evidence is weak or strong will be clear from the numerical value of the likelihood ratio. However, it remains unknown if the evidence is misleading or not. Recall the diagnostic test example discussed in Section 2.3, whose properties were given in Table I. For this test, a positive test result is always strong evidence supporting HD+ over HD− because P(T+|D+)/P(T+|D−) = 47, but an observed positive test represents misleading evidence if the disease is truly absent. What makes this test a good test is that it rarely produces this type of misleading evidence; the probability of observing a (misleading) likelihood ratio that supports HD+ is P(T+|D−) = 0.02. However, the probability of observing misleading evidence is irrelevant once data have been collected. While this probability provides assurance that the study design is reliable, it does not affect the strength of statistical evidence in observed data (as measured by the likelihood ratio) or the probability that the observed evidence is misleading. In fact, if two different studies produce the same statistical evidence for one hypothesis over another (that is, they have identical likelihood ratios), then the probability that the observed evidence is misleading is exactly the same for both studies. To illustrate, consider another diagnostic test whose properties are given in Table II. The sensitivity of this test is only 47 per cent, much worse than the first test, but its specificity is slightly better at 99 per cent. A positive test result from this test also represents strong statistical evidence supporting HD+ over HD− by a factor of 0.47/0.01 = 47, but if the disease is absent, the second test is only half as likely to produce (misleading) evidence in favour of HD+ (because P(T+|D−) = 0.01). A positive result from either test represents evidence supporting HD+ over HD− by a factor of 47. However, the second test produces misleading positive results half as often. Why, then, is it unimportant to know which test produced the observed positive result? Put another way, is an observed positive result on the second test 'less likely to be misleading' or 'more reliable' in some sense, or does it warrant more 'confidence' [10]? The answer is no.
Table II. Properties of the second diagnostic test for disease D.

                          Disease D
Test result          Present    Absent
Positive             0.47       0.01
Negative             0.53       0.99
An observed positive result on the second test is equivalent, as statistical evidence about the presence or absence of disease, to an observed positive result from the first test. The strength of evidence for HD+ over HD− is identical for both tests because their likelihood ratios are equal, but more to the point, an observed positive result from the first test is not any more likely to be misleading than an observed positive result from the second test. Why? Because an observed positive result is misleading if, and only if, the subject does not have the disease, and P(D−|T+) is the same for both tests! From Bayes theorem we have

\[
P(D- \mid T+) = \frac{P(T+ \mid D-)\,P(D-)}{P(T+ \mid D-)\,P(D-) + P(T+ \mid D+)\,P(D+)}
= \frac{P(D-)}{P(D-) + \dfrac{P(T+ \mid D+)}{P(T+ \mid D-)}\,P(D+)}
= \frac{P(D-)}{P(D-) + 47\,P(D+)}  \tag{5}
\]

for both tests.
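A short R sketch (mine, not the author's) of equation (5) makes the point concrete: because both tests have a likelihood ratio of 47, the probability that an observed positive result is misleading is the same function of prevalence for each.

p.misleading <- function(sens, spec, prev) {       # P(D- | T+), equation (5)
  fp <- 1 - spec                                   # P(T+ | D-)
  fp * (1 - prev) / (fp * (1 - prev) + sens * prev)
}
prev <- seq(0.001, 0.2, by = 0.001)
test1 <- p.misleading(0.94, 0.98, prev)            # first test (Table I)
test2 <- p.misleading(0.47, 0.99, prev)            # second test (Table II)
max(abs(test1 - test2))                            # zero up to rounding error
p.misleading(0.94, 0.98, 0.015)                    # approximately 0.583, as in Section 2.3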
Notice that P(D−|T+) only depends on the likelihood ratio and the prevalence. Hence the propensity for an observed positive result to be misleading does not depend on which test produced the result (see reference [1], Section 4.5 for further discussion). Overall the implication is clear: the probability of observing misleading evidence does not affect the strength of statistical evidence in the data or the probability that observed evidence is misleading. It is critical to distinguish between the probability of observing misleading evidence and the probability that observed evidence is misleading. (Incidentally, confusion over adjusting p-values and type I errors for multiple comparisons or multiple looks at the data is, in part, due to the failure to distinguish between these two probabilities [1, 16].) Even though the probability of observing misleading evidence does not affect our measure of statistical evidence, it plays an important role in the planning of experiments along with the probability of observing weak evidence. Therefore, the remainder of Section 3 is devoted to introducing the probabilities of misleading and weak evidence and establishing that, in general, strong misleading evidence is seldom observed.

3.2. Probabilities of misleading, weak evidence and the type I, II errors of hypothesis testing

Study designs are evaluated in terms of their operational characteristics, for example, the frequency with which a particular study design will produce misleading or weak evidence. These design characteristics provide assurance that the design is reliable, but do not affect the statistical evidence in the observed data. When discussing the probabilities of misleading and weak evidence it is instructive to contrast them with the familiar type I and II errors of hypothesis testing.
The classical Neyman–Pearson hypothesis testing theory and the evidential paradigm both use likelihood ratios as key quantities, but each for a different purpose. Hypothesis testing procedures do not place any interpretation on the numerical value of the likelihood ratio. The extremeness of an observation is measured, not by the magnitude of the likelihood ratio, but by the probability of observing a likelihood ratio that large or larger. It is the tail area, not the likelihood ratio, that is the meaningful quantity in hypothesis testing. Conversely, the Law of Likelihood says that the likelihood ratio itself measures the strength of statistical evidence. These differences are subtle, but critically important. Consider the following common example. Observe X1, X2, ..., Xn i.i.d. f(Xi; θ). Let Ln(θ) = ∏_{i=1}^{n} f(xi; θ) be the likelihood function. Suppose two hypotheses of interest are H0: θ = θ0 and H1: θ = θ1. The type I error rate of hypothesis testing is defined as

\[
P_{\theta_0}\!\left(\frac{L_n(\theta_1)}{L_n(\theta_0)} \ge k_{\alpha,n}\right) = \alpha, \quad \text{where } \alpha \text{ is fixed over } n  \tag{6}
\]

By contrast, the probability of observing misleading evidence for θ1 over θ0 is

\[
P_{\theta_0}\!\left(\frac{L_n(\theta_1)}{L_n(\theta_0)} \ge k\right) = M(n,k), \quad \text{where } k \text{ is fixed over } n  \tag{7}
\]
While the type I error rate and the probability of observing misleading evidence appear to have the same mathematical form, they actually are very different. In hypothesis testing, the type I error rate is fixed at α and the strength of evidence at which the test rejects, k_{α,n}, depends on α and n. This often leads to confusion because two tests which reject at the same α-level represent different strengths of evidence depending on their respective sample sizes [17, 18]. Just the opposite is true for the probability of observing misleading evidence; the strength of the evidence, k, is fixed and the resulting probability, M(n, k), depends on k and the sample size n. To make these differences transparent, consider the case when the data are normally distributed with mean μ and known variance σ². Further suppose that n observations are collected in a standard fixed sample size experiment. The probabilities of observing misleading and weak evidence depend on the sample size and the distance between the two fixed simple hypotheses H1: μ = μ1 and H0: μ = μ0. The strength of evidence for H1 versus H0 provided by the observations x1, ..., xn is given by

\[
\frac{L_n(\mu_1)}{L_n(\mu_0)} = \exp\!\left\{\frac{n(\mu_1-\mu_0)}{\sigma^2}\left(\frac{S_n}{n} - \frac{\mu_1+\mu_0}{2}\right)\right\}  \tag{8}
\]

where Sn = x1 + · · · + xn. When the true mean is μ0, the probability that the observations will support μ1 over μ0 by a factor of k or more is straightforward to calculate [1, 10, 19]:

\[
M(n,k) = P_{\mu_0}\!\left(\frac{L_n(\mu_1)}{L_n(\mu_0)} \ge k\right) = \Phi\!\left(-\frac{\Delta\sqrt{n}}{2\sigma} - \frac{\sigma \ln k}{\Delta\sqrt{n}}\right)  \tag{9}
\]

where Δ = |μ1 − μ0| and Φ(·) is the standard normal cumulative distribution function.
Figure 2. Probabilities of weak and misleading evidence.
On the other hand, the type I error rate in hypothesis testing is held constant as the sample size varies; that is, the above probability M(n, k) always equals α. Thus, for a hypothesis test, we can calculate k_{α,n} such that M(n, k) = α, demonstrating that the strength of the evidence at which the α-sized test rejects H0 varies with n and is given by

\[
k_{\alpha,n} = \exp\!\left\{\frac{z_\alpha \Delta\sqrt{n}}{\sigma} - \frac{n\Delta^2}{2\sigma^2}\right\}  \tag{10}
\]

Further, k_{α,n} is often less than one, indicating that a hypothesis test rejects H0 when the evidence favours H0 over H1. For example, suppose n = 30 observations are collected to test H0: μ = μ0 versus H1: μ = μ0 + Δ with size α = 0.025 when the two hypotheses of interest are one standard deviation apart, that is, Δ = σ. Now the hypothesis test rejects H0 in favour of H1 if the likelihood ratio, Ln(μ1)/Ln(μ0), is greater than k = 1/70. Thus observations which support H0 over H1 by a factor of 70 or less lead to rejection of H0, even though factors greater than 1 indicate support for H0 over H1. The probability of observing weak evidence and the type II error rate also have similar forms, but again represent different quantities. The probability of observing weak evidence favouring either hypothesis when H0 is the correct hypothesis can be expressed as [1]

\[
W(n,k) = P_{\mu_0}\!\left(\frac{1}{k} < \frac{L_n(\mu_1)}{L_n(\mu_0)} < k\right)
= \Phi\!\left(\frac{\Delta\sqrt{n}}{2\sigma} + \frac{\sigma \ln k}{\Delta\sqrt{n}}\right) - \Phi\!\left(\frac{\Delta\sqrt{n}}{2\sigma} - \frac{\sigma \ln k}{\Delta\sqrt{n}}\right)  \tag{11}
\]

Note that this is the same as the corresponding probability when H1 is correct.
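These four quantities are easy to compute; the R sketch below (my own, not the author's code for Figure 2) evaluates equations (9) and (11) together with the error rates of a one-sided test of size α = 0.05, taking the hypotheses to be one standard deviation apart. The choices of α, k = 8 and the range of sample sizes are mine.

k <- 8; alpha <- 0.05; n <- 1:100; c.sd <- 1       # c.sd = Delta / sigma
M <- pnorm(-c.sd * sqrt(n) / 2 - log(k) / (c.sd * sqrt(n)))         # equation (9)
W <- pnorm(c.sd * sqrt(n) / 2 + log(k) / (c.sd * sqrt(n))) -        # equation (11)
     pnorm(c.sd * sqrt(n) / 2 - log(k) / (c.sd * sqrt(n)))
typeI  <- rep(alpha, length(n))                    # held fixed by design
typeII <- pnorm(qnorm(1 - alpha) - c.sd * sqrt(n)) # one-sided z-test of H0 against H1
matplot(n, cbind(M, W, typeI, typeII), type = "l", lty = 1:4, col = 1,
        xlab = "sample size", ylab = "probability")
legend("topright", c("M(n,8)", "W(n,8)", "type I", "type II"), lty = 1:4)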
Figure 2 plots the probability of observing misleading and weak evidence (M(n, 8) and W(n, 8), respectively) as well as the type I and II error rates, as a function of the sample size. Note that all four curves are quite distinct. Hence the type I error probability is not the probability of observing misleading evidence, and the type II error probability is not the probability of observing weak evidence. Note that, as shown in Figure 2, both probabilities of misleading and weak evidence converge to zero as the sample size increases [1]. Finally, it is interesting to note that had Neyman and Pearson chosen to minimize a linear combination of the type I and type II error probabilities, min(α + kβ), instead of fixing the type I error rate and minimizing the type II error rate, the resulting rejection rule would coincide with the Law of Likelihood, that is, reject when the likelihood ratio is greater than k [20].

3.3. The universal bound

The frequency with which misleading evidence is observed is, in general, low. For any fixed sample size and any pair of probability distributions, the probability of observing misleading evidence of strength k or greater is always less than or equal to 1/k [1, 10, 13, 21]. Mathematically, if both f(X) and g(X) are probability density functions and X is distributed according to f(X), then
\[
P_f\!\left(\frac{g(X)}{f(X)} \ge k\right) \le \frac{1}{k}  \tag{12}
\]
This bound has been named the universal bound [1]. A simple application of Markov's inequality will yield the result. The universal bound indicates that, for moderately large k, misleading evidence will not be observed very often. In fact, the probability of observing very strong misleading evidence (that is, evidence supporting an incorrect hypothesis over the correct hypothesis by a factor of 32 or more) cannot exceed 1/32 = 0.031. In the case of multiple looks at the data, the likelihood function is unaffected. Conversely, the probability of observing misleading evidence increases with each look (from two to infinity) but also remains bounded by the universal bound (for a proof see Robbins [22]). If both f(Xn) and g(Xn) are probability density functions and Xn is a vector of n observations with joint density f(Xn), then
\[
P_f\!\left(\frac{g(X_n)}{f(X_n)} \ge k \text{ for any } n = 1, 2, \ldots\right) \le \frac{1}{k}  \tag{13}
\]
The probability of observing misleading evidence remains bounded even though it increases with each look because the amount by which the probability increases converges to zero as the sample size grows [23, 24]. Further, when f(Xn) is the true density, the likelihood ratio itself, g(Xn)/f(Xn), converges to zero as the sample size grows (by the law of large numbers), making it unlikely that it will be greater than k in moderate to large samples. Thus, an experimenter who plans to examine the data with each new observation, stopping only when the data support Hg over Hf, will be eternally frustrated with probability at least 1 − 1/k. Practically, this implies that an investigator searching for evidence to support his favourite hypothesis over the correct hypothesis is likely (with probability greater than 1 − 1/k) never to find such evidence. This property of statistical evidence is an excellent scientific safeguard. It is difficult, deliberately or otherwise, to collect strong misleading evidence.
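The universal bound for sequential looks (equation (13)) is easy to check by simulation. The following R sketch is mine (the sample sizes, number of simulations and seed are arbitrary choices); it generates standard normal data, so that Hf: μ = 0 is true, and monitors the likelihood ratio in favour of Hg: μ = 1 after every observation.

set.seed(1)
k <- 8; n.max <- 200; n.sim <- 2000
ever.misled <- logical(n.sim)
for (s in 1:n.sim) {
  x <- rnorm(n.max)                    # truth: mu = 0, sigma = 1
  log.lr <- cumsum(x - 0.5)            # log{ Ln(mu = 1)/Ln(mu = 0) } after each observation
  ever.misled[s] <- any(log.lr >= log(k))
}
mean(ever.misled)                      # stays below the universal bound 1/k = 0.125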
Figure 3. The bump function (at n = 1, 20) and tepee function.
3.4. In depth: the probability of observing misleading evidence

The universal bound provides a crude measure of the frequency of observing misleading evidence during either a fixed sample size or sequential study. However, the probability of observing misleading evidence rarely, if ever, achieves the universal bound in fixed sample size studies. Here we detail how often misleading evidence is observed when the underlying distribution is normal, and note that these results apply in some generality for non-normal distributions when the sample size is large. Consider generating evidence about the mean of a normally distributed random variable by selecting n observations and examining the likelihood function. That is, X1, X2, ..., Xn are normally distributed random variables with mean μ0 and known variance σ². As discussed in Section 3.2, the probability of observing misleading evidence depends on the sample size and the distance between the two fixed simple hypotheses H1: μ = μ1 and H0: μ = μ0, which was defined as Δ = |μ1 − μ0|. This probability is given by equation (9) and was examined as a function of n holding Δ constant. Consider the reverse situation. Hold the sample size constant at n observations and let the probability be a function of Δ = cσ, where c represents the distance between the two hypotheses measured in standard deviations. The resulting probability of observing misleading evidence is

\[
P_{\mu_0}\!\left(\frac{L_n(\mu_1)}{L_n(\mu_0)} \ge k\right) = \Phi\!\left(-\frac{|c|\sqrt{n}}{2} - \frac{\ln k}{|c|\sqrt{n}}\right)  \tag{14}
\]

This probability has been named the bump function by Royall [10] because of its graphical appearance. Figure 3 graphs this probability with k = 8, for n = 1, 20 (curves 'a' and 'b', respectively). Note that the x-axis is in units of standard deviations. The universal bound of
1/8 = 0.125 is far greater than the maximum probability of observing misleading evidence, which is given by Φ[−√(2 ln k)] = 0.021 [10]. There essentially is no chance of finding misleading evidence for alternatives near zero, that is, for Δ ≈ 0. This happens because μ1 and μ0 specify distributions so similar that only very extreme observations will produce strong evidence supporting μ1 over μ0, and those extreme observations are improbable under μ0. As the difference between the two hypotheses grows, the bump function increases until it reaches its maximum value at Δ = √(2 ln k) standard errors. At this maximum, observations which would support μ1 = μ0 + σ√(2 ln k/n) over μ0 are more likely to occur, because those observations are not too extreme under H0. After reaching its maximum, the bump function decreases until there is essentially no chance of observing strong misleading evidence for μ1. At this point, observations which would support μ1 over μ0 (for large values of Δ) again are very improbable under μ0. For designs that are sequential in nature, a different function represents the probability of observing misleading evidence. Consider the same normal model above, but now our study is designed to continue sampling (possibly forever) until strong evidence for H1 over H0 is obtained. The data are examined after each observation is collected and the study terminates if, and only if, the data support H1: μ = μ1 over H0: μ = μ0 by a factor of k or more. Even though this design is severely biased in favour of H1, the universal bound still applies. The probability that such a biased sequential study design will generate misleading evidence supporting H1 over H0 is approximately
\[
P_{\mu_0}\!\left(\frac{L_n(\mu_1)}{L_n(\mu_0)} \ge k \text{ for any } n = 1, 2, \ldots\right) \cong \frac{\exp\{-\delta\Delta/\sigma\}}{k} = \frac{\exp\{-\delta c\}}{k}  \tag{15}
\]

where the subscript on the probability denotes the true mean, Δ = |μ1 − μ0| = cσ, and δ ≅ 0.583 is a constant (see references [23, 24] for details). This probability has been named the tepee function because of its graphical appearance. Figure 3 graphs this probability, equation (15), when k = 8 (curve 'c'). Under the sequential design, the sample size is allowed to grow until strong evidence for H1 is obtained. Notice that for large alternatives (c > √(2 ln k) = 2.04) the tepee function provides values that are only a little larger than the bump function when n = 1. For distant alternatives (greater than 3 or 4 standard deviations) the bump function shows that there is little chance of a single observation representing strong misleading evidence for μ1 over μ0. Furthermore, the tepee function shows that this probability cannot be substantially increased, even if we continue to sample until such strong misleading evidence is obtained. The probability increases steadily as the distance between the two hypotheses approaches zero. One explanation for this is that, as the sample size increases, the probability at each alternative builds up. That is, there exists some probability for each fixed sample size, and this probability is given by the bump function. Alternatives close to H0 accumulate more of this probability because the bump function effectively 'moves inwards toward zero' as the sample size increases. Thus these 'closer' alternatives accumulate a large amount of probability as the entire bump function moves across them. For the same reason, alternatives 'far' from H0 do not accumulate much more probability than what is initially specified on the first observation. Finally, both the bump function and the tepee function apply in moderate to large samples when the underlying distribution is non-normal and there exists a single parameter of interest [10, 23, 24]. The probability of generating misleading evidence in sequential study designs that restrict the sample size is also investigated in references [23] and [24].
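Both functions are simple to evaluate. The R sketch below is my own and is only meant to mimic the general appearance of Figure 3; it plots equation (14) at n = 1 and n = 20 and equation (15), all with k = 8.

k <- 8
c.sd <- seq(0.01, 5, by = 0.01)                 # distance between hypotheses, in standard deviations
bump  <- function(n, c.sd, k) pnorm(-abs(c.sd) * sqrt(n) / 2 - log(k) / (abs(c.sd) * sqrt(n)))
tepee <- function(c.sd, k) exp(-0.583 * c.sd) / k
plot(c.sd, tepee(c.sd, k), type = "l", lty = 3, ylim = c(0, 0.15),
     xlab = "distance in standard deviations", ylab = "probability")
lines(c.sd, bump(1, c.sd, k), lty = 1)          # curve 'a'
lines(c.sd, bump(20, c.sd, k), lty = 2)         # curve 'b'
abline(h = 1/k, lty = 4)                        # the universal bound, 1/8
legend("topright", c("bump, n = 1", "bump, n = 20", "tepee", "1/k"), lty = 1:4)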
3.5. Nuisance parameters

In multi-parameter models, such as X1, X2, ..., Xn i.i.d. f(Xi; θ, γ), the Law of Likelihood explains how to measure evidence for the simple hypotheses H0: (θ, γ) = (θ0, γ0) vis-a-vis H1: (θ, γ) = (θ1, γ1). It is the likelihood ratio

\[
\frac{L_n(\theta_1, \gamma_1)}{L_n(\theta_0, \gamma_0)} = \prod_{i=1}^{n} \frac{f(X_i; \theta_1, \gamma_1)}{f(X_i; \theta_0, \gamma_0)}  \tag{16}
\]
which measures the support for one probability distribution, identified by (θ1, γ1), versus another, identified by (θ0, γ0). Under this model the hypothesis H0: θ = θ0 is a composite hypothesis of the following form: H0: (θ, γ) = (θ0, γ). It is now readily seen that H0 specifies a family of probability distributions because γ is unknown. For a fixed value of gamma, say γ0 = γ1 = γ, the likelihood ratio in equation (16) measures the relative support for θ1 versus θ0 but still depends on γ. Of course, it would be preferable if the likelihood ratio was free of γ. The problem is that γ cannot, in general, be removed to provide a likelihood function of θ alone [10]. Thus, ad hoc solutions are required. One approach is to use a conditional or marginal likelihood that is free of the nuisance parameter. A common example is the likelihood function for an odds ratio based on the conditional distribution given the margin totals. Another example is the likelihood function for a regression coefficient based on the marginal distribution in a normal linear regression model. Both approaches are appealing because the universal bound, bump and tepee functions all still characterize the probability of observing misleading evidence. Alternatively, one might use a profile or estimated likelihood function. The profile likelihood function maximizes the joint likelihood with respect to the nuisance parameter at each value of the parameter of interest. For fixed θ, the profile likelihood function is defined as max_γ Ln(θ, γ) = Ln(θ, γ̂(θ)) = Lpn(θ). The approach is to use the profile likelihood function for θ as if it were a true likelihood function. Along the same line of reasoning, the estimated likelihood function for θ can be used as if it were a true likelihood function. For fixed θ, the estimated likelihood is defined as Ln(θ, γ̂n) = Len(θ), where γ̂n is any consistent estimator of γ. For example, the overall MLE might be used in place of the nuisance parameter. If both θ and γ are fixed dimensional parameters and if f is a smooth function, then the profile likelihood will behave like a true likelihood in large samples. That is, for a profile likelihood ratio, the limiting probability of observing misleading evidence is given by the bump function and the tepee function [10, 23, 24]. The same is not true for estimated likelihood ratios, where the probability of observing misleading evidence can be much greater [10, 23, 24]. There are important special circumstances when the likelihood ratio (equation (16)) will be free of γ. If the likelihood function factors (Ln(θ, γ) ∝ Ln1(θ) Ln2(γ)), then the parameters θ and γ are said to be orthogonal [25]. Then the likelihood ratio in equation (16) is only a function of θ and the support for θ1 over θ0 is measured by examining that likelihood ratio. This orthogonalization may sometimes be achieved through reparameterization but is most easily attained, if such a reparameterization exists, through the profile likelihood function [26]. Thus, it is not necessary to determine the reparameterization that orthogonalizes the parameters because the profile likelihood automatically provides the answer. Of course, profiling does not always work. Profile likelihoods can behave poorly in small samples and situations where the number of parameters is large compared to the sample
size [10, 27]. However, there are times when they also give very intuitive answers; consider profiling out the variance in a normal model, which yields a profile likelihood for the mean that is proportional to a t-distribution (see Appendix B). Furthermore, there are many unexplored adjustments to profile likelihoods that may increase the profile likelihood's applicability in samples of moderate size [28].
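As a small worked example of profiling (my own R sketch, using simulated data rather than anything from the paper), the variance of a normal model can be profiled out to give a likelihood for the mean; this is the t-shaped profile likelihood referred to in Appendix B.

set.seed(2)
x <- rnorm(20, mean = 1, sd = 2)              # simulated data; any sample would do
mu <- seq(mean(x) - 3, mean(x) + 3, length = 400)
prof <- sapply(mu, function(m) {
  s2 <- mean((x - m)^2)                       # MLE of the variance with mu held fixed
  -0.5 * length(x) * log(s2)                  # profile log-likelihood, up to a constant
})
prof <- exp(prof - max(prof))                 # standardized profile likelihood
plot(mu, prof, type = "l", xlab = "mu", ylab = "")
abline(h = c(1/8, 1/32), lty = 2)             # heights defining the 1/8 and 1/32 SIs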
4. EXAMPLE: THE UNIVERSITY GROUP DIABETES PROGRAM

4.1. Background

The computer code and data required to recreate the likelihood functions presented in this section have been moved to the appendices in order to facilitate ease in reading. The data pertaining to the analysis that follows are included in Appendix E and the S-plus functions for graphing likelihoods are included in Appendix D. It should be noted that the S-plus functions are not specific to this example and may be used whenever data of this type are encountered. The University Group Diabetes Program (UGDP) was a multi-centred, placebo controlled, randomized, double-masked clinical trial with long-term follow-up. Participants were actively recruited from 1961 to 1966 and followed through 1975. The primary objective of the UGDP was to evaluate the effect of hypoglycaemic agents on vascular complications of adult-onset diabetes. Tolbutamide, a drug that lowers blood glucose levels, was one of four treatments evaluated in the UGDP, but was discontinued ahead of schedule in 1969. At the time, medical wisdom conjectured that lowering blood glucose levels would lower the eight-year mortality from cardiovascular disease appreciably. The UGDP was designed to examine this relationship. Two of the inclusion criteria for enrolment into the study were a diagnosis of adult-onset diabetes within the last 12 months of enrolment and a life expectancy greater than five years. Smoking history was not collected. Readings of ECGs, fundus photos and X-rays as well as cause of death coding (decided by a central committee) were all completed in a masked manner. Randomization of participants to treatment was stratified by clinic. Balance of treatment assignments across clinics was forced by blocking within each clinic. Monte Carlo 5 per cent monitoring bounds were used to monitor both overall and cardiovascular mortality over time. No beneficial effect of Tolbutamide with respect to overall mortality was observed (p > 0.05 [29]). However, the cardiovascular mortality monitoring bounds suggested that participants on Tolbutamide were at increased risk over the placebo group for mortality from cardiovascular causes (p < 0.05 [29]). Fixed sample size chi-square tests were used to assess the overall and cardiovascular mortality difference between the Tolbutamide and placebo groups at p = 0.17 and p = 0.005, respectively [29]. The overall mortality results, combined with the excess mortality from cardiovascular causes on the Tolbutamide arm, compelled the UGDP investigators to discontinue the use of Tolbutamide: "The findings of this study indicate that the combination of diet and Tolbutamide therapy is no more effective than diet alone in prolonging life. Moreover, the findings suggest that Tolbutamide and diet may be less effective than diet alone or than diet and insulin at least in so far as cardiovascular mortality is concerned. For this reason, use of Tolbutamide has been discontinued in the UGDP" [29].
When termination of the Tolbutamide arm of the UGDP was disclosed to the medical and scientific community, questions were raised that challenged the findings of the UGDP, its study design, and the competency of the analysis that showed that the increased cardiovascular mortality was attributable to Tolbutamide. Published papers suggested that the difference in cardiovascular death observed between the Tolbutamide and placebo arms was due to some unobserved factor or an imbalance of baseline cardiovascular risk factors between the groups. An extensive list of these criticisms along with a response can be found in references [30] and [31]. (See also references [29, 32–35].) A subsequent analysis of the UGDP mortality results by Cornfield [36] confirmed the UGDP investigators' analysis and showed the Tolbutamide and placebo groups to be adequately balanced on known risk factors.

4.2. Early termination of the Tolbutamide arm

The UGDP investigators observed that participants on Tolbutamide were not benefiting in regard to overall mortality, but were at increased risk for cardiovascular mortality. Because it is unethical to allow a trial to continue for the purpose of demonstrating a harmful effect of treatment, the investigators discontinued the Tolbutamide arm of the trial. The aim of the re-analysis in this section is to measure the statistical evidence about the cardiovascular mortality difference. Specifically, it is of interest to know if the data support the conclusion that the participants on the Tolbutamide arm are at a greater risk of dying from cardiovascular causes than participants on the placebo arm. In the literature, some have suggested this is not the case [33, 34]. Such a finding is especially important if there is no benefit in overall mortality for Tolbutamide over placebo. Each of the twelve UGDP clinics generated data about cardiovascular death on the Tolbutamide and placebo arms. Table E1 gives the raw data – the number of cardiovascular deaths and participants on each treatment arm by clinic. Under a binomial probability model these data represent evidence about the probability of cardiovascular death on the placebo arm, π_p, and the probability of cardiovascular death on the Tolbutamide arm, π_t. (The binomial model was discussed in detail in the biased coin example of Section 2.4.) The dotted lines in Figures 4 and 5 are the clinic-specific likelihood functions, which display the evidence about π_p and π_t, respectively.

Figure 4. Probability of cardiovascular death (placebo) by clinic and overall.

Figure 5. Probability of cardiovascular death (Tolbutamide) by clinic and overall.

Notice that when there are no cardiovascular deaths the resulting likelihood function has an interesting shape: it is maximized at π = 0 and monotonically decreases as π increases. In addition to the clinic-specific likelihood functions, Figures 4 and 5 also display two likelihood functions (the solid lines) that represent the combined evidence about the probability of cardiovascular death. In both figures the thinner of the two likelihood functions represents the evidence under a model that pools all of the data and assumes a homogeneous binomial model for all participants. Under this model (labelled 'a'), the 1/32 SI for π_p is 1.9 to 9.9 per cent and for π_t is 7.5 to 19.7 per cent. The best supported hypothesis for π_t is 13 per cent, which is not in the 1/32 SI for π_p, suggesting that the probability of a cardiovascular death is different between the two treatment arms. A potential problem with the homogeneous model is that it ignores any extra variability that may be present due to clinic-to-clinic differences.
In fact, the clinic-specific likelihood functions appear more dispersed on the Tolbutamide arm. This extra variation can be accounted for under a beta-binomial model, which is essentially an extended binomial model that allows for correlated participant outcomes. Under the beta-binomial model, the likelihood functions
representing the combined evidence about the probability of cardiovascular death are wider, albeit only slightly so on the placebo arm. Under the beta-binomial model (labelled 'b'), the 1/32 SI for π_p is 1.7 to 11.8 per cent and for π_t is 4.9 to 25.6 per cent. There appears to be
considerably more variability on the Tolbutamide arm, as evidenced by the noticeably wider beta-binomial likelihood function. The likelihood function for the beta-binomial model is
\[
L(\pi, \theta) = c(n, x)\,\frac{\prod_{i=0}^{x-1}(\pi + \theta i)\;\prod_{i=0}^{n-x-1}(1 - \pi + \theta i)}{\prod_{i=0}^{n-1}(1 + \theta i)}  \tag{17}
\]

where x is the number of successes out of n trials, c(n, x) is the usual binomial coefficient, π is the probability of success, θ = φ/(1 − φ), and φ > 0 is a nuisance parameter that allows for overdispersion [37]. The average number of successes observed under both models is nπ, but the variability is different: nπ(1 − π)(1 + (n − 1)φ) for the beta-binomial model compared to nπ(1 − π) for the binomial model. The presence of φ is a mixed blessing here. It adds flexibility to the model, but is unknown and must also be estimated. The approach taken here is to eliminate this nuisance parameter by profiling it out of the likelihood function (see Section 3.5). The resulting profile likelihood function for π must be evaluated numerically (the S-plus function for doing this is given in Appendix D3). To compare the two parameters π_t and π_p, it is helpful to examine the evidence for their ratio or difference, or for the odds ratio π_t(1 − π_p)/[π_p(1 − π_t)]. The relative risk, π_t/π_p, is often used to compare two different probabilities and this is the approach taken here. The conditional likelihood function for the relative risk may not be widely known. Appendix C presents a detailed derivation, while Appendix D2 provides the S-plus function for graphing the likelihood function.

Figure 6. Relative risk of death (Tolbutamide versus placebo).

Figure 6 displays the likelihood functions for the relative risk (RR) of death from all causes and death from cardiovascular causes, comparing the Tolbutamide group to the placebo group.
Table III. Relative risk versus hazard rates. Comparison of death risks: Tolbutamide versus placebo.

                            Point estimate    1/8 SI        1/32 SI
Cardiovascular causes
  Relative risk             2.62              1.31–5.76     1.08–7.45
  Hazard rate               2.66              1.29–5.94     1.05–7.76
All causes
  Relative risk             1.43              0.84–2.51     0.72–2.99
  Hazard rate               1.47              0.87–2.60     0.76–3.14
For the all-cause mortality curve, the best supported hypothesis is RR = 1.43, and it is supported over the hypothesis of RR = 1 by a factor of only 2.5. Hypotheses that the all-cause relative risk is approximately one are consistent with the data. For the cardiovascular causes mortality curve, the best supported hypothesis is RR = 2.62. An RR = 2.62 is supported over an RR = 1 by a factor of 58.6, indicating strong evidence that the relative risk is 2.62 compared to hypotheses suggesting that the relative risk is about 1. The 1/8 and 1/32 SIs are (1.31, 5.76) and (1.08, 7.45), respectively. The 1/32 support interval indicates that the data support an estimated increase in the risk of cardiovascular death of 8 per cent to 645 per cent for the Tolbutamide participants over the placebo participants. In terms of cardiovascular death, this evidence suggests a harmful effect of Tolbutamide. (Diamond and Forrester [38] present a Bayesian analysis which concludes that the data do not suggest an increase in cardiovascular risk for Tolbutamide participants. Their results vary, however, depending upon the choice of the prior distribution and therefore emphasize the fact that such an analysis is not based solely on 'what the data say'.) An analysis based on the probability of a cardiovascular death, such as the relative risk or odds ratio, assumes a common probability of death for all participants. This assumption is not satisfied in the UGDP because participants have varying follow-up times. However, the study is noticeably balanced with respect to follow-up time; for example, participants were at risk for a total of 5033.6 quarter-years in the placebo group and 4922.2 in the Tolbutamide group, with similar balance in various subgroups such as clinic [29, 32]. Alternatively, one could use follow-up times by examining the difference in the hazard rates between the placebo and Tolbutamide groups. The likelihood function for the hazard rate λ with D deaths and P person-years at risk is proportional to (λP)^D exp{−λP} [1, 39]. Table III compares the two analysis approaches. Because the UGDP is well balanced between treatments in terms of person-years, both approaches yield very similar results.
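For readers who want to experiment with the beta-binomial calculation, the sketch below is my own R code; it is not the author's betabin.lik function, the clinic-level counts are placeholders rather than the UGDP data of Table E1, and the upper bound used when maximizing over the overdispersion parameter is arbitrary. It evaluates equation (17) on the log scale for each cluster and profiles out the overdispersion parameter numerically.

# Log of the beta-binomial likelihood (17) for one group of n subjects with x events
# (the binomial coefficient is omitted because it cancels from likelihood ratios).
bb.loglik <- function(p, theta, x, n) {
  t1 <- if (x > 0) sum(log(p + theta * (0:(x - 1)))) else 0
  t2 <- if (n - x > 0) sum(log(1 - p + theta * (0:(n - x - 1)))) else 0
  t1 + t2 - sum(log(1 + theta * (0:(n - 1))))
}
# Hypothetical clinic-level counts (placeholders, not UGDP data):
x <- c(2, 0, 1, 3, 1); n <- c(40, 35, 50, 45, 30)
joint.loglik <- function(p, theta)
  sum(mapply(bb.loglik, x = x, n = n, MoreArgs = list(p = p, theta = theta)))
profile.loglik <- function(p)                      # maximize over theta at each fixed p
  optimize(function(th) joint.loglik(p, th),
           lower = 1e-6, upper = 5, maximum = TRUE)$objective
p.grid <- seq(0.005, 0.2, by = 0.005)
prof <- sapply(p.grid, profile.loglik)
prof <- exp(prof - max(prof))                      # standardized profile likelihood
range(p.grid[prof >= 1/32])                        # approximate 1/32 support interval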
4.3. Checking for imbalance

Can the differential risk of dying from cardiovascular causes be explained by baseline imbalances in the compositions of the treatment groups, as some critics of the UGDP have suggested [33–35]? There is always the possibility that this mortality difference is due to some unknown factor, but the process of randomization provides a reason why this is most likely not the case. Thus, to examine the balance of some covariate across treatments, one compares the proportion of participants with that covariate between the two treatment groups. That is, examine the relative risk for possessing that covariate between Tolbutamide and placebo.

Figure 7. Relative risk of baseline cardiovascular risk factors (Tolbutamide versus placebo).

Examination of the likelihood functions for the relative risk of six separate baseline cardiovascular risk factors shows the Tolbutamide and placebo groups to be fairly balanced (Figure 7). The six cardiovascular risk factors are: (a) hypertension; (b) history of digitalis use; (c) history of angina pectoris; (d) significant ECG abnormality; (e) cholesterol level > 300 mg/100 ml; and (f) one or more of the previous cardiovascular risk factors. Hypotheses suggesting that RR = 1 are within the 1/8 support interval for each risk factor. The likelihood curve for cholesterol (labelled 'e') appears to favour relative risks greater than one. However, the 1/8 support interval is (1, 3.24) and the 1/32 support interval is (0.85, 3.91), indicating that the evidence is suggestive at best. The best supported hypothesis of RR = 1.76 is supported over the hypothesis of RR = 1 by a factor of 7.6 (L(1.76)/L(1) = 7.6). Thus, the Tolbutamide and placebo groups appear adequately balanced with respect to baseline cardiovascular risk factors. A similar analysis of other baseline risk factors, such as fasting blood glucose level > 110 mg/100 ml, relative body weight > 1.25, visual acuity (either eye) ≤ 20/200, serum creatinine level > 1.5 mg/100 ml, arterial calcification, age > 53 years (median age), gender, and race, indicates good balance on these factors as well. All of the 1/8 support intervals contain hypotheses suggesting RR = 1, including 1 itself. A logistic regression model can be used to estimate each participant's risk of cardiovascular death. This estimate of risk adjusts for the various factors included in the logistic model. Using this approach, the UGDP study investigators modelled the probability of a cardiovascular death with 17 covariates [29]. These covariates included the 13 baseline cardiovascular, medical and demographic risk factors examined in this paper along with a single indicator variable for each of the four treatment groups in the UGDP (two of which are not considered in this paper). Cornfield used the logistic regression model (setting all four treatment indicators to zero) to tabulate participants by their estimated probability of cardiovascular risk at baseline (reference [36], Table E5).
Figure 8. Relative risk of adjusted probability cardiovascular death (Tolbutamide versus placebo).
Categories are determined by the estimated quintiles of the risk distribution for all 823 participants (using participants from all treatments at baseline). Figure 8 displays the likelihood functions for the relative risk of the estimated baseline risk of cardiovascular death for the Tolbutamide versus placebo group. For each curve except 'd', the hypotheses suggesting that the relative risk is near 1, including 1 itself, are an element of the 1/8 support intervals. Thus the data support that the Tolbutamide and placebo groups are adequately balanced on the proportion of participants who are at those levels of risk. There is moderate evidence that the Tolbutamide group has a higher proportion of participants with an estimated risk between 0.0297 and 0.0665 (4th quintile). For the curve labelled 'd' in Figure 8, hypotheses suggesting the relative risk is near 1, including 1 itself, are an element of the 1/32 SI (0.97, 2.79), but not nearly as many are members of the 1/8 SI (1.08, 2.41). This indicates that the hypothesis RR = 1 is consistent with the data but only at a moderate level. In fact the best supported hypothesis is RR = (54/204)/(34/205) = 1.596, indicating that the Tolbutamide group had roughly 60 per cent more participants with this level of cardiovascular risk than the placebo group. The likelihood ratio for the best supported hypothesis to the hypothesis of RR = 1 is 19.6 (L(1.6)/L(1) = 19.6), indicating moderately strong support for the best supported hypothesis over the hypothesis of equal risk. On this topic Cornfield noted: "The average number of risk factors present among those assigned to placebo was 1.65 as compared with 1.92 among those receiving Tolbutamide, an excess of about one-fourth a risk factor. All in all, the luck of the draw does not seem to have been too bad" [36]. The likelihood analysis supports this conclusion.
Figure 9. Relative risk of death from cardiovascular causes (Tolbutamide versus placebo).
4.4. Analyses within subgroups

A subgroup is a subset of the study population distinguished by a particular characteristic or set of characteristics [30]. Subgroups are subject to the same limitations and retain the same randomization properties as any post-stratification. However, the term subgroup is usually reserved for a homogeneous post-stratification with a small sample size. As the number of strata increases, the number of participants in each stratum decreases. This leads to the so-called subgroup inference problem, because the number of participants in each subgroup often is too small to make reliable inferences regarding that particular subgroup. The subgroup inference problem is often erroneously cited as a consequence of the total number of strata (that is, groups) and can be avoided with careful planning, so that large enough sample sizes are achieved in relevant subgroups. Likelihood based methods permit the construction of any number of subgroups and allow for the evaluation of their evidence without adjustment. (Note, however, that their scientific validity still depends on the characteristics used to define the subgroup.) Even small subgroups provide statistical evidence. The width of the likelihood function, which depends upon the sample size in the subgroup, will indicate the proper amount of variability in the evidence. The law of large numbers implies that, with a large enough sample size in each stratum, inferences about an unknown strata-specific parameter will be correct regardless of the total number of strata. Therefore, in subgroup analyses, concern should be for the size of the subgroup and not the total number of subgroups. As an example, examine the data in the four different race (white versus non-white) by gender subgroups. Figure 9 presents the evidence about the relative risk for cardiovascular death in these four subgroups and is essentially a picture of the gender by race by treatment interaction (see Table E3, Appendix E). There is very strong evidence for increased risk
of cardiovascular mortality for white women on Tolbutamide over white women on placebo (RR = 9.17; L(9.17)/L(1) = 80; 1/32 SI (1.23, 811)). The data suggest that white women on Tolbutamide have over nine times the risk of cardiovascular death of white women on placebo. There is only moderate evidence, however, for increased risk for overall mortality for white women (RR = 3.57; L(3.57)/L(1) = 16.9; 1/32 SI (0.9, 28.71)). In the other three subgroups, participants on Tolbutamide do not appear to be at any increased risk of cardiovascular death compared to participants on placebo. This result appears to be new. Kilo et al. [40, 41] noted that in the placebo group the risk ratio of cardiovascular death for males to females was 5.3:1 (11.1 per cent for males versus 2.1 per cent for females). They argue that this 'anomalously' low cardiovascular death risk for women is uncharacteristic for female diabetics and, as a consequence, causes the corresponding Tolbutamide risk to appear high in comparison. The likelihood analysis does not support this argument because only the white women are at increased risk. Furthermore, the risk ratio of cardiovascular death in the placebo group for white males to non-white males to white females to non-white females is 4.63 : 5.43 : 0.78 : 1 (10.64 per cent : 12.5 per cent : 1.8 per cent : 2.3 per cent). Thus, the 'anomalously' low cardiovascular death risk for women cannot be the culprit because the non-white women, who have a similarly low cardiovascular death risk when compared to the males, are not at increased risk for cardiovascular death in the Tolbutamide group (1/32 SI (0.28, 40.74)). From a medical perspective, such a gender by race by treatment interaction hardly seems plausible. Yet the evidence suggests that such an interaction exists. While the evidence indicates such a difference, the data may not represent sufficiently strong evidence to modify people's beliefs about the existence of such a gender by race by treatment interaction. Regardless of one's prior disposition, however, this evidence indicates that a dangerous trend existed in the UGDP participant population. The Data Safety Monitoring Committee (DSMC) would have had to monitor the situation carefully, had it been allowed to continue. An investigation as to how white women in this trial differed from other participants might have proved useful. One interesting possibility might have been for the DSMC to have recommended the discontinuation of Tolbutamide for white women only. Overall, this analysis of the UGDP data indicates an increased risk of cardiovascular death for participants on Tolbutamide, while demonstrating no imbalance beyond that expected by randomization on relevant cardiovascular risk factors between the Tolbutamide and placebo groups. Furthermore, the excess cardiovascular risk seems only to apply to white women.
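The placebo-arm comparison quoted above is easy to verify; these two lines of R (mine, using only the percentages given in the text) reproduce the 4.63 : 5.43 : 0.78 : 1 ratios.

risk <- c(white.male = 10.64, nonwhite.male = 12.5, white.female = 1.8, nonwhite.female = 2.3)
round(risk / risk["nonwhite.female"], 2)      # 4.63, 5.43, 0.78, 1.00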
5. COMMENTS

The Law of Likelihood explains how to objectively measure the strength of evidence for one hypothesis over another. Its dependence on two simple hypotheses is not a flaw, but a strength. Consequently, arbitrary methods of summarizing evidence over a composite space are avoided. Further, the probability of observing misleading evidence is naturally controlled, even with multiple looks at the data. This makes the probability of generating weak evidence the relevant quantity in sample size calculations. More importantly, the measure of the strength of evidence is mathematically separated from the probability of observing such evidence, allowing each quantity to be dealt with in its distinct role.
A strong case can be argued for monitoring clinical trials with likelihood functions. Constant monitoring of statistical evidence does not diminish its strength because the likelihood is unaffected by the number of examinations. The data speak for themselves, through the likelihood function, independently of the probability of observing misleading evidence. The probability of observing misleading evidence of k-strength or greater remains bounded by 1/k for any number of looks under very general conditions. Furthermore, in common situations the probability is often much less than the universal bound. However, because this probability is irrelevant in the analysis stage, the Data Safety Monitoring Committee may constantly monitor a clinical trial by repeatedly looking at the likelihood function without fear of causing or observing a spurious association.
APPENDIX A: TRANSFORMING COMPOSITE HYPOTHESES INTO SIMPLE ONES

A Bayesian analysis essentially transforms the composite hypotheses into simple ones and then applies the Law of Likelihood to the resulting simple hypotheses. This is done by changing the probability model. To illustrate, suppose our probability model is X ~ f(X|θ), where θ ∈ Θ is a real-valued parameter of interest (in the biased coin example f(X|θ) was a binomial distribution and Θ = [0, 1]). Identify Hs: θ = θs as a simple hypothesis and Hc: θ ∈ Θc ⊆ Θ as a composite hypothesis. Upon observing X = x, the Law of Likelihood explains that the measure of evidence for Hc over Hs is the likelihood ratio given by Pc(x)/Ps(x). However, this is the problem, because Pc(x) = f(X = x|θ ∈ Θc) does not exist (the probability model f is indexed by a fixed parameter θ, not a range of θs). Assuming a prior or weighting distribution g(θ) and using Pc(x) = ∫_{Θc} f(x|θ)g(θ) dθ is equivalent to applying the Law of Likelihood under a different probability model. Initially X ~ f(X|θ), but after specifying g(θ) the probability model is changed to X ~ h(X) = ∫ f(X|θ)g(θ) dθ. Consequently, Pc(X) is a simple hypothesis and the ratio Pc(x)/Ps(x) measures the evidence for Hc over Hs under the model X ~ h(X). (Note that Pc(x)/Ps(x) is called a Bayes factor.) Once again, the purported measure of statistical evidence depends upon the unspecified weighting distribution g(θ).
APPENDIX B: PROFILE LIKELIHOOD FOR A MEAN Suppose X1 ; X2 ; : : : ; Xn ∼ i:i:d: N(; 2 ). The prole likelihood function for is max2 L(; 2 )= L(; !2 ()) where !2 ()= ni=1 (Xi − )2 =n. The following algebra shows that the prole likelihood is proportional to the likelihood function for a mean under the Student’s t distribution.
1
1 L(; !2 ()) ∝ n exp − 2 ! 2 [ ()] 2
n
2 i=1 (Xi − ) !2 ()
" n # n (X − )2 − 2 i i=1 = exp − n
2
n
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
n i=1 (Xi
∝
− X )2
n
− n 2
+ (X − )
= 1 +
= 1+
n
2
− n 2
n(X − )2 (n − 1)
235
i=1 (Xi −X )
2
n−1
− +1 2 t2
√ where =n − 1; s2 = ni=1 (Xi − X )2 =(n − 1) and t = n(X − )=s. Royall (reference [1], pp. 133–134) suggests modifying the exponent of the above prole likelihood by replacing n with n − 1, so that for small sample sizes, the probability of observing misleading evidence remains below the maximum of the bump function.
APPENDIX C: CONDITIONAL LIKELIHOOD FOR THE RELATIVE RISK This derivation follows Royall (reference [1], p. 165). The likelihood function for the relative risk is derived in the same fashion as the conditional likelihood function for the odds ratio (that is, we condition on the margin totals). However, the underlying distributions of events observed in the two groups are assumed to be negative binomial rather than binomial. The unconditional joint likelihoods are the same regardless of the underlying assumption. Furthermore, the likelihood principle implies that the choice of the underlying distribution (in this case binomial or negative binomial) is immaterial because the likelihood ratios are the same for both distributions [1, 14]. To rid ourselves of the nuisance parameter in the unconditional joint likelihood, we condition on the total number of successes which yields the conditional likelihood function for the relative risk. Note, however, that the conditional joint likelihoods are dierent (one is a function of the odds ratio and the other is a function of the relative risk) depending on the parameter of interest. Consider the following 2 × 2 table: Events Group 1 Group 2
Successes
Failures
Total number of observations
a c
b d
m n
Of interest is the RR =(a=m)=(c=n), where m is the total number of i.i.d. Bernoulli(1 ) observations required to obtain b failures and n is the total number of i.i.d. Bernoulli(2 ) observations required to obtain d failures. Let =1 =2 be the relative risk for a successful event in group 1 versus group 2.
236
J. D. BLUME
Under the above set-up we have that, M ∼ negative binomial(b; 1 − 1 ) and N ∼ negative binomial(d; 1 − 2 ) where
m−1 n−1 b m−b (1 − 1 ) 1 and P(N =n)= (1 − 2 )d 2n−d P(M =m)= b−1 d−1 Then the likelihood function for the relative risk, , is −1
m+n−d m+n−j−1 j−1 j−m L() ∝ d−1 b−1 j=b Proof
L() = P(M =m|M + N =m + n)=
P(M =m; M + N =m + n) P(M + N =m + n)
P(M =m)P(N =n) P(M + N =m + n|M =j)P(M =j) j
m−1 n−1 b m−b (1 − 1 ) 1 (1 − 2 )d 2n−d b−1 d−1
= m+n−d m + n − j − 1 j−1 m+n−j−d d (1 − 2 ) 2 (1 − 1 )b 1j−b j=b d−1 b−1 −1
m+n−d m+n−j−1 j−1 j−m ∝ d−1 b−1 j=b
=
QED APPENDIX D: COMPUTER CODE The likelihood functions presented here were drawn and analysed in the statistical software package S-plus. S-plus can be somewhat cumbersome for programming, so I have included functions that will automatically draw the likelihood functions and give their MLE, 1=8 and 1=32 support intervals. To import these functions, copy and paste them into the command window at the command prompt ‘¿’. These functions and others are available in electronic form from the author. I have included three functions: bin.lik for binomial likelihoods, betabin.lik for prole betabinomial likelihoods, and rr.lik for conditional likelihoods for the relative risk. All the functions have some common input variables: hilim is the upper limit of the X-axis; lolim is the lower limit of the X-axis; acc is a multiplier that can increase or decrease the number of points used to evaluate the likelihood; main1 is the title of the graph; and like.only suppresses the command for a new plot window and draws the likelihood on the existing plot. An example of how to call each S-plus function is given along with the function.
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
237
D1. Binomial Likelihood Figure 1 displays the likelihood function for the probability of success, when 14 successes out of 50 trials were observed. To produce Figure 1 type bin.lik(x=14,n=50)
at the command prompt. The function bin.lik is given below. bin.lik_function(x,n,like.only=F,acc = 1, lolim=0,hilim=1, main1 = "Likelihoods: Binomial Model") { p <- seq(lolim, hilim, length = 1000 * acc) like <- exp(x*log(p) + (n-x) * log(1 - p)) like <- like/max(like) if (like.only==F) { graphsheet(orientation = "portrait", height = 6, width = 7.5, pointsize = 12) plot(p, like, type = "n", xlab = "Probability", ylab = " ", main = main1)} p1 <- p[like >= 1/8] p2 <- p[like >= 1/32] i1 <- rep(1/8, length(p1)) i2 <- rep(1/32, length(p2)) lines(p,like,type="l") lines(p1, i1, type = "l") lines(p2, i2, type = "l") if (like.only==F) { whr <- if(p[like == max(like)] <= (lolim + hilim )/2) quantile(p, 0.8) else quantile(p, 0.2) text(whr, 0.95, paste("Max at", signif(c(p[like == max(like)]), digits = 2)),cex=.8) text(whr, 0.91, paste("1/8 SI (", round(min(p1), digits = 2), ",", round(max(p1), digits = 2), ")"),cex=.8) text(whr, 0.87, paste("1/32 SI (", round(min(p2), digits = 2), ",", round(max(p2), digits = 2), ")"),cex=.8)} }
D2. Conditional Likelihood for Relative Risk Figure 6 displays the likelihood function for the relative risk of cardiovascular death. Here there were 26 deaths out of 204 participants on in the treatment group and 10 deaths out of 205 participants in the placebo group (see Table E2 in Appendix E). To produce Figure 6 type rr.lik(x=26,m=204,y=10,n=205,hilim=8)
238
J. D. BLUME
the command prompt. The likelihood for the relative risk of all causes can be added by typing rr.lik(x=30,m=204,y=21,n=205,like.only=T,hilim=8)
at the next command prompt. The function rr.lik is given below. rr.lik_function(x,m,y,n,lolim=0,hilim=6, like.only=F,acc=1, main1="Conditional Likelihood: Relative Risk (Probability Ratio)"){ ## ## ## ##
conditional likelihood on total x successes out of m trials and lolim is the lower limit of the hilim is the lower limit of the
number of successes; negative Binomial y successes out of n trials relative risk axis relative risk axis
if ((m - x) * (n - y) == 0) "Conditional model requires at least one failure in each group." else { z_seq(lolim, hilim, len = 1000 * acc) like_matrix(1,nrow=length(z), ncol=1) j_seq(m-x,m+y,1) ## summation index Mx_matrix(z, nrow = length(j), ncol = length(z), byrow=T) Mx_Mx^(j - m) co_choose(m+n-j-l,(n-y)-1)*choose(j-1,(m-x)-1) like_1/(t(Mx) %*% co) if(y== 0) like_like * exp(co[length(j)]) else like_like/max(like) rrhat_z[like == max(like)] rrhat_round(rrhat, digits = 2) if(like[length(z)] == max(like)) rrhat_NA if(like [1] == max(like)) rrhat_NA L1_round(min(max(like)/like[abs(z - 1) == min(abs(z - 1))],1000000), digits = 1) z2_z[like >= 1/8] z3_z[like >= 1/32] i2_rep(1/8, length(z2)) i3_rep(1/32, length(z3)) if (like.only==F) {graphsheet(orientation = "portrait", height = 6, width = 7.5, pointsize = 12) plot(z, like, type = "n", xlab = "Relative Risk ", ylab ="Likelihood", main=mainl) }
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
239
lines(z, like, lty=1,col=1) if (like.only==F) { lines(z2, i2, type = "l",col=1) lines(z3, i3, type = "l",col=1) text(z[.85*1000*acc], 0.94, paste("Max at ", rrhat),cex=0.8) text(z[.85*1000*acc], 0.9, paste("1/8 SI (", if(min(z2) == z[1]) "NA" else round(min(z2), digits = 2), ",", if(max(z2) == z[1000]) "NA" else round(max(z2), digits = 2), ")"),cex=0.8) text(z[.85*1000*acc], 0.86, paste("1/32 SI (", if(min(z3) == z[1]) "NA" else round(min(z3), digits =2), ",", if(max(z3) == z[1000]) "NA" else round(max(z3), digits = 2), ")"),cex=0.8) text(z[.85*1000*acc], 0.81, paste("L(", } }
rrhat, ")/L(1)=", L1),cex=0.8)}
D3. Beta-Binomial Likelihood Figures 4 and 5 display the likelihood functions for the probability of cardiovascular death under placebo and Tolbutamide, respectively. This function is slightly more complicated than the previous two because it takes a data matrix as input. The data matrix has two columns: in the rst column is the number of successes and the second column is the total number of trials. The number of rows equals the number of dierent experiments, in this case centres. For the raw data see Table E1 in Appendix E. To produce a data matrix for Figure 4 dene a vector for the number of deaths at each clinic for the placebo group as cvd.pl_c(1, 2, 3, 1, 2, 0, 0, 0, 1, 0, 0, 0)
and the same for the number of participants at each clinic for the placebo group cvn.p1_c(15, 22, 22, 23, 23, 19, 24, 13, 11, 10, 12, 11)
Be sure to keep clinics in the same position across the vectors. To produce Figure 6 type betabin.lik(cbind(cvd.pl,cvn.pl),simple.model=T,twostage.model=T, hilim=0.5,acc=10)
The function betabin.lik is given below. betabin.lik_function(x, lolim = 0, hilim = 1, simple.model = F, twostage.model = F, separate.lines = T, main1 = "Likelihoods: Binomial-Beta Model\n", acc = 1) { # FUNCTION: betabin.lik # Original author: Richard Royall -- June 24/96 #
240 # # # # # # # # # # # # #
J. D. BLUME
Data matrix x : k successes (col 1) in m trials (col 2) Binomial-beta model: Xi is binom(mi,pi). p’s iid B(a,b) E(X/k) =a/(a+b) (par of interest), gamma(nuisance)=a+b Plots LFs for E(success proportion) under various models: (1) Each proportion Binomial with its own probability (pi) (separate.lines=T) (2) Each proportion Binomial with common probability (pi=p) (simple.model=T) (3) PROFILE LF under two-stage (Binomial-beta) model (with 1/8 and 1/32 support intervals) (twostage.model=T) p <- seq(lolim, hilim, length = 100 * acc) pl <- seq(lolim, hilim, length = 1000) LM <- NULL for(i in 1:nrow(x)) { lik <- dbinom(x[i, 1],x[i, 2], pl) lik <- lik/max(lik) LM <- cbind(LM, lik) } matplot(pl, LM, type = "n", xlab = "Probability", ylab = "Likelihood", main = main1, cex=1,yaxt="n") axis(side=2,at=seq(0,1,.2),cex=l,srt=90,adj=0.5,mgp=0) if(twostage.model == T) { pb <- p[2:(length(p) - 1)] llikpro <- rep(0, length(pb)) for(i in 1:length(pb)) { g <- 100000 alpha <- pb[i] * g beta <- (1 - pb[i]) * g llikproi <- sum(lgamma(alpha + x[, 1]) + lgamma(beta + x[, 2] - x[, 1]) lgamma(alpha + beta + x[, 2]) lgamma(alpha) - lgamma(beta) + lgamma( alpha + beta)) repeat { newg <- g/10 newalpha <- pb[i] * newg newbeta <- (1 - pb[i]) * newg newllikproi <- sum(lgamma(newalpha + x[, 1]) + lgamma(newbeta + x[, 2] x[, 1]) - lgamma(newalpha + newbeta + x[, 2]) - lgamma(newalpha) - lgamma( newbeta) + lgamma(newalpha + newbeta))
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
if(newllikproi > llikproi) { llikproi <- newllikproi g <- newg } else { llikproi <- newllikproi g <- newg break } } repeat { newg <- g * 10^0.2 newalpha <- pb[i] * newg newbeta <- (1 - pb[i]) * newg newllikproi <- sum(lgamma(newalpha + x[, 1]) + lgamma(newbeta + x[, 2] x[, 1]) - lgamma(newalpha + newbeta + x[, 2]) - lgamma(newalpha) - lgamma( newbeta) + lgamma(newalpha + newbeta)) if(newllikproi > llikproi) { llikproi <- newllikproi g <- newg } else { llikproi <- newllikproi g <- newg break } } repeat { newg <- g * 10^(-0.04) newalpha <- pb[i] * newg newbeta <- (1 - pb[i]) * newg newllikproi <- sum(lgamma(newalpha + x[, 1]) + lgamma(newbeta + x[, 2] x[, 1]) - lgamma(newalpha + newbeta + x[, 2]) - lgamma(newalpha) - lgamma( newbeta) + lgamma(newalpha + newbeta)) if(newllikproi > llikproi) { llikproi <- newllikproi g <- newg } else break }
241
242
J. D. BLUME
llikpro[i] <- llikproi } ###(END OF "for i")### likpro <- exp(llikpro - max(llikpro)) lines(pb, likpro, lty =1) p1 <- pb[likpro >= 1/8] p2 <- pb[likpro >= 1/32] i1 <- rep(1/8, length(p1)) i2 <- rep(1/32, length(p2)) lines(p1, i1, type = "l") lines(p2, i2, type = "l") whr <- if(pb[likpro == max(likpro)] <= (lolim + hilim )/2) quantile(p, 0.8) else quantile(p, 0.2) text(whr-0.05, 0.95, paste("Beta-Binomial Model"),cex=0.8,adj=-1) text(whr-0.05, 0.91, paste ("Max at", signif(c(pb[likpro == max(likpro)]), digits = 2)),cex=0.8,adj=-1) text(whr-0.05, 0.87, paste("1/8 SI (", round(min(p1), digits = 3), ",", round(max(p1), digits = 3), ")"), cex=0.8,adj=-1) text(whr-0.05, 0.83, paste("1/32 SI (", round(min(p2), digits = 3), ",", round(max(p2), digits = 3), ")"),cex=0.8,adj=-1) } ###(END OF "if twostage.model==T")### if (simple.model == T) { likc <- dbinom(sum(x[, 1]),sum(x[, 2]), pl) likc <- likc/max(likc) lines(pl, likc, lty = 1) p1 <- pl[likc >= 1/8] p2 <- pl[likc >= 1/32] i1 <- rep(1/8, length(p1)) i2 <- rep(1/32, length(p2)) lines(p1, i1, type = "l") lines(p2, i2, type = "l") if(twostage.model==F) { whr <- if(pl[likc == max(likc)] <= (lolim + hilim )/2) quantile(p, 0.8) else quantile(p, 0.2)} text(whr-0.05, 0.75, paste("Binomial Model"),cex=0.8,adj=-1) text(whr-0.05, 0.71, paste("Max at", signif(c(pl[likc == max(likc)]), digits = 2)),cex=0.8,adj=-1) text(whr-0.05, 0.67, paste("1/8 SI (", round(min(p1), digits = 3), ",", round(max(p1), digits = 3), ")"),cex=0.8,adj=-1) text(whr-0.05, 0.63, paste("1/32 SI (", round(min(p2), digits = 3), ",", round(max(p2), digits = 3),
243
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
")"),cex=0.8,adj=-1) } if(separate.lines == T) { matlines(pl, LM, type = "l", lty = 2, col = 1, cex = 1) } }
APPENDIX E: DATA The data assembled in the following tables (Tables E1–E5) can be found in references [29, 32, 36] Table E1. Distribution of cardiovascular deaths by clinic. Clinic
Tolbutamide
Placebo
Deaths
Total participants
Deaths
Total participants
1 7 1 6 2 3 2 4 0 0 0 0 26
22 22 18 24 20 22 11 17 12 11 14 11 204
0 2 0 2 3 1 0 1 1 0 0 0 10
24 23 19 22 22 23 13 15 11 10 12 11 205
Baltimore Cincinnati Cleveland Minneapolis New York Williamson Birmingham Boston Chicago St Louis San Juan Seattle All clinics
Table E2. Distribution of deaths. Tolbutamide
All causes Cardiovascular causes
Placebo
Deaths
Total participants
Deaths
Total participants
30 26
204 204
21 10
205 205
Table E3. Distribution of deaths by gender and race. Tolbutamide Deaths (by cause) All Cardiovascular White males Non-white males White females Non-white females
9 4 13 4
7 4 11 4
Placebo Total participants 40 23 68 73
Deaths (by cause) All Cardiovascular 8 5 3 5
5 2 1 2
Total participants 47 16 56 86
244
J. D. BLUME
Table E4. Distribution of baseline cardiovascular risk factors. Baseline cardiovascular risk factor
Tolbutamide
Placebo
Number with risk factor
Total participants
Number with risk factor
Total participants
60 15 14 8 30 92
199 198 201 201 199 192
74 9 10 6 17 88
201 202 202 199 198 186
(a) Hypertension (b) History of digitalis use (c) History of angina pectoris (d) Signicant ECG abnormality (e) Cholesterol level ¿ 300 mg=100 ml (f) One or more of the above risk factors
Table E5. Distribution of the probability of cardiovascular death based on a multiple logistic regression. Probability of cardiovascular death
(a) (b) (c) (d) (e)
¡ 0:0065 0.0065–0.0140 0.0141–0.0295 0.0297–0.0665 ¿ 0:0673
Tolbutamide
Placebo
Number in quintile
Total participants
Number in quintile
Total participants
36 38 35 54 41
204 204 204 204 204
49 37 44 34 41
205 205 205 205 205
ACKNOWLEDGEMENTS
The author wishes to thank Marie Diener-West, Paul Rathouz, Richard Royall and two referees for their insightful comments. REFERENCES 1. Royall RM. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall: London, 1997. 2. Goodman SN. Toward evidence-based medical statistics: the p-value fallacy. Annals of Internal Medicine 1999; 130(12):995 – 1004. 3. Edwards AWF. Likelihood. Cambridge University Press: London, 1971. 4. Goodman SN. P-values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology 1993; 137(5):485 – 496. 5. Royall RM. The likelihood paradigm for statistical evidence (with discussion). In The Nature of Statistical Evidence. University of Chicago Press: Chicago, IL (in press). 6. Vieland VJ, Hodge SE. Review of R. Royall (1997) statistical evidence: a likelihood paradigm. Annals of Human Genetics 1998; 63:283 – 289. 7. Hacking I. Logic of Statistical Inference. Cambridge University Press: New York, 1965. 8. Goodman SN, Royall RM. Evidence and scientic research. American Journal of Public Health 1988; 78(12):1568 – 1574. 9. Fisher RA. Statistical Methods and Scientic Inference. 2nd edn. Hafner: New York, 1959. 10. Royall RM. On the probability of observing misleading statistical evidence (with discussion). Journal of the American Statistical Association 2000; 95(451):760 – 767. 11. Jereys H. Theory of Probability. 3rd edn. Oxford University Press: Oxford, 1961. 12. Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association 1995; 90:773 – 795. 13. Birnbaum A. On the foundations of statistical inference (with discussion). Journal of the American Statistical Association 1962; 53:259 – 326. 14. Berger JO, Wolpert RL. The Likelihood Principle. Institute of Mathematical Statistics: Hayward, California, 1988.
LIKELIHOOD METHODS FOR MEASURING STATISTICAL EVIDENCE
245
15. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Controlled Clinical Trials 1986; 7:8 – 17. 16. Goodman SN. Multiple comparisons, explained. American Journal of Epidemiology 1998; 147:807 – 812. 17. Royall RM. The eect of sample size on the meaning of signicant tests. American Statistician 1986; 40: 313 – 315. 18. Peto R. Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPhearson K, Peto J, Smith PG. Design and analysis of randomized clinical trials requiring prolonged observation of each patient: Introduction and design. British Journal of Cancer 1976; 34:585 – 612. 19. Pratt JW. ‘Decisions’ as statistical evidence and Birnbaum’s condence concept. Synthese 1977; 36:59 – 69. 20. Corneld J. Sequential trials, sequential analysis, and the likelihood principle. American Statistician 1966; 29(2):18 – 33. 21. Smith CAB. The detection of linkage in human genetics. Journal of the Royal Statistical Society, Series B 1953; 15:153 – 192. 22. Robbins H. Statistical methods related to the law of the iterated logarithm. Annals of Mathematical Statistics 1970; 41:1397 – 1409. 23. Blume JD. On the probability of observing misleading evidence in sequential trials, PhD dissertation, Johns Hopkins University School of Public Health, 1999. 24. Blume JD. On the probability of observing misleading evidence in sequential trials. 2001 (submitted) Available at http://alexander.stat.Brown.edu/-Jblume/slides. 25. Anscombe FJ. Normal likelihood functions. Annals of the Institute of Statistical Mathematics 1964; 26: 1 – 19. 26. Tsou TS, Royall RM. Robust likelihoods. Journal of the American Statistical Association 1995; 90:316 – 320. 27. Wolpert RL. Discussion of On the Probability of Observing Misleading Statistical Evidence by RM Royall. Journal of the American Statistical Association 2000; 95(451):760 – 767. 28. Barndorf-Nielsen OE, Cox DR. Inference and Asymptotics. Chapman and Hall: New York, 1994. 29. The University Group Diabetes Program. A study of the eects hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Diabetes 1970: 19 (suppl. 2):747 – 839. 30. Meinert CL, Tonascia S. Clinical Trials: Design; Conduct; and Analysis. Oxford University Press: New York, 1986. 31. Marks H. The Progress of Experiment. Cambridge University Press: Cambridge, 1997. 32. Committee for the Assessment of Biometric Aspects of Controlled Trials of Hypoglycemic Agents. Report of the committee for the assessment of biometric aspects of controlled trials of hypoglycemic agents. Journal of the American Medical Association 1975; 231:583 – 608. 33. Schor S. The University Group Diabetes Program: a statistician looks at the mortality results. Journal of the American Medical Association 1971: 217:1671 – 1675. 34. Feinstein AR. Clinical biostatistics VIII: an analytic appraisal of the University Group Diabetes Program (UGDP) study. Clinical Pharmacology 1971; 12:167 – 191. 35. Seltzer HS. A summary of criticisms of the ndings and conclusions of the University Group Diabetes Program (UGDP). Diabetes 1972; 21:976 – 979. 36. Corneld J. The University Group Diabetes Program: a further statistical analysis of the mortality ndings. Journal of the American Medical Association 1971; 217:1676 – 1687. 37. Prentice RL. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association 1986; 81:321 – 327. 38. Diamond G, Forrester J. 
Clinical trials and statistical verdicts: probable grounds for appeal. Annals of Internal Medicine 1983; 98(3):385 – 394. 39. Berry G. The analysis of mortality by the subject-years method. Biometrics 1983; 39:173 – 184. 40. Kilo C, Miller J, Williamson J. The achilles heel of the university group diabetes program. Journal of the American Medical Association 1980; 245(5):450 – 457. 41. Kilo C, Miller J, Williamson J. The crux of the UGDP: Spurious results and biologically inappropriate data analysis. Diabetologia 1980; 18(3):179 – 185.
Part II MODELLING MULTIPLE DATA SETS: META-ANALYSIS
Tutorials in Biostatistics Volume 2: Statistical Modelling of Complex Medical Data Edited by R. B. D’Agostino ? 2004 John Wiley & Sons, Ltd. ISBN: 0-470-02370-8 249
250
S-L. NORMAND
META-ANALYSIS
251
252
S-L. NORMAND
META-ANALYSIS
253
254
S-L. NORMAND
META-ANALYSIS
255
256
S-L. NORMAND
META-ANALYSIS
257
258
S-L. NORMAND
META-ANALYSIS
259
260
S-L. NORMAND
META-ANALYSIS
261
262
S-L. NORMAND
META-ANALYSIS
263
264
S-L. NORMAND
META-ANALYSIS
265
266
S-L. NORMAND
META-ANALYSIS
267
268
S-L. NORMAND
META-ANALYSIS
269
270
S-L. NORMAND
META-ANALYSIS
271
272
S-L. NORMAND
META-ANALYSIS
273
274
S-L. NORMAND
META-ANALYSIS
275
276
S-L. NORMAND
META-ANALYSIS
277
278
S-L. NORMAND
META-ANALYSIS
279
280
S-L. NORMAND
META-ANALYSIS
281
282
S-L. NORMAND
META-ANALYSIS
283
284
S-L. NORMAND
META-ANALYSIS
285
286
S-L. NORMAND
META-ANALYSIS
287
TUTORIAL IN BIOSTATISTICS Advanced methods in meta-analysis: multivariate approach and meta-regression Hans C. van Houwelingen1; ∗; † , Lidia R. Arends 2 and Theo Stijnen2 1 Department
of Medical Statistics; Leiden University Medical Center; P.O. Box 9604; 2300 RC Leiden; The Netherlands 2 Department of Epidemiology & Biostatistics; Erasmus Medical Center Rotterdam; P.O. Box 1738; 3000 DR Rotterdam; The Netherlands
SUMMARY This tutorial on advanced statistical methods for meta-analysis can be seen as a sequel to the recent Tutorial in Biostatistics on meta-analysis by Normand, which focused on elementary methods. Within the framework of the general linear mixed model using approximate likelihood, we discuss methods to analyse univariate as well as bivariate treatment eects in meta-analyses as well as meta-regression methods. Several extensions of the models are discussed, like exact likelihood, non-normal mixtures and multiple endpoints. We end with a discussion about the use of Bayesian methods in meta-analysis. All methods are illustrated by a meta-analysis concerning the ecacy of BCG vaccine against tuberculosis. All analyses that use approximate likelihood can be carried out by standard software. We demonstrate how the models can be tted using SAS Proc Mixed. Copyright ? 2002 John Wiley & Sons, Ltd. KEY WORDS:
meta-analysis; meta-regression; multivariate random eects models
1. INTRODUCTION In this paper we review advanced statistical methods for meta-analysis as used in bivariate meta-analysis [1] (that is, two outcomes per study are modelled simultaneously) and metaregression [2]. It can be seen as a sequel to the recent Tutorial in Biostatistics on meta-analysis by Normand [3]. Meta-analysis is put in the context of mixed models using (approximate) likelihood methods to estimate all relevant parameters. In the medical literature meta-analysis is usually applied to the results of clinical trials, but the application of the theory presented in this paper is not limited to clinical trials only. It is the aim of the paper not only to discuss the underlying theory but also to give practical guidelines how to carry out these analyses. As the leading example we use the meta-analysis data set of Colditz et al. [4]. This data set is also discussed in Berkey et al. [2]. Wherever feasible, it is specied how the analysis can
∗ Correspondence
to: Hans C. van Houwelingen, Department of Medical Statistics, Leiden University Medical Center, P.O. Box 9604, 2300 RC Leiden, The Netherlands
[email protected]
† E-mail:
Tutorials in Biostatistics Volume 2: Statistical Modelling of Complex Medical Data Edited by R. B. D’Agostino ? 2004 John Wiley & Sons, Ltd. ISBN: 0-470-02370-8 289
290
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
be performed by using the SAS procedure Proc Mixed. The paper is organized as follows. In Section 2 we review the concept of approximate likelihood that was introduced in the metaanalysis setting by DerSimonian and Laird [5]. In Section 3 we review the meta-analysis of one-dimensional treatment eect parameters. In Section 4 we discuss the bivariate approach [1] and its link with the concept of underlying risk as source of heterogeneity [6–10]. In Section 5 we discuss meta-regression within the mixed model setting. Covariates considered are aggregate measures on the study level. We do not go into meta-analysis with patientspecic covariates. In principle that is not dierent from analysing a multi-centre study [11]. In Section 6 several extensions are discussed: exact likelihood’s based on conditioning; nonnormal mixtures; multiple endpoints; other outcome measures, and other software. This is additional material that can be skipped at rst reading. Section 7 is concerned with the use of Bayesian methods in meta-analysis. We argue that Bayesian methods can be useful if they are applied at the right level of the hierarchical model. The paper is concluded in Section 8.
2. APPROXIMATE LIKELIHOOD The basic situation in meta-analysis is that we are dealing with n studies in which a parameter of interest #i (i = 1; : : : ; n) is estimated. In a meta-analysis of clinical trials the parameter of interest is some measure of the dierence in ecacy between the two treatment arms. The most popular choice is the log-odds ratio, but this could also be the risk or rate dierence or the risk or rate ratio for dichotomous outcome or similar measures for continuous outcomes or survival data. All studies report an estimate #ˆi of the true #i and the standard error si of the estimate. If the studies only report the estimate and the p-value or a condence interval, we can derive the standard error from the p-value or the condence interval. In the Sections 3 to 5, which give the main statistical tools, we act as if #ˆi has a normal distribution with unknown mean #i and known standard deviation si , that is #ˆi ∼ N(#i ; si2 )
(1)
Moreover, since the estimates are derived from dierent data sets, the #ˆi are conditionally independent given #i . This approximate likelihood approach goes back to the seminal paper by DerSimonian and Laird [5]. However, it should be stressed that it is not the normality of the frequency distribution of #ˆi that is employed in our analysis. Since our whole approach is likelihood based, we only use that the likelihood of the unknown parameter in each study looks like the likelihood of (1). Thus, if we denote the log-likelihood of the ith study by ‘i (#), the real approximation is 1 ‘i (#) = − (# − #ˆi ) 2 =si2 + ci 2
(2)
where ci is some constant that does not depend on the unknown parameter. If in each study the unknown parameter is estimated by maximum likelihood, approximation (2) is just the second-order Taylor expansion of the (prole) log-likelihood around the MLE #ˆi . The approximation (2) is usually quite good, even if the estimator #ˆi is discrete. Since most studies indeed use the maximum likelihood method to estimate the unknown parameter, we are condent that (2) can be used as an approximation. In Section 6 we will discuss
ADVANCED METHODS IN META-ANALYSIS
291
some renements of this approximation. In manipulating the likelihoods we can safely act as if we assume that (1) is valid and use, for example, known results for mixtures of normal distributions. However, we want to stress that actually we only use assumption (2). The approach of Yusuf et al. [12], popular in xed eect meta-analysis, and of Whitehead and Whitehead [13] are based on a Taylor expansion of the log-likelihood around the value # = 0. This is valid if the eects in each study are relatively small. It gives an approximation in the line of (2) with dierent estimators and standard errors but a similar quadratic expression in the unknown parameter. As we have already noted, the most popular outcome measure in meta-analysis is the logodds ratio. Its estimated standard error is equal to ∞ if one of the frequencies in the 2 × 2 table is equal to zero. That is usually repaired by adding 0.5 to all cell frequencies. We will discuss more appropriate ways of handling this problem in Section 6.
3. ANALYSING ONE-DIMENSIONAL TREATMENT EFFECTS The analysis under homogeneity makes the assumption that the unknown parameter is exactly the same in all studies, that is #1 = #2 = · · · = #n = #. The log-likelihood for # is given by ‘(#) =
i
‘i (#) =−
1 [(# − #ˆi ) 2 =si2 + ln(si2 ) + ln(2)] 2 i
(3)
Maximization is straightforward and results in the well-known estimator of the common eect $ ˆ 2 2 ˆ 1=si #i =si #hom = i
with standard error
i
% &
SE(#ˆ hom ) = 1
i
1=si2
Condence intervals for # can be based on normal distributions, since the si2 terms are assumed to be known. Assuming the si2 terms to be known instead of to be estimated has little impact on the results [14]. This is the basis for the traditional meta-analysis. The assumption of homogeneity is questionable even if it is hard to disprove for small metaanalyses [15]. That is, heterogeneity might be present and should be part of the analysis even if the test for heterogeneity is not signicant. Heterogeneity is found in many meta-analyses and is likely to be present since the individual studies are never identical with respect to study populations and other factors that can cause dierences between studies. The popular model for the analysis under heterogeneity is the normal mixture model, introduced by DerSimonian and Laird [5], that considers the #i to be an independent random sample from a normal population #i ∼ N(#; 2 ) Normality of this mixture is a true assumption and not a simplifying approximation. We will further discuss it in Section 6. The resulting marginal distribution of #i is easily obtained as
292
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
#ˆi ∼ N(#; 2 + si2 ) with corresponding log-likelihood ‘(#; 2 ) = −
1 [(# − #ˆi ) 2 =( 2 + si2 ) + ln( 2 + si2 ) + ln(2)] 2 i
(4)
Notice that (3) and (4) are identical if 2 = 0. This log-likelihood is the basis for inference about both parameters # and 2 . Maximum likelihood estimates can be obtained by dierent algorithms. In the example below, it is shown how the estimates can be obtained by using the SAS procedure Proc Mixed. If 2 were known, the ML estimate for # would be $ ˆ 2 2 2 2 ˆ (#i =( + si ) [1=( + si ) #het = i
with standard error
i
% & ˆ SE(#het ) = 1 1=( 2 + si2 ) i
The latter can also be used if 2 is estimated and the estimated value is plugged in, as is done in the standard DerSimonian and Laird approach. The construction of condence intervals for both parameters is more complicated than in the case of a simple sample from a normal distribution. Simple 2 - and t-distributions with d:f : = n−1 are not appropriate. In this article all models are tted using SAS Proc Mixed, which gives Satherthwaite approximation based condence intervals. Another possibility is to base condence intervals on the likelihood ratio test, using prole log-likelihoods. That is, the condence interval consists of all parameter values that are not rejected by the likelihood ratio test. Such condence intervals often have amazingly accurate coverage probabilities [16, 17]. Brockwell and Gordon [18] compared the commonly used DerSimonian and Laird method [5] with the prole likelihood method. Particularly when the number of studies is modest, the DerSimonian and Laird method had coverage probabilities considerably below 0.95 and the prole likelihood method achieved the best coverage probabilities. The prole log-likelihoods are dened by p‘1 (#) = max ‘(#; 2 ) 2
and
p‘2 ( 2 ) = max ‘(#; 2 ) #
ˆ − p‘1 (#)), the 95 per cent condence Based on the usual 2[1] -approximation for 2(p‘1 (#) ˆ − 1:92 (1.92 is the 95 per interval for # is obtained as all #’s satisfying p‘1 (#)¿p‘1 (#) 2 cent centile of the [1] distribution 3.84 divided by 2) and similarly for 2 . Unlike the usual condence interval based on Wald’s method, this condence interval for # implicitly accounts for the fact that 2 is estimated. Testing for heterogeneity is equivalent to testing H0 : 2 = 0 against H1 : 2 ¿0. The likelihood ratio test statistic is T = 2(p‘2 (ˆ 2 ) − p‘2 (0)). Since 2 = 0 is on the boundary of the parameter space, T does not have a 2[1] -distribution, but its distribution is a mixture with probabilities half of the degenerate distribution in zero and the 2[1] -distribution [19]. That means that the p-value of the naive LR-test has to be halved. Once the mixed model has been tted, the
ADVANCED METHODS IN META-ANALYSIS
293
following information is available at the overall level: (i) #ˆ and its condence interval, showing the existence or absence of an overall eect; (ii) ˆ 2 and its condence interval (and the test for heterogeneity), showing the variation between studies; (iii) approximate 95 per cent prediction interval for the true parameter #ˆ new of a new unrelated study: #ˆ ± 1:96ˆ (approximate in the sense that it ignores the error in the estimation of # and ); (iv) an estimate of the probability of a positive result of a new study: ˆ ) P(#new ¿0) = (#= ˆ (where is the standard normal cumulative distribution function). The following information is available at the individual study level: (i) posterior condence intervals for the true #i ’s of the studies in the meta-analysis based ˆ Bi si2 ) with Bi = ˆ 2 =(ˆ 2 + si2 ). The on the posterior distribution #i | #ˆi ∼ N(#ˆ + Bi (#ˆi − #); posterior means or so-called empirical Bayes estimates give a more realistic view on the results of, especially, the small studies. See the meta-analysis tutorial of Normand [3] for more on this subject. 3.1. Example To illustrate the above methods we make use of the meta-analysis data given by Colditz et al. [4]. Berkey et al. [2] also used this data set to illustrate their random-eects regression approach to meta-analysis. The meta-analysis concerns 13 trials on the ecacy of BCG vaccine against tuberculosis. In each trial a vaccinated group is compared with a non-vaccinated control group. The data consist of the sample size in each group and the number of cases of tuberculosis. Furthermore some covariates are available that might explain the heterogeneity among studies: geographic latitude of the place where the study was done; year of publication, and method of treatment allocation (random, alternate or systematic). The data are presented in Table I. We stored the data in an SAS le called BCG data.sd2 (see Data step in SAS commands below). The treatment eect measure we have chosen is the log-odds ratio, but the analysis could be carried out in the same way for any other treatment eect measure. 3.1.1. Fixed eects model. The analysis under the assumption of homogeneity is easily performed by hand. Only for the sake of continuity and uniformity do we also show how the analysis can be carried out using SAS software. The ML-estimate of the log-odds ratio for trial i is
YA; i =(nA; i − YA; i ) ln ORi = log YB; i =(nB; i − YB; i ) where YA; i and YB; i are the number of disease cases in the vaccinated (A) and non-vaccinated group (B) in trial i, and nA; i and nB; i the sample sizes. The corresponding within-trial variance,
294
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
Table I. Example: data from clinical trials on ecacy of BCG vaccine in the prevention of tuberculosis [2, 4]. Trial
1 2 3 4 5 6 7 8 9 10 11 12 13 ∗
Vaccinated
Not vaccinated
Disease
No disease
Disease
No disease
4 6 3 62 33 180 8 505 29 17 186 5 27
119 300 228 13536 5036 1361 2537 87886 7470 1699 50448 2493 16886
11 29 11 248 47 372 10 499 45 65 141 3 29
128 274 209 12619 5761 1079 619 87892 7232 1600 27197 2338 17825
ln(OR)
Latitude
Year
Allocation∗
−0:93869 −1:66619 −1:38629 −1:45644 −0:21914 −0:95812 −1:63378 0.01202 −0:47175 −1:40121 −0:34085 0.44663 −0:01734
44 55 42 52 13 44 19 13 27∗ 42 18 33 33
48 49 60 77 73 53 73 80 68 61 74 69 76
Random Random Random Random Alternate Alternate Random Random Random Systematic Systematic Systematic Systematic
This was actually a negative number; we used the absolute value in the analysis.
computed from the inverse of the matrix of second derivatives of the log-likelihood, is var(ln ORi ) =
1 1 1 1 + + + YA; i nA; i − YA; i YB; i nB; i − YB; i
which is also known as Woolf’s formula. These within-trial variances were stored in the same SAS data le as above, called BCG data.sd2. In the analysis, these variances are assumed to be known and xed. # THE DATA STEP; data BCG_data; input TRIAL VD VWD NVD NVWD LATITUDE YEAR ALLOC; LN_OR=log((VD/VWD)/(NVD/NVWD)); EST=1/VD+1/VWD+1/NVD+1/NVWD; datalines; 1 4 119 11 128 44 2 6 300 29 274 55 3 3 228 11 209 42 4 62 13536 248 12619 52 5 33 5036 47 5761 13 6 180 1361 372 1079 44 7 8 2537 10 619 19 8 505 87886 499 87892 13 9 29 7470 45 7232 27 10 17 1699 65 1600 42 11 186 50448 141 27197 18 12 5 2493 3 2338 33 13 27 16886 29 17825 33 ; proc print;run;
48 49 60 77 73 53 73 80 68 61 74 69 76
1 1 1 1 2 2 1 1 1 3 3 3 3
295
ADVANCED METHODS IN META-ANALYSIS
Running these SAS commands gives the following output: OBS
TRIAL
1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4 5 6 7 8 9 10 11 12 13
VD 4 6 3 62 33 180 8 505 29 17 186 5 27
VWD
NVD
119 300 228 13536 5036 1361 2537 87886 7470 1699 50448 2493 16886
11 29 11 248 47 372 10 499 45 65 141 3 29
NVWD 128 274 209 12619 5761 1079 619 87892 7232 1600 27197 2338 17825
LATITUDE 44 55 42 52 13 44 19 13 27 42 18 33 33
YEAR
ALLOC
48 49 60 77 73 53 73 80 68 61 74 69 76
1 1 1 1 2 2 1 1 1 3 3 3 3
LN_OR
EST
-0.93869 -1.66619 -1.38629 -1.45644 -0.21914 -0.95812 -1.63378 0.01202 -0.47175 -1.40121 -0.34085 0.44663 -0.01734
0.35712 0.20813 0.43341 0.02031 0.05195 0.00991 0.22701 0.00401 0.05698 0.07542 0.01253 0.53416 0.07164
The list of variables matches that in Table I (VD = vaccinated and diseased, VWD = vaccinated and without disease, NVD = not vaccinated and diseased, NVWD = not vaccinated and without disease). The variable ln or contains the estimated log-odds ratio of each trial and the variable est contains its variance per trial. In the Proc Mixed commands below, SAS assumes that the within trial variances are stored in a variable with the name ‘est’. # THE FIXED EFFECTS MODEL; Proc mixed method =ml data=BCG data; class trial; model ln or=/ s ; repeated /group=trial; parms / parmsdata=BCG data
eqcons=1 to 13;
# call SAS procedure; # species ‘trial’ as classication variable; # an intercept only model; print the solution s; # each trial has its own within-trial variance; # the parmsdata-option reads in the variable EST (indicating the within-trial variances) from the dataset BCG data.sd2; # the within trial variances are considered to be known and must be kept constant;
run;
Running this analysis gives the following output: The MIXED Procedure (...) Solution for Fixed Effects Effect INTERCEPT
Estimate -0.43627138
Std Error 0.04227521
DF 12
t -10.32
Pr > |t| 0.0001
Alpha 0.05
Lower Upper -0.5284 -0.3442
The estimate of the common log-odds ratio is equal to −0:436 with standard error = 0:042 leading to a 95 per cent Wald based condence interval of the log-odds ratio from −0:519 to −0:353. (Although it seems overly precise, we will present results to three decimals, since these are used in further calculations and to facilitate comparisons between results of dierent
296
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
models.) This corresponds to an estimate of 0.647 with a 95 per cent condence interval from 0.595 to 0.703 for the odds ratio itself. Thus we can conclude that vaccination is benecial. The condence intervals and p-values provided by SAS Proc Mixed are based on the t-distribution rather than on the standard normal distribution, as is done in the standard likelihood approach. The number of degrees of freedom of the t-distribution is determined by Proc Mixed according to some algorithm. One can choose between several algorithms, but one can also specify in the model statement the number of degrees of freedom to be used for each covariable, except for the intercept. To get the standard Wald condence interval and p-value for the intercept, the number of degrees of freedom used for the intercept should be specied to be ∞, which can be accomplished by making a new intercept covariate equal to 1 and subsequently specifying ‘no intercept’ (‘noint’). The SAS statement to be used is then: model ln or = int = s cl noint ddf = 1000;
(the variable ‘int’ is a self-made intercept variable equal to 1). 3.1.2. Simple random eects model, maximum likelihood. The analysis under heterogeneity can be carried out by executing the following SAS statements. Unlike the previous model where we read in the within-trial variances from the datale, we now specify the within-trial variances explicitly in the ‘parms’ statement. This has to be done because we want to dene a grid of values for the rst covariance parameter, that is, the between-trial variance, to get the prole likelihood function for the between-trial variance to get its likelihood ratio based 95 per cent condence interval. Of course, one could also give only one starting value and read the data from an SAS datale like we did before. # THE RANDOM EFFECTS MODEL (MAXIMUM LIKELIHOOD); Proc mixed cl method=ml data=BCG data; # call of procedure; ‘cl’ asks for condence class trial; model ln or= / s cl; random int/ subject=trial s; repeated /group=trial; parms (0.01 to 2.00 by 0.01)(0.35712) (0.20813)(0.43341)(0.02031)(0.05195) (0.00991)(0.22701)(0.00401)(0.05698) (0.07542)(0.01253)(0.53416)(0.07164) /eqcons=2 to 14; make ‘Parms’ out=Parmsml;
run;
intervals of covariance parameters; # trial is classication variable; # an intercept only model. print xed eect solution ‘s’ and its condence limits ‘cl’; # trial is specied as random eect; ‘s’ asks for the empirical Bayes estimates; # each trial has its own within trial variance; # denes grid of values for between trial variance (from 0.01 to 1.00), followed by the 13 within trial variances which are assumed to be known and must be kept xed; # in the dataset ‘Parms’ the maximum log likelihood for each value of the grid specied for the between trial variance is stored, in order to read o the prole likelihood based 95% CI for the between trial variance;
297
ADVANCED METHODS IN META-ANALYSIS
Profile log-likelihood for variance -12
Profile log-likelihood
-14 -16 -18 -20 -22 -24 -26 2
1 .8
.6
.4
.2
.1 8 .0 6 .0
4 .0
Between-study variance
Figure 1. The 95 per cent condence interval of the between-trial variance 2 based on the prole likelihood funtion: (0:12; 0:89).
Running this program gives the following output: The MIXED Procedure (...) Covariance Parameter Estimates (MLE) Cov Parm INTERCEPT
Subject TRIAL
Group
Estimate 0.30245716
Alpha 0.05
Lower 0.1350
Upper 1.1810
(...) Solution for Fixed Effects Effect INTERCEPT
Estimate -0.74197023
Std Error 0.17795376
DF 12
t -4.17
Pr > |t| 0.0013
Alpha 0.05
Lower Upper -1.1297 -0.3542
The ML-estimate of the mean log-odds ratio is −0:742 with standard error 0.178. The standard Wald based 95 per cent condence interval is −1:091 to −0:393. (SAS Proc Mixed gives a slightly wider condence interval based on a t-distribution with d:f : = 12). This corresponds to an estimated odds ratio of 0.476 with a 95 per cent condence interval from 0.336 to 0.675. The ML-estimate of the between-trial variance 2 is equal to 0.302. For each value of the grid specied in the ‘Parms’ statement for the between-trial variance (in the example the grid runs from 0.01 to 2.00 with steps of 0.01), the maximum log-likelihood value is stored as variable ‘LL’ in the SAS le ‘Parmsml.sd2’. Plotting the maximum log-likelihood values against the grid of between-trial variances gives the prole likelihood plot for the between-trial variance presented in Figure 1. From this plot or a listing of the data set ‘Parmsml.sd2’ one can read o the prole likelihood based 95 per cent condence interval for the between-trial
298
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
Profile log-likelihood for mean -13
profile log-likelihood
-14 -15 -16 -17 -18 -19 -20 -21 -2.0
-1.6
-1.2
-.8
-.4
0.0
.4
mean
Figure 2. The 95 per cent condence interval of the treatment eect (log-odds ratio) based on the prole likelihood funtion: (−1:13; −0:37).
variance 2 . This is done by looking for the two values of the between-trial variance with a corresponding log-likelihood of 1.92 lower than the maximum log-likelihood. The 95 per cent prole likelihood based condence interval for 2 is (0.12, 0.89). (SAS Proc Mixed gives a Satterthwaite approximation based 95 per cent condence interval running from 0.135 to 1.180.) Notice that by comparing the maximum log-likelihood of this model with the previous xed eects model, one gets the likelihood ratio test for homogeneity (the p-value has to be halved, because 2 = 0 is on the boundary of the parameter space). A prole likelihood based condence interval for the mean treatment eect # can be made by trial and error by dening the variable y=ln or-c as dependent variable for various values of c and specifying a model without intercept (add ‘noint’ after the slash in the model statement). Then look for the two values of c that decrease the maximum log-likelihood by 1.92. The prole log-likelihood plot for # is given in Figure 2. The 95 per cent condence interval for the log-odds ratio # is (−1:13; −0:37), slightly wider than the simple Wald approximation given above. This corresponds with a 95 per cent condence interval for the odds ratio of 0.323 to 0.691. Remark In Proc Mixed one can also choose the restricted maximum likelihood (REML) estimate (specify method=reml instead of method=ml). Then the resulting estimate for the betweentrial variance 2 is identical to the iterated DerSimonian–Laird estimator [5]. However, in this case the prole likelihood function should not be used to make a condence interval for the log-odds ratio #. The reason is that dierences between maximized REML likelihoods cannot be used to test hypotheses concerning xed parameters in a general linear mixed model [20]. The observed and corresponding empirical Bayes estimated log-odds ratios with their 95 per cent standard Wald, respectively, the 95 per cent posterior condence intervals per trial, are presented in Figure 3. This gure shows the shrinkage of the empirical Bayes estimates
ADVANCED METHODS IN META-ANALYSIS
1 (n=
262)
2 (n=
609)
3 (n=
451)
299
4 (n= 26,465)
TRIAL
5 (n= 10,877) 6 (n=
2,992)
7 (n=
3,174)
8 (n=176,782) 9 (n= 14,776) 10 (n=
3,381)
11 (n= 77,972) 12 (n=
4,839)
13 (n= 34,767) -3
-2
-1
0
1
2
3
LOG ODDS RATIO
95% Confidence interval:
ˆ = -0.742, 95% Conf. Int.= (-1.091 ; -0.393)
95% Prediction interval:
ˆ = -0.742, 95% Pred. Int. = (-1.820 ; 0.336)
Figure 3. Forrest plot with the estimated log-odds ratios of tuberculosis with their 95 per cent condence intervals in the trials included in the meta-analysis. The dashed horizontal lines indicate the standard Wald condence intervals. The solid horizontal lines indicate the posterior or so-called empirical Bayes condence intervals. The vertical line indicates the ML-estimate of the common (true) log-odds ratio. Below the gure the 95 per cent condence interval for the mean log-odds ratio and the 95 per cent prediction interval for the true log-odds ratio are presented.
towards the estimated mean log-odds ratio and their corresponding smaller posterior condence intervals. The overall condence interval of the mean true treatment eect and the overall prediction interval of the true treatment eect are given at the bottom of the gure. The 95 per cent prediction interval indicates the interval in which 95 per cent of the true treatment eects of new trials are expected to fall. It is calculated as the ML-estimate plus and minus 1.96 times the estimated between-trial standard deviation s and is here equal to (−1:820 to 0.336). The estimated probability for a new trial having a positive true treatment eect is (0:742=0:302) = 0:993.
4. BIVARIATE APPROACH In the previous section the parameter of interest was one-dimensional. In many situations it can be bivariate or even multivariate, for instance when there are more treatment groups or more outcome variables. In this section we discuss the case of a two-dimensional parameter of interest. We introduce the bivariate approach with special reference to the situation where one is interested in ‘control rate regression’, that is, relating the treatment eect size to the risk of events in the control group. However, the approach applies generally. Many studies show considerable variation in what is called the baseline risk. The baseline risk indicates the risk for patients under the control condition, which is the average risk of the patients in that trial when the patients were treated with the control treatment. One might wonder if there is a relation between treatment eect and baseline risk. Considering only the
300
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
Figure 4. L’Abb e plot of observed log(odds) of the not-vaccinated trial arm versus the vaccinated trial arm. The size of the circle is an indication for the inverse of the variance of the log-odds ratio in that trial. Below the x = y line, the log-odds in the vaccinated are lower than the log-odds in the not-vaccinated arm, indicating that the vaccination works. On or above the x = y line, vaccination does not work benecially.
dierences between the study arms may hide a lot information. Therefore, we think it is wise to consider the pair of outcomes of the two treatments. This is nicely done in the l’Abb e plot [21], that gives a bivariate representation of the data by plotting the log-odds in arm A versus the log-odds in arm B. We show the plot in Figure 4 for the data of our example with A the vaccinated arm and B the not-vaccinated arm. The size of each circle represents the inverse of the variance of the log-odds ratio in that trial. Points below the line of identity correspond to trials with an observed positive eect of vaccination. The graph shows some eect of vaccination especially at the higher incidence rates. A simple (approximate) bivariate model for any observed pair of arm specic outcome measures !i = (!ˆ A; i ; !ˆ B; i ) with standard errors (sA; i ; sB; i ) in trial i is
2 s !ˆ A; i !A; i ; A; i ∼N !ˆ B; i ! B; i 0
0 2 sB; i
(i = 1; : : : ; n)
where !i = (!A; i ; ! B; i ) is the pair of true arm specic outcome measures for trial i. The conditional independence of !ˆ A and !ˆ B given the true !A and ! B is a consequence of the randomized parallel study design and the fact that !A and ! B are arm specic. In general, for instance in a cross-over study, or when !A and ! B are treatment eects on two dierent outcome variables, the estimates might be correlated.
ADVANCED METHODS IN META-ANALYSIS
301
The mixed model approach assumes the pair (!A; i ; ! B; i ) to follow a bivariate normal distribution, where, analogous to the univariate random eects model of Section 3, the true outcome measures for both arms in the trials are normally distributed around some common mean treatment-arm outcome measure with a between-trial covariance matrix :
!A; i !A
AA AB ; with = ∼N ! B; i !B
AB BB
AA and BB describe the variability among trials in true risk under the vaccination and control condition, respectively. AB is the covariance between the true risk in the vaccination and control group. The resulting marginal model is
!ˆ A; i !A ; + Ci ∼N !ˆ B; i !B
with Ci the diagonal matrix with the si2 ’s. Maximum likelihood estimation for this model can be quite easily carried out by a selfmade program based on the EM algorithm as described in reference [1], but more practically convenient is to use appropriate mixed model software from statistical packages, such as the SAS procedure Proc Mixed. Once the model is tted, the following derived quantities are of interest: (i) The mean dierence (!A − ! B ) and its standard error √
{(var(!A ) + var(! B ) − 2 cov(!A ; ! B ))}
(ii) The population variance of the dierence var(!A − ! B ) = AA + BB − 2 AB . (iii) The shape of the bivariate relation between the (true) !A and ! B . That can be described by ellipses of equal density or by the regression lines of !A on ! B and of the ! B on !A . These lines can be obtained from classical bivariate normal theory. For example, the regression line of !A on ! B has slope = AB = BB and residual variance AA − 2AB = BB . The regression of the dierence (!A − ! B ) on either !A or ! B can be derived similarly. At the end of this section we come back to the usefulness of these regression lines. The standard errors of the regression slopes can be calculated from the covariance matrix of the estimated covariance parameters by the delta method or by Fieller’s method [22]. 4.1. Example (continued): bivariate random eects model As an example we carry out a bivariate meta-analysis with !A and ! B the log-odds of tuberculosis in the vaccinated and the not-vaccinated control arm, respectively. To execute a bivariate analysis in the SAS procedure Proc Mixed, we have to change the structure of the data set. Each treatment arm of a trial becomes a row in the data set, resulting in twice as many rows as in the original data set. The dependent variable is now the estimated log-odds
302
H. C. VAN HOUWELINGEN, L. R. ARENDS AND T. STIJNEN
in a treatment arm instead of the log-odds ratio. The new data set is called BCGdata2.sd2 and the observed log-odds is called lno. The standard error of the observed log-odds, estimated by taking the 'square root ( of minus the inverse of the second derivative of the log-likelihood, √ 1 1 is equal to x + n−x , where n is the sample size of a treatment arm and x is the number of tuberculosis cases in a treatment arm. These standard errors are also stored in the SAS data set covvars2.sd2. The bivariate random eects analysis can be carried out by running the SAS commands given below. In the data step, the new data set BCGdata2.sd2 is made out of the data set BCGdata2.sd2, and the covariates are dened on trial arm level. The variable exp is 1 for the vaccinated (experimental) arms and 0 for the not-vaccinated (control) arms. The variable con is dened analogously with experimental and control reversed. The variable arm identies the 26 unique treatment arms from the 13 studies (here from 1 to 26); latcon, latexp, yearcon and yearexp are covariates to be used later. For numerical reasons we centralized the four variables latcon, latexp, yearcon and yearexp by substracting the mean. # THE DATA STEP (BIVARIATE ANALYSIS)
data bcgdata2;set bcg_data; treat=1; lno=log(vd/vwd); var=1/vd+1/vwd; n=vd+vwd; output; treat=0; lno=log(nvd/nvwd); var=1/nvd+1/nvwd; n=nvd+nvwd; output; keep trial lno var n treat latitude--alloc; run; data bcgdata2;set bcgdata2; arm=_n_; exp=(treat=1); con=(treat=0); latcon=(treat=0)*(latitude-33); latexp=(treat=1)*(latitude-33); yearcon=(treat=0)*(year-66); yearexp=(treat=1)*(year-66); proc print noobs;run;
Running these SAS commands gives the following output:
TRIAL  LATITUDE  YEAR  ALLOC  TREAT  LNO       VAR      ARM  EXP  CON  LATCON  LATEXP  YEARCON  YEAREXP
1      44        48    1      1      -3.39283  0.25840  1    1    0    0       11      0        -18
1      44        48    1      0      -2.45413  0.09872  2    0    1    11      0       -18      0
2      55        49    1      1      -3.91202  0.17000  3    1    0    0       22      0        -17
2      55        49    1      0      -2.24583  0.03813  4    0    1    22      0       -17      0
3      42        60    1      1      -4.33073  0.33772  5    1    0    0       9       0        -6
3      42        60    1      0      -2.94444  0.09569  6    0    1    9       0       -6       0
4      52        77    1      1      -5.38597  0.01620  7    1    0    0       19      0        11
4      52        77    1      0      -3.92953  0.00411  8    0    1    19      0       11       0
5      13        73    2      1      -5.02786  0.03050  9    1    0    0       -20     0        7
5      13        73    2      0      -4.80872  0.02145  10   0    1    -20     0       7        0
6      44        53    2      1      -2.02302  0.00629  11   1    0    0       11      0        -13
6      44        53    2      0      -1.06490  0.00361  12   0    1    11      0       -13      0
7      19        73    1      1      -5.75930  0.12539  13   1    0    0       -14     0        7
7      19        73    1      0      -4.12552  0.10162  14   0    1    -14     0       7        0
8      13        80    1      1      -5.15924  0.00199  15   1    0    0       -20     0        14
8      13        80    1      0      -5.17126  0.00202  16   0    1    -20     0       14       0
9      27        68    1      1      -5.55135  0.03462  17   1    0    0       -6      0        2
9      27        68    1      0      -5.07961  0.02236  18   0    1    -6      0       2        0
10     42        61    3      1      -4.60458  0.05941  19   1    0    0       9       0        -5
10     42        61    3      0      -3.20337  0.01601  20   0    1    9       0       -5       0
11     18        74    3      1      -5.60295  0.00540  21   1    0    0       -15     0        8
11     18        74    3      0      -5.26210  0.00713  22   0    1    -15     0       8        0
12     33        69    3      1      -6.21180  0.20040  23   1    0    0       0       0        3
12     33        69    3      0      -6.65844  0.33376  24   0    1    0       0       3        0
13     33        76    3      1      -6.43840  0.03710  25   1    0    0       0       0        10
13     33        76    3      0      -6.42106  0.03454  26   0    1    0       0       10       0
# THE PROCEDURE STEP (BIVARIATE RANDOM EFFECTS ANALYSIS)

proc mixed cl method=ml data=BCGdata2 asycov;
   # call procedure; 'asycov' asks for the asymptotic covariance matrix of the covariance parameters;
class trial arm;
   # trial and arm are classification variables;
model lno= exp con / noint s cl covb ddf=1000, 1000;
   # model with indicator variables 'exp' and 'con' as explanatory variables for the log-odds;
   # confidence intervals and p-values for the coefficients of 'exp' and 'con' should be based on the
   # standard normal distribution (i.e. t-distribution with df = ∞); 'covb' asks for the covariance
   # matrix of the fixed effects parameters;
random exp con / subject=trial type=un s;
   # experimental and control treatment are random effects, possibly correlated within a trial and
   # independent between trials; the covariance matrix (Σ) is unstructured; 's' prints the empirical
   # Bayes estimates;
repeated / group=arm;
   # each study-arm in each trial has its own within study-arm variance (matrix Ci); within study
   # estimation errors are independent (default);
estimate 'difference' exp 1 con -1 / cl df=1000;
   # the 'estimate' command produces estimates of linear combinations of the fixed parameters with
   # standard error computed from the covariance matrix of the estimates; here we ask for the
   # estimate of the mean log-odds ratio;
parms / parmsdata=covvars2 eqcons=4 to 29;
   # the data file covvars2.sd2 contains the variable 'est' with starting values for the three
   # covariance parameters of the random effects together with the 26 within study-arm variances;
   # the latter are assumed to be known and should be kept fixed;
run;
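The parameter data set covvars2 itself is not listed in the paper. The sketch below is our guess at how such a file could be constructed from the data listing above: it assumes the layout implied by the parms statement, namely a single variable est whose first three values are (arbitrary) starting values for UN(1,1), UN(2,1) and UN(2,2) and whose remaining 26 values are the fixed within study-arm variances (the variable var) in the order of treatment arms 1 to 26.

data covvars2;
   input est @@;     # one covariance parameter value per record, read free-format;
   datalines;
1 0.5 1
0.25840 0.09872 0.17000 0.03813 0.33772 0.09569 0.01620 0.00411
0.03050 0.02145 0.00629 0.00361 0.12539 0.10162 0.00199 0.00202
0.03462 0.02236 0.05941 0.01601 0.00540 0.00713 0.20040 0.33376
0.03710 0.03454
;
run;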
Running this program gives the following output:

The MIXED Procedure
(...)
Covariance Parameter Estimates (MLE)
Cov Parm   Subject   Estimate     Alpha   Lower    Upper
UN(1,1)    TRIAL     1.43137384   0.05    0.7369   3.8894
UN(2,1)    TRIAL     1.75732532   0.05    0.3378   3.1768
UN(2,2)    TRIAL     2.40732608   0.05    1.2486   6.4330

(...)
Solution for Fixed Effects
Effect   Estimate      Std Error    DF     t       Pr > |t|   Alpha   Lower     Upper
EXP      -4.83374538   0.33961722   1000   -14.23  0.0001     0.05    -5.5002   -4.1673
CON      -4.09597366   0.43469692   1000   -9.42   0.0001     0.05    -4.9490   -3.2430

Covariance Matrix for Fixed Effects
Effect   Row   COL1         COL2
EXP      1     0.11533985   0.13599767
CON      2     0.13599767   0.18896142

(...)
ESTIMATE Statement Results
Parameter    Estimate      Std Error    DF     t      Pr > |t|   Alpha   Lower     Upper
difference   -0.73777172   0.17973848   1000   -4.10  0.0001     0.05    -1.0905   -0.3851
The fixed parameter estimates θ̂ = (θ̂_A, θ̂_B) = (−4.834, −4.096) represent the estimated mean log-odds in the vaccinated and non-vaccinated group, respectively. The estimated between-trial variance of the log-odds is Σ̂_AA = 1.431 in the vaccinated groups and Σ̂_BB = 2.407 in the not-vaccinated groups. The between-trial covariance is estimated to be Σ̂_AB = 1.757. Thus, the estimated correlation between the true vaccinated and true control log-odds is Σ̂_AB/√(Σ̂_AA · Σ̂_BB) = 0.947. The estimated covariance matrix for the ML-estimates θ̂_A and θ̂_B is
   ( var(θ̂_A)        cov(θ̂_A, θ̂_B) )   =   ( 0.115   0.136 )
   ( cov(θ̂_B, θ̂_A)   var(θ̂_B)      )       ( 0.136   0.189 )

The estimated mean vaccination effect, measured as the log-odds ratio, is equal to (θ̂_A − θ̂_B) = (−4.834 − (−4.096)) = −0.738. The standard error of the mean vaccination effect is equal to √{var(θ̂_A) + var(θ̂_B) − 2 cov(θ̂_A, θ̂_B)} = √(0.115 + 0.189 − 2 · 0.136) = 0.180, almost identical to the result of the univariate mixed model. This corresponds to an estimated odds ratio of exp(−0.738) = 0.478 with a 95 per cent confidence interval equal to (0.336, 0.680), again strongly suggesting an average beneficial vaccination effect. The slope of the regression line to predict the log-odds in the vaccinated group from the log-odds in the not-vaccinated group is equal to Σ̂_AB/Σ̂_BB = (1.757/2.407) = 0.730. The slope of the reverse relationship is equal to Σ̂_AB/Σ̂_AA = (1.757/1.431) = 1.228. The variance of the treatment effect, measured as the log-odds ratio, calculated from Σ̂ is (1.431 + 2.407 − 2 · 1.757) = 0.324, which is only slightly different from what we found earlier in the univariate random effects analysis. The conditional variance of the true log-odds, and therefore also of the log-odds ratio, in the vaccinated group given the true log-odds in the not-vaccinated group is (Σ_AA − Σ_AB²/Σ_BB) = (1.431 − 1.757²/2.407) = 0.149, which is interpreted as the variance between treatment effects among trials with the same baseline risk. So baseline risk, measured as the true log-odds in the not-vaccinated group, explains (0.324 − 0.149)/0.324 = 54 per cent of the heterogeneity in vaccination effect between the trials. The 95 per cent coverage region of the estimated bivariate distribution can be plotted in the so-called l'Abbé plot [21] in Figure 5.

Figure 5. The 95 per cent coverage region for the pairs of true log-odds under vaccination and non-vaccination. The diagonal line is the line of equality between the two log-odds. Observed data from the trials are indicated with ◦, the empirical Bayes estimates with •. The common mean is indicated with the • central in the plot. The ellipse is obtained from a line plot based on the equation (x − θ̂)' Σ̂⁻¹ (x − θ̂) = 5.99.

Figure 5 nicely shows that the vaccination effect depends on the baseline risk (log-odds in the not-vaccinated group) and that the heterogeneity in the difference between the log-odds in the vaccinated versus the not-vaccinated treatment arms is for a large part explained by the regression coefficient being substantially smaller than 1. It also shows the shrinkage of the empirical Bayes estimates towards the main axis of the ellipse. In this example we specified the model in Proc Mixed as a model with two random intercepts in which the fixed parameters
correspond to θ_A and θ_B. An alternative would be to specify the model as a random-intercept random-slope model, in which the fixed parameters correspond to θ_B and the mean treatment effect θ_A − θ_B. Then the SAS commands should be modified as follows:

model lno=treat / s cl covb ddf=1000;
random int treat / subject=trial type=un s;
Here int refers to a random trial-specific intercept.

4.2. Relation between effect and baseline risk

The relation between treatment effect and baseline risk has been very much discussed in the literature [6–9, 23–30]. There are two issues that complicate the matter:
1. The relation between 'observed difference A − B' and 'observed baseline risk B' is prone to spurious correlation, since the measurement error in the latter is negatively correlated with the measurement error in the first. It would be better to study B versus A, or B − A versus (A + B)/2.
2. Even in the regression of 'observed risk in group A' on 'observed baseline risk in group B', which is not hampered by correlated measurement errors, the estimated slope is attenuated due to measurement error in the observed baseline risk [31].
For an extensive discussion of these problems see the article of Sharp et al. [32]. In dealing with measurement error there are two approaches [31, 33]:
(i) The functional equation approach: true regressors as nuisance parameters.
(ii) The structural equation approach: true regressors as random quantities with an unknown distribution.
The usual likelihood theory is not guaranteed to work for the functional equation approach because of the large number of nuisance parameters. The estimators may be inconsistent or have the wrong standard errors. The bivariate mixed model approach to meta-analysis used in this paper is in the spirit of the structural approach. The likelihood method does work for the structural equation approach, so in this respect our approach is safe. Of course, the question of robustness of the results against misspecification of the mixing distribution is raised. However, Verbeke and Lesaffre [34] have shown that, in the general linear mixed model, the fixed effect parameters as well as the covariance parameters are still consistently estimated when the distribution of the random effects is misspecified, so long as the covariance structure is correct. Thus our approach yields (asymptotically) unbiased estimates of slope and intercept of the regression line even if the normal distribution assumption is not fulfilled, although the standard errors might be wrong. Verbeke and Lesaffre [34] give a general method for robust estimation of the standard errors. The mix of many fixed and a few random effects as proposed by Thompson et al. [8] and the models of Walter [9] and Cook and Walter [29] are more in the spirit of the functional approach. These methods are meant to impose no conditions on the distribution of the true baseline risks. The method of Walter [9] was criticized by Bernsen et al. [35]. Sharp and Thompson [30] use other arguments to show that Walter's method is seriously flawed. In a letter to the editor by Van Houwelingen and Senn [36] following the article of Thompson
et al. [8], Van Houwelingen and Senn [36] argue that putting Bayesian priors on all nuisance parameters, as done by Thompson et al., does not help to solve the inconsistency problem. This view is also supported in the chapter on Bayesian methods in the book of Carroll et al. [31]. It would be interesting to apply the ideas of Carroll et al. [31] in the setting of meta-analysis, but that is beyond the scope of this paper. Arends et al. [10] compare, in a number of examples, the approach of Thompson et al. [8] with the method presented here, and the results were in line with the remarks of Van Houwelingen and Senn [36]. Sharp and Thompson [30], comparing the different approaches in a number of examples, remark that whether or not to assume a distribution for the true baseline risks remains a debatable issue. Arends et al. [10] also compared the approximate likelihood method as presented here with an exact likelihood approach where the parameters are estimated in a Bayesian manner with vague priors, and found no relevant differences.
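To give an impression of the size of the attenuation mentioned under issue 2, a standard classical measurement error calculation can be used (our illustration, in the notation of Section 4, not part of the original discussion): if the observed baseline log-odds equals the true value plus an estimation error with variance s_B², and the true baseline log-odds vary between trials with variance Σ_BB, the naive regression on the observed baseline is biased towards zero roughly by the reliability factor

   Σ_BB / (Σ_BB + s_B²)

In the BCG example the within-arm variances (about 0.002 to 0.34) are small relative to Σ̂_BB ≈ 2.4, so the attenuation would be modest there, but it can be substantial in meta-analyses of small trials.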
5. META-REGRESSION

In case of substantial heterogeneity between the studies, it is the statistician's duty to explore possible causes of the heterogeneity [15, 37–39]. In the context of meta-analysis that can be done by covariates on the study level that could 'explain' the differences between the studies. The term meta-regression to describe such analyses goes back to papers by Bashore et al. [40], Jones [41], Greenland [42] and Berlin and Antman [37]. We consider only analyses at the aggregated meta-analytic level. Aggregated information (mean age, percentage males) can describe the differences between studies. We will not go into covariates on the individual level. If such information exists, the data should be analysed on the individual patient level by hierarchical models. That is possible and a sensible thing to do, but beyond the scope of this paper. We will also not consider covariates on the study arm level. These can be relevant in non-balanced observational studies. Such covariates could both correct the treatment effect itself in case of confounding and explain existing heterogeneity between studies. Although the methods presented in this paper might be applied straightforwardly, we will restrict attention to balanced studies in which no systematic difference between the study arms is expected. Since the number of studies in a meta-analysis is usually quite small, there is a great danger of overfitting. The rule of thumb of one explanatory variable for each 5 (10) 'cases' leaves room for only a few explanatory variables in a meta-regression. In the example we have three covariates available: latitude, year of study, and method of treatment allocation. Details are given in Table I. In the previous section we have seen that heterogeneity between studies can be partly explained by differences in baseline risk. Thus, it is also important to investigate whether covariates on the study level are associated with the baseline risk. That asks for a truly multivariate regression with a two-dimensional outcome, but we will start with the simpler regression for the one-dimensional treatment effect difference measure.

5.1. Regression for difference measure

Let X_i stand for the (row)vector of covariates of study i, including the constant term. Meta-regression relates the true difference θ_i to the 'predictor' X_i. This relation cannot be expected to be perfect; there might be some residual heterogeneity that could be modelled by a normal
distribution once again, that is, θ_i ∼ N(X_i β, τ²). Taking into account the imprecision of the observed difference measure θ̂_i we get the marginal approximate model

   θ̂_i ∼ N(X_i β, τ² + s_i²)

This model could be fitted by iteratively reweighted least squares, where a new estimate of τ² is used in each iteration step, or by full maximum likelihood with appropriate software. In the following we will describe how the model can be fitted in SAS.

5.2. Example (continued)

A graphical presentation of the data is given in Figure 6. Latitude and year of publication both seem to be associated with the log-odds ratio, while latitude and year are also correlated. Furthermore, at first sight, the three forms of allocation seem to have little different average treatment effects.

5.2.1. Regression on latitude. The regression analysis for the log-odds ratio on latitude can be carried out by running the following mixed model in SAS:

proc mixed cl method=ml data=BCG_data;
   # call procedure;
class trial;
   # trial is classification variable;
model ln_or=latitude / s cl covb;
   # latitude is the only predictor variable;
random int / subject=trial s;
   # random trial effect;
repeated / group=trial;
   # each trial has its own within study variance;
parms / parmsdata=covvars3 eqcons=2 to 14;
   # data set covvars3 contains a starting value for the between study variance and 13 within
   # study variances which should be kept fixed;
run;
Running this program gives the following output:

The MIXED Procedure
(...)
Covariance Parameter Estimates (MLE)
Cov Parm    Subject   Estimate     Alpha   Lower    Upper
INTERCEPT   TRIAL     0.00399452   0.05    0.0004   1.616E29

(...)
Solution for Fixed Effects
Effect      Estimate      Std Error    DF   t      Pr > |t|   Alpha   Lower    Upper
INTERCEPT   0.37108745    0.10596655   11   3.50   0.0050     0.05    0.1379   0.6043
LATITUDE    -0.03272329   0.00337134   0    -9.71  .          0.05    .        .

Covariance Matrix for Fixed Effects
Effect      Row   COL1          COL2
INTERCEPT   1     0.01122891    -0.00031190
LATITUDE    2     -0.00031190   0.00001137
Figure 6. Graphical relationships between the variables with a weighted least squares regression line. The size of the circle corresponds to the inverse variance of the log-odds ratio in that trial.
The residual between-study variance in this analysis turns out to be 0.004, which is dramatically smaller than the between-study variance of 0.302 in the random effects model above without the covariate latitude in the model. Thus latitude explains 98.7 per cent of the between-trial variance in treatment effect differences. The regression coefficients for the intercept and for latitude are 0.371 (standard error = 0.106) and −0.033 (standard error = 0.003), respectively.
The estimated correlation between these estimated regression coefficients is −0.873. Just for comparison we give the results of an ordinary weighted linear regression. The weights are equal to the inverse squared standard error of the log-odds ratio, instead of the correct weights equal to the inverse of the squared standard error of the log-odds ratio plus τ̂². The intercept was 0.395 (SE = 0.124) and the slope −0.033 (SE = 0.004). The results are only slightly different, which is explained by the very small residual between-study variance.

5.2.2. Regression on year. Running the same model as above, only changing latitude into year, the residual between-study variance becomes 0.209. Thus year of publication explains 30.8 per cent of the between-trial variance in treatment effect differences, much less than the variance explained by the covariate latitude. The regression coefficients for the intercept and for year are −2.800 (standard error = 1.031) and 0.030 (standard error = 0.015), respectively. The estimated correlation between these estimated regression coefficients is −0.989. Again, just for comparison, we also give the results of the ordinary weighted linear regression. The intercept was −2.842 (SE = 0.876) and the slope 0.033 (SE = 0.012). As in the previous example, the differences are relatively small.

5.2.3. Regression on allocation. Running the model with allocation as the only (categorical) covariate (in the SAS commands, specify: class trial alloc;) gives a residual between-study variance equal to 0.281. This means that only 7 per cent of the between-trial variance in the treatment effect differences is explained by the different forms of allocation. The treatment effects (log-odds ratios) do not differ significantly between the trials with random, alternate and systematic allocation (p = 0.396).

5.2.4. Regression on latitude and year. When both covariates latitude and year are put into the model, the residual between-study variance becomes only 0.002, corresponding to an explained variance of 99.3 per cent, only slightly more than by latitude alone. The regression coefficients for the intercept, latitude and year are, respectively, 0.494 (standard error = 0.529), −0.034 (standard error = 0.004) and −0.001 (standard error = 0.006). We conclude that latitude gives the best explanation of the differences in vaccination effect between the trials, since it already explains 98 per cent of the variation. Since the residual variance is so small, the regression equation in this example could have been obtained by ordinary weighted linear regression under the assumption of homogeneity. In the original medical report [4] on this meta-analysis the authors mentioned the strong relationship between treatment effect and latitude as well. They speculated that the biological explanation might be the presence of non-tuberculous mycobacteria in the population, which is associated with geographical latitude.

Goodness-of-fit of the model obtained above can be checked, as in the weighted least squares approach, by individual standardization of the residuals, (θ̂_i − X_i β̂)/√(τ̂² + s_i²), and using standard goodness-of-fit checks. In interpreting the results of a meta-regression analysis, it should be kept in mind that this is all completely observational. Clinical judgement is essential for a correct understanding of what is going on. Baseline risk may be an important confounder and we will study its effect below.
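The ordinary weighted linear regressions quoted above for comparison can be reproduced with any regression routine that accepts weights. A minimal sketch in SAS is given below; the weight variable w is our assumption (it is not defined in the paper) and would be computed in an earlier data step as the inverse squared standard error of the log-odds ratio. Note that PROC REG rescales the reported standard errors by the estimated residual variance, so they need not coincide exactly with fixed-effects meta-analytic standard errors.

proc reg data=BCG_data;
   weight w;                 /* w = 1/se(log-odds ratio)**2, assumed to be computed beforehand */
   model ln_or = latitude;   /* ordinary weighted linear regression of the log-odds ratio on latitude */
run;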
5.3. Bivariate regression

The basis of the model is the relation between the pair (θ_{A,i}, θ_{B,i}), for example (true log-odds in the vaccinated group, true log-odds in the control group), and the covariate vector X_i. Since the covariate has influence on both components we have a truly multivariate regression problem in the classical sense, which can be modelled as

   (θ_{A,i}, θ_{B,i})' ∼ N(BX_i, Σ)

Here, the matrix B is a matrix of regression coefficients: the first row for the A-component and the second row for the B-component. Taking into account the errors in the estimates we get the (approximate) model

   (θ̂_{A,i}, θ̂_{B,i})' ∼ N(BX_i, Σ + C_i)
Fitting this model to the data can again be done by a self-made program using the EM algorithm or by programs such as SAS Proc Mixed. The hardest part is the interpretation of the model. We will discuss the interpretation for the example. So far we have shown for our leading example the univariate fixed effects model, the univariate random effects model without covariates, the bivariate random effects model without covariates and eventually the univariate random effects model with covariates. We end this section with a bivariate random effects model with covariates.

5.4. Example (continued): bivariate meta-analysis with covariates

To carry out the bivariate regression analyses in SAS Proc Mixed we again need the data set BCGdata2.sd2, which was organized on treatment arm level. In this example we take latitude as the covariate. The model can be fitted using the SAS code given below, where the variables exp, con and arm have the same meaning as in the bivariate analysis above without covariates. The variable latcon is for the not-vaccinated (control) groups equal to the latitude value of the trial and zero for the vaccinated (experimental) groups. The variable latexp is defined analogously with vaccinated and non-vaccinated reversed.

proc mixed cl method=ml data=BCGdata2;
   # call procedure;
class trial arm;
   # trial and treatment arm are defined as classification variables;
model lno= con exp latcon latexp / noint s cl ddf=1000,1000,1000,1000;
   # model with indicator variables 'exp' and 'con' together with latitude as explanatory
   # variables for the log-odds in both treatment groups;
random con exp / subject=trial type=fa0(2) s;
   # control arm and experimental trial arm are specified as random effects; the covariance
   # matrix is unstructured, parameterized as factor analytic;
repeated / group=arm;
   # each study-arm in each trial has its own within study-arm error variance;
parms / parmsdata=covvars4 eqcons=4 to 29;
   # in the data file covvars4 three starting values are given for the between study covariance
   # matrix, together with the 26 within study-arm variances; the latter are assumed to be known
   # and kept fixed;
estimate 'difference slopes' latexp 1 latcon -1 / cl df=1000;
   # estimate of the difference in slope between the vaccinated and not-vaccinated groups;
run;
Remark
In the program above we specified type=fa0(2) instead of type=un for Σ. If one chooses the latter, the covariance matrix is parameterized as

   ( σ₁   σ₂ )
   ( σ₂   σ₃ )

and unfortunately the program does not converge if the estimated correlation is (very near to) 1, as is the case here. If one chooses the former, the covariance matrix is parameterized as

   ( δ₁₁²      δ₁₁ δ₁₂       )
   ( δ₁₁ δ₁₂   δ₁₂² + δ₂₂²   )

and the program converges even if the estimated correlation is 1, that is, if δ₂₂ = 0.

Running the program gives the following output:

The MIXED Procedure
(...)
Covariance Parameter Estimates (MLE)
Cov Parm   Subject   Estimate      Alpha   Lower    Upper
FA(1,1)    TRIAL     1.08715174    0.05    0.7582   1.6896
FA(2,1)    TRIAL     1.10733154    0.05    0.6681   1.5466
FA(2,2)    TRIAL     -0.00000000   .       .        .

(...)
Solution for Fixed Effects
Effect   Estimate      Std Error    DF     t       Pr > |t|   Alpha   Lower     Upper
CON      -4.11736845   0.30605608   1000   -13.45  0.0001     0.05    -4.7180   -3.5168
EXP      -4.82570990   0.31287126   1000   -15.42  0.0001     0.05    -5.4397   -4.2118
LATCON   0.07246261    0.02192060   1000   3.31    0.0010     0.05    0.0294    0.1155
LATEXP   0.03913388    0.02239960   1000   1.75    0.0809     0.05    -0.0048   0.0831

ESTIMATE Statement Results
Parameter           Estimate      Std Error    DF     t       Pr > |t|   Alpha   Lower     Upper
difference slopes   -0.03332874   0.00284902   1000   -11.70  0.0001     0.05    -0.0389   -0.0277
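To translate the FA(·,·) output back to the between-study covariance matrix: under the fa0(2) parameterization described in the Remark, Σ̂ is the product of the estimated triangular factor with its transpose. Working out the arithmetic for the estimates above (our addition; the first component corresponds to con because of the order 'random con exp'):

   Σ̂ = ( 1.08715²            1.08715 × 1.10733 )   =   ( 1.1819   1.2038 )
        ( 1.08715 × 1.10733   1.10733² + 0²     )       ( 1.2038   1.2262 )

which is the nearly singular matrix quoted in the interpretation below.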
Figure 7. Log-odds versus latitude for control group A and experimental group B.

In Figure 7 the relationship between latitude and the log-odds of tuberculosis is presented for the vaccinated treatment arms A as well as for the non-vaccinated treatment arms B. For
the not-vaccinated trial arms the regression line is log(odds) = −4.117 + 0.072 (latitude − 33) = −6.509 + 0.072 latitude (standard errors of intercept and slope are 0.794 and 0.022, respectively). Notice that latitude was centralized at latitude = 33 (see Section 4.1). For the vaccinated trial arms the regression line is log(odds) = −4.826 + 0.039 (latitude − 33) = −6.117 + 0.039 latitude (standard errors of intercept and slope are 0.809 and 0.022, respectively). We see that latitude has a strong effect, especially on the log-odds of the non-vaccinated study group. The between-study covariance matrix Σ̂ is equal to the nearly singular matrix

   ( 1.1819   1.2038 )
   ( 1.2038   1.2262 )

The estimated regression line of the treatment difference measure on latitude is log-odds ratio(A vs B) = 0.392 − 0.033 latitude, with standard errors 0.093 and 0.003 for intercept and slope, respectively. This regression line is almost identical to the one resulting from the univariate analysis in the previous example. The estimated residual between-study variance is only 0.0003, meaning that latitude explains almost all heterogeneity in the treatment effects. The regression line of the difference measure on both latitude and baseline risk is: log-odds ratio(A vs B) = 0.512 − 0.039 latitude + 0.019 log-odds_B. The standard errors can be calculated by the delta method. We see that the regression coefficient of the baseline log-odds is quite small compared to the analysis without any covariates. The results of this bivariate regression and the results of the simple bivariate model without covariates of Section 4 are summarized in Table II. By explaining variation in treatment effects by latitude, hardly any residual variation is left. Although this is all observational, we come to the tentative conclusion that the effect of vaccination depends on latitude rather than on baseline risk.

Table II. Residual variance of treatment effect in different meta-regression models.

Explanatory variables in the model     Residual variance of treatment effect
No covariates                          0.324
Baseline                               0.149
Latitude                               0.0003
Baseline + latitude                    0.0001
6. EXTENSIONS: EXACT LIKELIHOODS, NON-NORMAL MIXTURES, MULTIPLE ENDPOINTS

The approximate likelihood solutions may be suspect if the sample sizes per study are relatively small. There are different approaches to repair this and to make the likelihoods less approximate. We will first discuss the bivariate analysis, where things are relatively easy, and then the analysis of difference measures.

6.1. More precise analysis of bivariate data

Here, the outcome measures per study arm are direct maximum likelihood estimates of the relevant parameter. The estimated standard error is derived from the second derivative of the log-likelihood evaluated at the ML-estimate. Our approach is an approximation for fitting a generalized linear mixed model (GLMM) by the maximum likelihood method. The latter is hard to carry out. A popular approximation is by means of the second-order Laplace approximation or the equivalent PQL method [43], that is, based on an iterative scheme where the second derivative is evaluated at the posterior mode. This can easily be mimicked in the SAS procedure Proc Mixed by iteratively replacing the estimated standard error, computed from the empirical Bayes estimate as yielded by the software. For the analysis of log-odds as in the example, one should realize that the variance of the log-odds is derived from the second derivative of the log-likelihood evaluated at the ML-estimate of p, and is given by 1/(np(1 − p)). In the first iteration, p is estimated by the fraction of events in the study arm. In the next iteration p is replaced by the value derived from the empirical Bayes estimate for the log-odds. This is not very hard to do and easy to implement in a SAS macro that iteratively uses Proc Mixed (see the example below; the macro version is available from the authors). This will help for intermediate sample sizes and moderate random effect variances. There are, however, possible situations (small samples, large random effect variances) in which the second-order approximations do not work [44] and one has to be very careful in computing and maximizing the likelihoods. Fortunately, that is much more of a problem for random effects at the individual level than at the aggregated level we have here.
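Written out, the iterative scheme just described amounts to the following update (our restatement of the recipe in the text, with θ̃_i denoting the empirical Bayes estimate of the log-odds in study arm i from the previous fit):

   p̃_i = exp(θ̃_i) / (1 + exp(θ̃_i)),    s_i² = 1 / (n_i p̃_i (1 − p̃_i))

The new s_i² replace the fixed within study-arm variances, Proc Mixed is run again, and the loop is repeated until the estimates stabilize.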
6.2. Example (continued)

After running the bivariate random effects model discussed in Section 4, the empirical Bayes estimates can be saved by adding the statement

   make 'Predicted' out=Pred;
in the Proc Mixed command and adding a 'p' after the slash in the model statement. In this way the empirical Bayes estimates for the log-odds are stored as variable PRED in the new SAS data file Pred.sd2. The within-trial variances in the next iteration of the SAS procedure Proc Mixed are derived from these empirical Bayes estimates in the way we described above. The three starting values needed for the between-trial variance matrix are stored as variable est in the SAS file covvars5.sd2. Thus, after running the bivariate random effects model once and saving the empirical Bayes estimates for the log-odds, one can run the two data steps described below to compute the new estimates for the within-trial variances, use these within-trial variances in the next bivariate mixed model, save the new empirical Bayes estimates and repeat the whole loop. This iterative process should be continued until the parameter estimates converge.

# DATA STEP TO COMBINE EMPIRICAL BAYES ESTIMATES AND ORIGINAL DATAFILE FROM SECTION 4 AND TO
# CALCULATE THE NEW WITHIN-TRIAL VARIANCES;
data Pred1;
   merge BCGdata2 Pred;
   pi=exp(_PRED_)/(1+exp(_PRED_));
   est=1/(n*pi*(1-pi));
run;

# DATA STEP TO CREATE THE TOTAL DATAFILE THAT IS NEEDED IN THE PARMS-STATEMENT (BETWEEN- AND
# WITHIN-TRIAL VARIANCES);
data Pred2;
   set covvars5 Pred1;
run;

# PROCEDURE STEP TO RUN THE BIVARIATE RANDOM EFFECTS MODEL WITH NEW WITHIN-TRIAL VARIANCES,
# BASED ON THE EMPIRICAL BAYES ESTIMATES.
proc mixed cl method=ml data=BCGdata2 asycov;
   class trial arm;
   model lno= exp con / p noint s cl covb ddf=1000, 1000;
   random exp con / subject=trial type=un s;
   repeated / group=arm subject=arm;
   estimate 'difference' exp 1 con -1 / cl df=1000;
   parms / parmsdata=Pred2 eqcons=4 to 29;
run;
Running the data steps and the mixed model iteratively until convergence is reached gives the following output:

The MIXED Procedure
(...)
Covariance Parameter Estimates (MLE)
Cov Parm   Subject   Estimate     Alpha   Lower    Upper
UN(1,1)    TRIAL     1.43655989   0.05    0.7392   3.9084
UN(2,1)    TRIAL     1.76956270   0.05    0.3395   3.1996
UN(2,2)    TRIAL     2.43849037   0.05    1.2663   6.4991

Solution for Fixed Effects
Effect   Estimate      Std Error    DF     t       Pr > |t|   Alpha   Lower     Upper
EXP      -4.84981269   0.34001654   1000   -14.26  0.0001     0.05    -5.5170   -4.1826
CON      -4.10942999   0.43736103   1000   -9.40   0.0001     0.05    -4.9677   -3.2512

Covariance Matrix for Fixed Effects
Effect   Row   COL1         COL2
EXP      1     0.11561125   0.13690215
CON      2     0.13690215   0.19128467

ESTIMATE Statement Results
Parameter    Estimate      Std Error    DF     t      Pr > |t|   Alpha   Lower     Upper
difference   -0.74038270   0.18191102   1000   -4.07  0.0001     0.05    -1.0974   -0.3834
The mean outcome measures (log-odds) for arms A and B are, respectively, −4.850 (standard error = 0.340) and −4.109 (standard error = 0.437). The between-trial variance of the log-odds is Σ̂_AA = 1.437 in the vaccinated treatment arm A and Σ̂_BB = 2.438 in the not-vaccinated arm B. The estimate of the between-trial covariance is equal to Σ̂_AB = 1.770. The estimated mean vaccination effect in terms of the log-odds ratio is −0.740 (standard error = 0.182). In this example, convergence was already reached after one or two iterations. The final estimates are very similar to those of the original bivariate random effects analysis we discussed in Section 4, where the mean outcome measures θ̂_A and θ̂_B were, respectively, −4.834 (SE = 0.340) and −4.096 (SE = 0.434). Of course, when the number of patients in the trials were smaller, the benefit and necessity of this method would be more substantial. Another possibility if the approximate likelihood solutions are suspect is to use the exact likelihood, based on the binomial distribution of the number of events per treatment arm, and to estimate the parameters following a Bayesian approach with vague priors in combination with Markov chain Monte Carlo (MCMC) methods [45]. Arends et al. [10] give examples of this approach. In their examples the difference with the approximate likelihood estimates turned out to be very small.
6.3. More precise analysis of difference measures

The analysis of difference measures, that is, one summary measure per trial characterizing the difference in efficacy between treatments, is a bit more complicated because the baseline value is considered to be a nuisance parameter. Having this nuisance parameter can be avoided, and a lot of 'exactness' in the analysis can be gained, by suitable conditioning on ancillary statistics. In the case of binary outcomes one can condition on the marginals of the 2 × 2 tables and end up with the non-central hypergeometric distribution that only depends on the log-odds ratio. Details are given in Van Houwelingen et al. [1]. However, the hypergeometric distribution is far from easy to handle and it does not seem very attractive to try to incorporate covariates in such an analysis as well. The bivariate analysis is much easier to carry out, at the price of the assumption that the baseline parameter follows a normal distribution. However, that assumption can be relaxed as well, which brings us to the next extension: the non-normal mixture.
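For reference, the conditional likelihood alluded to here has the following standard form (not written out in the text). With x_A events among n_A subjects in arm A, x_B events among n_B subjects in arm B, m = x_A + x_B and ψ the log-odds ratio, conditioning on the margins of the 2 × 2 table gives

   P(X_A = x | m, n_A, n_B; ψ) = C(n_A, x) C(n_B, m − x) exp(ψx) / Σ_u C(n_A, u) C(n_B, m − u) exp(ψu)

where C(·,·) denotes a binomial coefficient and the sum runs over all admissible u. Each trial's contribution depends on ψ only, which is what removes the baseline nuisance parameter.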
6.4. Non-normal mixtures

The assumption of a normal distribution for the random effects might not be realistic. Technically speaking it is not very hard to replace the normal mixture by a fully non-parametric mixture. As is shown by Laird [46], the non-parametric maximum likelihood estimator of the mixing distribution is always a discrete mixture and can easily be estimated by means of the EM algorithm [47]. An alternative is to use the software C.A.MAN of Bohning et al. [48]. However, just fitting a completely non-parametric mixture is no good way of checking the plausibility of the normal mixture. The non-parametric estimates are always very discrete, even if the true mixture is normal. A better way is to see whether a mixture of two normals (with the same variance) fits better than a single normal. This model can describe a very broad class of distributions: unimodal as well as bimodal, symmetric as well as very skewed [19]. Another way is to estimate the skewness of the mixture somehow and mistrust the normality if the skewness is too big. It should be realized, however, that estimating mixtures is a kind of ill-posed problem and reliable estimates are hard to obtain [49]. To give an impression we fitted a non-parametric mixture, with the homemade program based on the EM algorithm described in Van Houwelingen et al. [1], to the log-odds ratio of our example using approximate likelihoods. The results were as follows:

atom          −1.4577   −0.9678   −0.3296   0.0023
probability   0.3552    0.1505    0.2980    0.1963

corresponding mean: −0.761; corresponding variance: 0.349

The first two moments agree quite well with the normal mixture. It is very hard to tell whether this four-point mixture gives any evidence against normality of the mixture. The bivariate normal mixture of Section 4 is even harder to check. Non-parametric mixtures are hard to fit in two dimensions. An interesting question is whether the estimated regression slopes are robust against non-normality. Arends et al. [10] modelled the baseline distribution with a mixture of two normal distributions and found in all their examples a negligible difference with modelling the baseline parameter with one normal distribution, indicating that the method is indeed robust [10]. However, this was only based on three examples and we do not exclude the possibility that in some other data examples the regression slopes might be more different.

6.5. Multiple outcomes

In a recent paper Berkey et al. [50] discussed a meta-analysis with multiple outcomes. A similar model was used in the context of meta-analysis of surrogate markers by Daniels and Hughes [51] and discussed by Gail et al. [52]. In the simplest case of treatment difference measures for several outcomes, the situation is very similar to the bivariate analysis of Sections 4 and 5. The model

   (θ_{A,i}, θ_{B,i})' ∼ N(BX_i, Σ)

could be used, where θ_A stands for the (difference) measure on outcome A and θ_B for the measure on outcome B. It could easily be generalized to more measures C, D, etc. The main difference is that the estimated effects are now obtained in the same sample and, therefore, will be correlated. An estimate of this correlation is needed to perform the analysis. The only thing that changes in comparison with Section 5 is that the matrix C_i in

   (θ̂_{A,i}, θ̂_{B,i})' ∼ N(BX_i, Σ + C_i)

is not diagonal any more but allows within-trial covariation. This approach can easily be adapted to the situation where there are more than two outcome variables or more treatment groups.

6.6. Example (Berkey et al. [50])

Berkey et al. [50] illustrate several fixed and random (multivariate) meta-regression models using a meta-analysis from Antczak-Bouckoms et al. [53]. This meta-analysis concerns five randomized controlled trials in which a surgical procedure is compared with a non-surgical procedure. Per patient two outcomes are assessed: (pre- and post-treatment change in) probing depth (PD) and (pre- and post-treatment change in) attachment level (AL). Since the efficacy of the surgical procedure may improve over time, a potential factor that may influence the trial results is the year of publication [50]. The two treatment effect measures are defined as:

   θ_PD = mean PD under surgical treatment − mean PD under non-surgical treatment
   θ_AL = mean AL under surgical treatment − mean AL under non-surgical treatment

The data are given in Table III.

Table III.
Trial   Publication year   θ̂_PD,i   θ̂_AL,i   var(θ̂_PD,i)   var(θ̂_AL,i)   cov(θ̂_PD,i, θ̂_AL,i)
1       1983               0.47      −0.32     0.0075         0.0077         0.0030
2       1982               0.20      −0.60     0.0057         0.0008         0.0009
3       1979               0.40      −0.12     0.0021         0.0014         0.0007
4       1987               0.26      −0.31     0.0029         0.0015         0.0009
5       1988               0.56      −0.39     0.0148         0.0304         0.0072

As an example we fit the model with year of publication as explanatory variable. Berkey et al. [50] fitted this model using a self-written program in SAS Proc IML. We show how it can be done with SAS Proc Mixed. The data set-up is the same as in the earlier discussed bivariate models, with two data rows per trial, one for each outcome measure. Also the Proc Mixed program is completely analogous. The only difference is that in the data set containing the elements of the C_i's now the covariance between the two outcomes per trial must be
specified as well. The SAS code is:

proc mixed cl method=ml data=berkey;
   # call procedure;
class trial type;
   # trial and outcome type (PD or AL) are classification variables;
model outcome=pd al pdyear alyear / noint s cl;
   # model with indicator variables 'pd' and 'al' together with publication year as explanatory
   # variables;
random pd al / subject=trial type=un s;
   # specification of the among-trial covariance matrix for both outcomes;
repeated type / subject=trial group=trial type=un;
   # specification of the (non-diagonal) within-trial covariance matrix;
parms / parmsdata=covvars6 eqcons=4 to 18;
   # covvars6 contains: 3 starting values for the two between-trial variances and the covariance,
   # 10 within-trial variances (5 per outcome measure) and 5 covariances; the last 15 parameters
   # are assumed to be known and must be kept fixed.
run;
Part of the SAS Proc Mixed output is given below.

The MIXED Procedure
(...)
Covariance Parameter Estimates (MLE)
Cov Parm   Subject   Estimate     Alpha   Lower     Upper
UN(1,1)    TRIAL     0.00804054   0.05    0.0018    2.0771
UN(2,1)    TRIAL     0.00934132   0.05    -0.0113   0.0300
UN(2,2)    TRIAL     0.02501344   0.05    0.0092    0.1857

(...)
Solution for Fixed Effects
Effect   Estimate      Std Error    DF   t      Pr > |t|   Alpha   Lower     Upper
PD       0.34867848    0.05229098   3    6.67   0.0069     0.05    0.1823    0.5151
AL       -0.34379097   0.07912671   3    -4.34  0.0225     0.05    -0.5956   -0.0920
PDYEAR   0.00097466    0.01543690   0    0.06   .          0.05    .         .
ALYEAR   -0.01082781   0.02432860   0    -0.45  .          0.05    .         .

The estimated model is

   θ_PD = 0.34887 + 0.00097 (year − 1984)
   θ_AL = −0.34595 − 0.01082 (year − 1984)
The standard errors of the slopes are 0.0154 and 0.0243 for PD and AL, respectively. The estimated among-trial covariance matrix is

   Σ̂ = ( 0.008   0.009 )
        ( 0.009   0.025 )

The results are identical to those of Berkey et al. [50] for the random-effects multiple-outcomes model estimated with the method named by Berkey the multivariate maximum likelihood (MML) method.

6.7. Other outcome measures

Our presentation concentrates on dichotomous outcomes. Much of it carries over to other effect measures that are measured on a different scale. For instance, our methods apply if the outcome variable is continuous and an estimate of the average outcome and its standard error is available in both treatment arms. However, in some cases only a relative effect is available, such as the standardized effect measure (difference in outcome/standard deviation of the measurements in the control group), which is popular in psychological studies. In that case only the one-dimensional analysis applies. A special case is survival analysis. The log hazard ratio in the Cox model cannot be written as the difference of two effect measures. However, some measure of baseline risk, for example the one-year survival rate in the control arm, might be defined, and the bivariate outcome analysis described above can be used to explore the relation between treatment effect and baseline risk. A complicating factor is that the two measures are not independent any more. However, if an estimate of the correlation between the two measures is available, the method can be applied.

6.8. Other software

Although we illustrated all our examples with the SAS procedure Proc Mixed, most if not all analyses could be carried out by other (general) statistical packages as well. A nice review of available software for meta-analysis has recently been written by Sutton et al. [54]. Any package, like SPSS, SAS, S-plus and Stata, that can perform a weighted linear regression suffices to perform a standard fixed effect meta-analysis or a fixed effects meta-regression. For fitting random effects models with approximate likelihood, a program for the general linear mixed model (GLMM) is needed, which is available in many statistical packages. However, not all GLMM programs are appropriate. One essential requirement of the program is that one can fix the within-trial variance in the model at arbitrary values per trial. In S-plus the function lme is used to fit linear mixed effects models, and all the analyses carried out with Proc Mixed of SAS in our examples can also be carried out with lme from S-plus. The 'parms' statement used by SAS to fix the within-trial variances corresponds with 'varFixed' in S-plus [55]. Several Stata macros have been written which implement some of the discussed methods [56, 57]. The Stata program meta of Sharp and Sterne [56] performs a standard fixed and random effects meta-analysis without covariates. The Stata command metareg of Sharp [57] extends this to univariate meta-regression. We are not aware of Stata programs that are capable of fitting bivariate meta-regression models, but of course one can do a univariate meta-regression on the log-odds ratios instead of a bivariate meta-regression on the log-odds of
the two treatment arms. However, such an analysis does not give any information about the relationship between the (true) log-odds of the two arms. MLwiN or MLn appears to be one of the most flexible methods to fit mixed-effect regression models [54]. Although we do not have experience with this package, we assume that most if not all of the discussed models can be fitted in it. Finally, in the freely available Bayesian analysis software package BUGS, one can also execute all approximate likelihood analyses that have been presented in this article. If vague prior distributions are used, the results are very similar. With BUGS it is also possible to fit the models using the exact likelihood, based on the binomial distribution of the number of events in a treatment arm. The reader is referred to Arends et al. [10] for examples and the required BUGS syntax.
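In more formal terms, the requirement of being able to fix the within-trial variances means that the mixed-model routine must fit a marginal covariance matrix of the form (our restatement of the models used throughout this paper, with Z_i the random-effects design matrix of trial i)

   V_i = Z_i Σ Z_i' + C_i

in which C_i is held fixed at the known within-trial (co)variance estimates instead of being treated as an unknown parameter. A GLMM program that can only estimate a single common residual variance cannot mimic this and is therefore not suitable for the analyses in this paper.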
7. BAYESIAN STATISTICS IN META-ANALYSIS

As we mentioned in Section 4, putting uninformative Bayesian priors on all individual nuisance parameters, as done in Thompson et al. [8], Daniels and Hughes [51], Smith et al. [58] and Sharp and Thompson [30], can lead to inconsistent results as the number of nuisance parameters grows with the number of studies [36]. This observation does not imply that we oppose Bayesian methods. First of all, there is a lot of Bayesian flavour to random effects meta-analysis. The mixing distribution can serve as a prior distribution in the analysis of the results of a new trial. However, the prior is estimated from the data and not obtained by educated subjective guesses; that is why random effects meta-analysis can be seen as an example of the empirical Bayes approach. For each study, the posterior distribution given the observed value can be used to obtain empirical Bayes corrections. In this paper we describe estimating the mixing distribution by maximum likelihood. The maximum likelihood method has two drawbacks. First, in complex problems, maximizing the likelihood might become far from easy and quite time-consuming. Second, the construction of confidence intervals with the correct coverage probabilities can become problematic. We proposed the profile likelihood approach in the simple setting of Section 3. For more complex problems, the profile likelihood gets very hard to implement. When the maximum likelihood approach gets out of control (very long computing times, non-convergence of the maximization procedure), it can be very profitable to switch to a Bayesian approach with vague priors on the parameters of the model, in combination with Markov chain Monte Carlo (MCMC) methods [45] that circumvent integration by replacing it by simulation. If one wants to use the MCMC technique in this context, the prior should be set on all parameters of the hierarchical model. Such a model could be described as a Bayesian hierarchical or Bayesian empirical Bayes model. For examples of this approach, see Arends et al. [10]. The difference with the approach of Thompson et al. [8, 30] is then that they assume that the true baseline log-odds are a random sample from a fully specified flat normal distribution (for example, N(0, 10)), while we assume that the true log-odds are sampled from a normal distribution whose mean and variance are parameters to be estimated, putting vague priors on them. So Thompson et al.'s model is a special case of our model. We prefer the parameters of the baseline risk distribution to be determined by the data. For the examples discussed in this paper, maximum likelihood was quite convenient in estimating the parameters of the model and getting a rough impression of their precision. It sufficed for the global analysis described
here. If the model is used to predict outcomes of new studies, as in the surrogate marker setting of Daniels and Hughes [51], nominal coverage of the prediction intervals becomes important and approximate methods can be misleading. MCMC can be very convenient, because the prediction problem can easily be embedded in the MCMC computations. An alternative is bootstrapping, as described in Gail et al. [52].

8. CONCLUSIONS

We have shown that the general linear mixed model using an approximate likelihood approach is a very useful and convenient framework to model meta-analysis data. It can be used for the simple meta-analysis up to complicated meta-analyses involving multivariate treatment effect measures and explanatory variables. Extension to multiple outcome variables and multiple treatment arms is very straightforward. Suitable software is widely available in statistical packages.

ACKNOWLEDGEMENTS
This paper is based on a course on meta-analysis given by Van Houwelingen and Stijnen at the ISCB in Dundee in 1998. We thank the organizers for their stimulating invitation. We thankfully acknowledge the help of Russ Wolfinger of SAS in finding the best Proc Mixed specifications for the bivariate models. We are very grateful to three reviewers for their detailed comments and valuable suggestions, which have led to considerable improvements of this paper.

REFERENCES
1. Van Houwelingen H, Zwinderman K, Stijnen T. A bivariate approach to meta-analysis. Statistics in Medicine 1993; 12:2272–2284.
2. Berkey CS, Hoaglin DC, Mosteller F, Colditz GA. Random effects regression model for meta-analysis. Statistics in Medicine 1995; 14:396–411.
3. Normand S-L. Meta-analysis: formulating, evaluating, combining and reporting. Statistics in Medicine 1999; 18:321–359.
4. Colditz GA, Brewer FB, Berkey CS, Wilson EM, Burdick E, Fineberg HV, Mosteller F. Efficacy of BCG vaccine in the prevention of tuberculosis. Journal of the American Medical Association 1994; 271:698–702.
5. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986; 7(3):177–188.
6. Brand R, Kragt H. Importance of trends in the interpretation of an overall odds ratio in the meta-analysis of clinical trials. Statistics in Medicine 1992; 11(16):2077–2082.
7. McIntosh MW. The population risk as an explanatory variable in research syntheses of clinical trials. Statistics in Medicine 1996; 15:1713–1728.
8. Thompson SG, Smith TC, Sharp SJ. Investigating underlying risk as a source of heterogeneity in meta-analysis. Statistics in Medicine 1997; 16(23):2741–2758.
9. Walter SD. Variation in baseline risk as an explanation of heterogeneity in meta-analysis. Statistics in Medicine 1997; 16:2883–2900.
10. Arends LR, Hoes AW, Lubsen J, Grobbee DE, Stijnen T. Baseline risk as predictor of treatment benefit: three clinical meta-re-analyses. Statistics in Medicine 2000; 19(24):3497–3518.
11. Fleiss JL. Analysis of data from multiclinic trials. Controlled Clinical Trials 1986; 7:267–275.
12. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade during and after myocardial infarction: an overview of the randomised trials. Progress in Cardiovascular Diseases 1985; 27:335–371.
13. Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of clinical trials. Statistics in Medicine 1991; 10:1665–1677.
14. Hardy RJ, Thompson SG. A likelihood approach to meta-analysis with random effects. Statistics in Medicine 1996; 15:619–629.
15. Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiologic Reviews 1987; 9:1–30.
16. Newcombe RG. Improved confidence intervals for the difference between binomial proportions based on paired data. Statistics in Medicine 1998; 17:2635–2650.
17. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 1998; 17:873–890.
18. Brockwell SE, Gordon IR. A comparison of statistical methods for meta-analysis. Statistics in Medicine 2001; 20:825–840.
19. Verbeke G. Linear mixed models for longitudinal data. In Linear Mixed Models in Practice, Verbeke G, Molenberghs G (eds). Springer-Verlag: New York, 1997; 63–153.
20. Roger JH, Kenward MG. Repeated measures using proc mixed instead of proc glm. In Proceedings of the First Annual South-East SAS Users Group Conference. SAS Institute: Cary NC, 1993; 199–208.
21. L'Abbé KA, Detsky AS. Meta-analysis in clinical research. Annals of Internal Medicine 1987; 107(2):224–233.
22. Armitage P, Berry G. Statistical Methods in Medical Research. Blackwell Scientific Publications: Oxford, 1978.
23. Davey Smith G, Song F, Sheldon TA. Cholesterol lowering and mortality: the importance of considering initial level of risk. British Medical Journal 1993; 306(6889):1367–1373.
24. Senn S. Importance of trends in the interpretation of an overall odds ratio in the meta-analysis of clinical trials (letter). Statistics in Medicine 1994; 13(3):293–296.
25. Brand R. Importance of trends in the interpretation of an overall odds ratio in the meta-analysis of clinical trials (letter). Statistics in Medicine 1994; 13(3):293–296.
26. Hoes AW, Grobbee DE, Lubsen J. Does drug treatment improve survival? Reconciling the trials in mild-to-moderate hypertension. Journal of Hypertension 1995; 13(7):805–811.
27. Egger M, Smith GD. Risks and benefits of treating mild hypertension: a misleading meta-analysis? [comment]. Journal of Hypertension 1995; 13(7):813–815.
28. Senn SJ. Relation between treatment benefit and underlying risk in meta-analysis. British Medical Journal 1996; 313:1550.
29. Cook RJ, Walter SD. A logistic model for trend in 2 × 2 × K tables with applications to meta-analyses. Biometrics 1997; 53(1):352–357.
30. Sharp SJ, Thompson SG. Analysing the relationship between treatment effect and underlying risk in meta-analysis: comparison and development of approaches. Statistics in Medicine 2000; 19:3251–3274.
31. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Chapman & Hall: London, 1995.
32. Sharp SJ, Thompson SG, Altman DG. The relation between treatment benefit and underlying risk in meta-analysis. British Medical Journal 1996; 313(7059):735–738.
33. Kendall MG, Stuart A. The Advanced Theory of Statistics. Volume II: Inference and Relationship. Griffin: London, 1973.
34. Verbeke G, Lesaffre E. The effect of misspecifying the random effects distribution in linear models for longitudinal data. Computational Statistics and Data Analysis 1997; 23:541–556.
35. Bernsen RMD, Tasche MJA, Nagelkerke NJD. Some notes on baseline risk and heterogeneity in meta-analysis. Statistics in Medicine 1999; 18(2):233–238.
36. Van Houwelingen HC, Senn S. Investigating underlying risk as a source of heterogeneity in meta-analysis (letter). Statistics in Medicine 1999; 18:107–113.
37. Berlin JA, Antman EM. Advantages and limitations of meta-analytic regressions of clinical trials data. Online Journal of Current Clinical Trials 1994; 134.
38. Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. British Medical Journal 1994; 309(6965):1351–1355.
39. Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of methods. Statistics in Medicine 1999; 18:2693–2708.
40. Bashore TR, Osman A, Heffley EF. Mental slowing in elderly persons: a cognitive psychophysiological analysis. Psychology & Aging 1989; 4(2):235–244.
41. Jones DR. Meta-analysis of observational epidemiological studies: a review. Journal of the Royal Society of Medicine 1992; 85(3):165–168.
42. Greenland S. A critical look at some popular meta-analytic methods. American Journal of Epidemiology 1994; 140(3):290–296.
43. Platt RW, Leroux BG, Breslow N. Generalized linear mixed models for meta-analysis. Statistics in Medicine 1999; 18:643–654.
44. Engel B. A simple illustration of the failure of PQL, IRREML and APHL as approximate ML methods for mixed models for binary data. Biometrical Journal 1998; 40:141–154.
45. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. Chapman & Hall: London, 1996.
46. Laird NM. Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 1978; 73:805–811.
47. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 1977; 39(1):1–38.
48. Bohning D, Schlattman P, Lindsay B. Computer-assisted analysis of mixtures. Biometrics 1992; 48:283–304.
49. Eilers PHC, Marx BD. Flexible smoothing using B-splines and penalized likelihood. Statistical Science 1996; 11:89–121.
50. Berkey CS, Hoaglin DC, Antczak-Bouckoms A, Mosteller F, Colditz GA. Meta-analysis of multiple outcomes by regression with random effects. Statistics in Medicine 1998; 17:2537–2550.
51. Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine 1997; 16:1965–1982.
52. Gail MH, Pfeiffer R, van Houwelingen HC, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics 2000; 1(3):231–246.
53. Antczak-Bouckoms A, Joshipura K, Burdick E, Tulloch JFC. Meta-analysis of surgical versus non-surgical method of treatment for periodontal disease. Journal of Clinical Periodontology 1993; 20:259–268.
54. Sutton AJ, Lambert PC, Hellmich M, Abrams KR, Jones DR. Meta-analysis in practice: a critical review of available software. In Meta-Analysis in Medicine and Health Policy, Berry DA, Stangl DK (eds). Marcel Dekker: New York, 2000.
55. Pinheiro JC, Bates DM. Mixed-effects Models in S and S-Plus. Springer-Verlag: Berlin, 2000.
56. Sharp S, Sterne J. Meta-analysis. Stata Technical Bulletin 1997; 38(sbe 16):9–14.
57. Sharp S. Meta-analysis regression. Stata Technical Bulletin 1998; 42(23):16–22.
58. Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to random-effects meta-analysis: a comparative study. Statistics in Medicine 1995; 14:2685–2699.
Part III MODELLING GENETIC DATA: STATISTICAL GENETICS
GENETIC EPIDEMIOLOGY: A REVIEW OF THE STATISTICAL BASIS. E. A. Thompson
GENETIC MAPPING OF COMPLEX TRAITS. J. M. Olson, J. S. Witte and R. C. Elston
TUTORIAL IN BIOSTATISTICS
A statistical perspective on gene expression data analysis
Jaya M. Satagopan and Katherine S. Panageas
Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, U.S.A.
SUMMARY
Rapid advances in biotechnology have resulted in an increasing interest in the use of oligonucleotide and spotted cDNA gene expression microarrays for medical research. These arrays are being widely used to understand the underlying genetic structure of various diseases, with the ultimate goal of providing better diagnosis, prevention and cure. This technology allows for measurement of expression levels from several thousand genes simultaneously, thus resulting in an enormous amount of data. The role of the statistician is critical to the successful design of gene expression studies, and the analysis and interpretation of the resulting voluminous data. This paper discusses hypotheses common to gene expression studies, and describes some of the statistical methods suitable for addressing these hypotheses. S-plus and SAS codes to perform the statistical methods are provided. Gene expression data from an unpublished oncologic study is used to illustrate these methods. Copyright © 2003 John Wiley & Sons, Ltd.
KEY WORDS:
class discovery; class prediction; hierarchical clustering; multiple correction; classification; Fisher's linear discriminant function; compound covariate predictor
1. INTRODUCTION
A major milestone in medical research has been the sequencing of the human genome, which has identified more than 30 000 gene sequences. DNA (deoxyribonucleic acid) microarrays, arrays of a large set of DNA sequences on a glass or nylon substrate, offer the potential to investigate the expression of several thousand genes from blood or tissue samples. This technology is critical for identifying gene subsets associated with various diseases, which is the major goal of clinical genomic studies. Targets thus identified will be pathologically validated to determine specific candidates. These candidates will be further studied for improved diagnosis and drug development. Since information is obtained from thousands of genomic
Correspondence to: Jaya M. Satagopan, Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, U.S.A. E-mail: [email protected]
Contract/grant sponsor: National Institutes of Health; contract/grant numbers: GM60457, CA84499.
regions, this technology results in an enormous amount of data and statistical assumptions may be violated. The purpose of this paper is to provide statisticians with a basic understanding of gene expression studies, and to provide S-plus and SAS computer codes to perform these analyses. Data from an unpublished oncologic study is used to illustrate these methods.
1.1. Genetic overview
Every cell nucleus contains the nucleic acids DNA and RNA (ribonucleic acid). DNA, the inheritable genetic material, forms the underlying structure of all organisms, and genes are sections of DNA. DNA consists of two strands, each of which is the complement of the other. The strands are made of a combination of four basic nucleotides (or bases): adenine (A); thymidine (T); cytosine (C); and guanine (G). A and T are complements of each other. In other words, the A (T) base of one DNA strand only binds with the T (A) base from the complementary strand (cDNA), and likewise for C and G. It is this complementarity principle that forms the foundation of DNA microarrays.
Genes are the units of heredity that are transmitted from parents to offspring. The instructions to make proteins are carried by genes in DNA. Proteins are the molecules that control structures and functions within cells. Each gene contains information to make one or more proteins, and different genes encode distinct proteins. To make proteins, the genetic information in DNA is transcribed to an intermediate product, mRNA (messenger RNA), in the nucleus of cells. Genetic instructions to manufacture proteins are then carried by mRNA from the nucleus for translation into proteins. Gene expression is measured as the amount of mRNA for a particular gene in cells.
For instance, human DNA contains more than 30 000 different genes. All these genes are present in DNA in every cell in the body. However, genes are expressed (that is, make mRNA) in different ways in different tissues in the body. As an example, consider genes expressed in the prostate gland. Some genes are expressed in the prostate and in many or all other tissues in the body (these are called housekeeping genes). Many genes are not expressed in the prostate, but are expressed in other tissues (for example, lung or skin). A few genes are expressed in the prostate but not other tissues, and in this case mRNA (and protein) encoded by these genes will only be present in the prostate cells (that is, high expression) but not in skin or lung cells (that is, no or low expression). In the case of cancer of the prostate, most of the genes that are present in normal prostate will also be expressed in the prostate cancer cells. However, certain genes will be overexpressed or underexpressed in the cancer cells compared to normal prostate cells. This altered gene expression contributes to the uncontrolled growth and spread of the tumour.
1.2. Microarrays
Microarrays are a tool to measure and investigate gene expressions. Every known gene sequence of interest is printed as a probe on a glass (or nylon) array. mRNA from a tissue or blood sample is tagged with fluorescent dyes and hybridized to the array probes. The complementarity principle, described above, allows the mRNA to bind to the appropriate probe sequences on the array. The fluorescence intensities of these probes correspond to the amount of mRNA per gene, and thus are measures of gene expression. Two types of arrays are most commonly utilized – arrays based on oligonucleotides, and those based on complementary DNA (spotted cDNA).
An oligonucleotide (or, briefly, oligo) is a short sequence of DNA, usually consisting of 25 base pairs. There are several underlying
differences in the gene expressions derived from these two types of arrays. Given below is a brief description of the two array types. Although these two types of arrays provide gene expressions using different methods, both can be used to address the same research questions and the same data analytic techniques can be applied.
1.2.1. Oligonucleotide array. Every gene is represented using approximately 16 to 20 subsections (oligos), each consisting of up to 25 base pairs. In order to measure gene expression, every oligo is represented using a perfect match (PM) and a mismatch (MM) probe. A PM is the exact complement of the subsection of the gene of interest, whereas the MM is the same complement with a base change in the central position (for example, the 13th position of a 25 base length probe array). cDNA from a single tissue or blood sample is hybridized to such an array. The average over the 16 to 20 subsections of the difference in intensities between the PM and MM pairs is a measure of the level of expression for that gene. A more technical description of oligonucleotide arrays can be found in reference [1].
1.2.2. Spotted cDNA array. In a spotted cDNA array, every gene is represented using a longer sequence (for example, between 200 and 500 bases). cDNA from two different samples, one being the test sample of interest and the other being a reference sample, are hybridized to a single array. The test sample is labelled with a red fluorescence dye, and the reference sample is labelled with a green dye. A measure of gene expression from a cDNA array is the logarithm of the ratio of the red and green intensities. Eisen and Brown [2] give a detailed description of the construction and uses of spotted cDNA arrays.
A notable difference between the two array types is that a single sample is hybridized on each oligo array, while two samples are hybridized on each cDNA array. A test sample of interest and a reference sample are hybridized together on each cDNA array. The reference sample can be chosen in several different ways. For example, the reference sample may correspond to a specific normal tissue, or a pool of several normal or test sample tissues. Thus, the gene expression obtained from a cDNA array is a relative measure. Alternatively, the gene expression obtained from an oligo array corresponds to one particular test sample and is not measured relative to any reference sample. We address measurement of gene expressions further in the Discussion section. Biologically relevant differences, clinical applications and future directions are discussed in reference [3].
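To make these two expression measures concrete, the short S-plus-style sketch below computes each one from raw probe intensities. The numerical values and the object names (pm, mm, red, green) are purely illustrative assumptions and are not produced by any particular image-analysis software.

# Oligonucleotide array: pm and mm hold the perfect match and mismatch
# intensities for the probe pairs representing one gene
# (five pairs are shown here purely for illustration).
pm <- c(820, 910, 760, 1010, 880)
mm <- c(300, 420, 280, 390, 350)
oligo.expression <- mean(pm - mm)            # average PM - MM difference

# Spotted cDNA array: red (test) and green (reference) intensities
# for one spot; the expression measure is the base 2 log-ratio.
red   <- 1500
green <- 900
cdna.expression <- log(red/green, base = 2)  # log2(test/reference)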
2. STATISTICAL QUESTIONS
Both oligo and cDNA arrays have been used in several gene expression studies. Golub et al. [4] used oligo arrays to identify genetic markers for classification between two leukaemia subtypes – AML and ALL. In a breast cancer study using cDNA arrays, Hedenfalk et al. [5] identified different groups of genes expressed by breast cancers with BRCA1 mutations and those with BRCA2 mutations. Recently a greater emphasis on the genetic aetiology of diseases has resulted in an overwhelming interest in the utilization of such arrays. Numerous published studies have used oligo or cDNA arrays to study gene expressions in various biological scenarios (for example, references [4–17]). Common research questions in such studies fall under the topics of class discovery, gene identification and class prediction. Class discovery refers to identifying previously unknown
sample subtypes, whereas class prediction methods are used to classify a new independent sample based on a prediction scheme. Gene identification corresponds to selecting overexpressed and underexpressed gene subsets from a given sample. We illustrate commonly applied statistical methods to address these questions using gene expression data from an oncologic study.
3. ILLUSTRATIVE EXAMPLE
Data from a cancer gene expression study conducted at Memorial Sloan-Kettering Cancer Center are described below. This study consists of 32 cancer patients: tissues from the primary disease site of 23 patients with no known metastasis at the time of tissue extraction, and tissues from the metastatic disease site of 9 patients. We will address the following specific aims:
1. Identify disease clusters based on the patients' gene expression, without utilizing any prior knowledge about the groups.
2. Identify differentially expressed genes by comparing the primary (P) versus metastatic (M) groups.
3. Identify genes that can discriminate between primary versus metastatic groups.
Expression data on approximately 63 000 genomic regions were obtained using five oligonucleotide chips for every patient. We use the term genes to refer to these genomic regions, although they may not correspond to unique gene sequences. These chips are labelled A, B, C, D and E, with 12 626 genes on each chip, of which 67 are control genes. Chip A consists of known genes and well-characterized expressed sequence tags (ESTs) which are small portions of only the active part of the genes. Chips B to E contain less well-studied genes and ESTs. Although gene expressions were obtained from all five chips, we illustrate data analyses using data from only one chip.
3.1. Class discovery
Gene expressions in the experimental samples have a complex multivariate relationship. Exploratory graphical methods such as cluster analysis and multi-dimensional scaling are used to gain insight into these multivariate patterns [18, 19]. Under these methods, samples are partitioned into clusters based on some similarity measure calculated from the gene expression vectors. The term 'class discovery' is used to refer to such exploratory analyses.
3.1.1. Cluster analysis. The goal of cluster analysis is to obtain groupings or clusters of similar samples. This is accomplished by using a distance measure derived from the multivariate gene expression data that characterizes the 'distance' of the patients' expression patterns with each other. Clusters are formed based on the patients' gene expressions, without utilizing any prior knowledge about the predefined groups (such as, for example, a known pathological tumour classification). This is referred to as 'unsupervised analysis'. The choice of the distance measure includes Euclidean distance, one minus Pearson's correlation coefficient, and squared difference between the means. Cluster analysis falls under two major categories depending upon whether or not the number of desired groupings or clusters is prespecified. Under hierarchical clustering the number of clusters is unspecified and generated from the observed
data. Alternatively, with K-means clustering, the number of clusters (K) is specified a priori. These methods are illustrated below.
Hierarchical Clustering
Hierarchical clustering results in groups of related samples that can be visualized with a dendrogram, a tree-based two-dimensional plot. Clusters are formed in an iterative manner using a bottom-up approach (agglomerative method) based on the distance measure of choice. The clustering algorithm is outlined below. Let N be the total number of samples. C(.) denotes a cluster consisting of the samples contained in the parentheses:
1. Begin with N clusters, where every patient sample is a unique cluster.
2. Obtain an N × N symmetric distance matrix D, where the component dij is the distance measure between samples i and j. This distance measure is calculated using the gene expression vectors of samples i and j. One minus Pearson's correlation coefficient and the Euclidean distance between the vectors are commonly used.
3. Form a new cluster C(i, j) from clusters C(i) and C(j) that have the smallest distance dij.
4. Create a new (N − 1) × (N − 1) distance matrix. The updated distance between C(i, j) and the remaining clusters is calculated using the single, complete or average linkage approach.
5. Repeat steps 3 and 4 until all the samples are merged into one single cluster. This will require at most N − 1 iterations.
One of the following approaches can be used to update the distance matrix in step 4:
Single linkage, the minimum distance between pairs of clusters: dC(i,j),C(k) = min{dik, djk}
Complete linkage, the maximum distance between pairs of clusters: dC(i,j),C(k) = max{dik, djk}
Average linkage, the average distance between pairs of clusters: dC(i,j),C(k) = (Σ dab) / (nij nk), the sum being over samples a in C(i, j) and b in C(k),
where the numerator is the sum of all between-cluster distances. The terms nij and nk in the denominator correspond to the number of samples in clusters C(i, j) and C(k).
The S-plus code for average linkage clustering and for plotting the corresponding dendrogram is as follows:

average.cluster <- hclust(distance.dat, method="average")
plclust(average.cluster)

This code uses a data set called distance.dat containing the N × N distance matrix for the initial calculations in step 2. The hclust function performs the clustering algorithm. The option method="average" specifies the average linkage approach to be used. To specify single linkage, use method="connected", or to specify complete linkage, use method="compact".
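The construction of distance.dat itself is not shown above. One way it might be built from the 32 × 12 626 expression matrix cancer.data (the object used in the K-means example below) is sketched here, using one minus Pearson's correlation between samples; depending on the S-plus or R version, the resulting square matrix may need to be converted with as.dist() before being passed to hclust.

# cancer.data: rows = 32 samples, columns = 12626 genes.
# Pearson correlation between samples is the correlation of the rows,
# so the matrix is transposed before calling cor().
sample.cor   <- cor(t(cancer.data))   # 32 x 32 correlation matrix
distance.dat <- 1 - sample.cor        # one minus Pearson correlation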
Figure 1. Dendrogram from average linkage hierarchical clustering of 32 cancer samples.
The results, consisting of the distance between the samples and the order in which the clusters are combined, are stored as an object called average.cluster. The plclust function creates a dendrogram from the average.cluster object.
Figure 1 shows the dendrogram for the 32 cancer samples. The 23 primary samples are denoted P1, ..., P23, and the nine metastatic samples are denoted M24, ..., M32. The distance.dat object chosen here is a 32 × 32 matrix of one minus Pearson correlation coefficients, and the method="average" option was used. This dendrogram can be interpreted in the following manner. The vertical axis represents the between-cluster distance calculated using the average linkage method. Since one minus Pearson correlation was used, the range of the vertical axis can span from 0 to 2. Starting from the bottom of the dendrogram, the height of the bars indicates at what point two clusters are merged. For example, the primary samples P13 and P14 were merged to form the first cluster, P1 and P6 formed the second cluster, etc. Overall it appears that two general clusters are emanating from this dendrogram: one consisting of eight primary samples P7, P18, P17 down to P23, and the second consisting of the remaining 24 samples. Within the second cluster, the nine metastatic samples and the 15 primary samples are clustered in separate branches. These might be interpreted as separate subclusters. Results from such a dendrogram would then trigger further investigations regarding clinically relevant differences between these clusters.
K-means Clustering
K-means clustering [18] is an alternative clustering algorithm where the number of clusters (K) is prespecified arbitrarily. Similar to hierarchical clustering, clusters are formed in an iterative manner based on a distance measure. The algorithm to assign the samples to the K clusters is outlined below:
1. Partition the samples arbitrarily into K initial clusters.
2. Calculate the vector of centroids for every cluster. The centroid is defined as the average expression value for every gene across all the samples in that cluster.
3. Calculate the distance between every sample and every centroid. Re-assign a sample to the cluster with the shortest distance.
4. Form new clusters after all possible re-assignments, and recalculate the centroids of the newly created clusters.
5. Repeat steps 2–4 until no re-assignment takes place.
Given below is S-plus code to perform K-means clustering:

centroid1    <- apply(cancer.data[1:16,], 2, mean)
centroid2    <- apply(cancer.data[17:32,], 2, mean)
centroids    <- rbind(centroid1, centroid2)
kmeans.2clus <- kmeans(cancer.data, centroids)
This code uses a data set called cancer.data which is a matrix with 32 rows and 12 626 columns containing the gene expression values for all samples. Here we have arbitrarily chosen K = 2 and divided the 32 samples into two clusters of equal size, with the first 16 samples (cancer.data[1:16,]) in cluster 1 and the remaining samples (cancer.data[17:32,]) in cluster 2. The apply function creates the two centroid vectors called centroid1 and centroid2. These vectors contain the average expression level for every gene from the samples within each cluster. The rbind function creates a 2 × 12 626 matrix called centroids combining these two vectors. The kmeans function performs the clustering algorithm using this information by setting K as the number of rows of the centroids matrix, which is 2 in this example. The arguments required for the kmeans function are the data set cancer.data and the initial matrix centroids.
The results of the kmeans function are stored in the object kmeans.2clus, which contains the component cluster, a vector of size N (number of samples). This component provides the cluster membership of every sample. For this example, cluster is the following vector of length 32: (2 2 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2). The values 1 and 2 denote the cluster identifiers of the samples (P and M, respectively) obtained from the K-means algorithm. Recall that the first 23 are primary samples, and the final nine are metastatic samples. Fifteen primary samples formed cluster 1. All nine metastatic samples are within cluster 2 along with eight primary samples. As expected, we see overlap of these results with the dendrogram (Figure 1) in the following manner. The metastatic samples cluster with eight of the same primary samples as in the dendrogram. The primary samples P7, P18, P17 down to P23 which formed a cluster in the dendrogram are also contained in cluster 1 in this analysis.
Since the choice of K is arbitrary, several choices should be investigated when performing this clustering algorithm. The best choice of K can be determined based on an F-statistic [19]. The F-statistic is the ratio of the between- to the within-cluster mean square errors. The F-values corresponding to different choices of K can be compared, and the K with the largest F-value is then declared to be the appropriate number of clusters. Table I gives the F-values for the illustrative example when K = 2, 3 and 4. It is evident that the appropriate number of clusters is 2.
Table I. F-statistics for various choices of K used in a K-means clustering algorithm.

K    F-statistic
2    89.0
3    0.80
4    0.39
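As a rough indication of how such an F-statistic might be obtained, the sketch below computes the between- and within-cluster sums of squares directly from the expression matrix and the cluster labels returned by kmeans (here kmeans.2clus$cluster, so K = 2). It illustrates the idea only and is not necessarily the exact calculation behind Table I.

# cancer.data: N x 12626 expression matrix; labels: cluster membership.
labels <- kmeans.2clus$cluster
K <- length(unique(labels))
N <- nrow(cancer.data)
overall.mean <- apply(cancer.data, 2, mean)

ss.within  <- 0
ss.between <- 0
for(k in unique(labels)) {
   members      <- cancer.data[labels == k, , drop = FALSE]
   cluster.mean <- apply(members, 2, mean)
   ss.within    <- ss.within  + sum(sweep(members, 2, cluster.mean)^2)
   ss.between   <- ss.between + nrow(members) * sum((cluster.mean - overall.mean)^2)
}
F.stat <- (ss.between/(K - 1)) / (ss.within/(N - K))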
3.1.2. Multi-dimensional scaling. Multi-dimensional scaling (MDS) is a data reduction technique that can be used to visually identify the underlying structure of the data. It is used in an exploratory manner to gain initial insight into sample groupings. The data required for MDS are the N(N − 1)/2 pairwise distances between the N samples. A configuration of the distances in a lower dimension is obtained such that the distance between the samples in this dimension closely matches the original pairwise distances. A graphical display of the results allows one to identify any data patterns and sample groupings without specification of the desired number of groups, analogous to hierarchical clustering.
Initially every sample is represented as a 12 626-dimensional vector of gene expressions. The goal of MDS is to reduce this to a q-dimensional vector (q < 12 626), and plot the samples in this smaller dimension to identify any underlying pattern in the data. To achieve this, MDS considers two distance matrices – the first is a matrix of distances between the samples based on the 12 626-dimensional vector of gene expressions, and the second is a matrix of distances based on the q-dimensional vector, for a fixed q. This q-dimensional reduction of the gene expressions is derived such that the following two properties are satisfied: (a) the rank-ordering of the distances obtained from the q-dimensional reduction of the data is the same as that obtained from the original 12 626-dimensional gene expressions; and (b) stress, defined as the sum of squared differences between the two distance measures relative to the sum of squared distances obtained from the q-dimensional reduction, is minimized. For a fixed dimension size q, the specific steps are detailed below:
1. Compute the N(N − 1)/2 pairwise distances (dij) between the N samples using a distance metric. A commonly used distance measure between samples i and j is one minus Pearson correlation between their gene expression vectors.
2. For every sample, obtain a configuration of q-dimensional vectors. Calculate distances, dˆij, in q dimensions to minimize the stress, defined as S²(d, dˆ) = Σi≠j (dij − dˆij)² / Σi≠j dˆij². Furthermore, the distances dˆij must be chosen such that the ordering of the samples based on dˆij is the same as that using dij.
3. Plot the q-dimensional configuration of the N samples.
Note that when Euclidean distance between the samples is used in step 1, MDS is equivalent to principal component analysis on the gene expression data. The clustering relationship between the samples is then assessed from the resulting figures. A SAS program to perform these computations is given in the following:

proc mds data=distance dimension=q out=mdsout;
   id sampleid;
run;
Figure 2. Two-dimensional plot from MDS. The labels P1 to P23 denote the 23 primary samples and M24 to M32 correspond to the nine metastatic samples.
The SAS procedure proc mds uses a data set called distance which is an N × N matrix of pairwise distances between the samples. The argument dimension=q specifies the number of dimensions (q) to be used in the MDS. Use dimension=2 and dimension=3 for two and three dimensions, respectively. The argument out defines the output data set which has been named mdsout. The id statement specifies the descriptive labels for the N samples. The SAS macro plotit, given below, plots the output of proc mds (mdsout) in two dimensions.

%plotit(data=mdsout, datatype=mds, labelvar=sampleid, vtoh=1.75);
run;
The argument datatype=mds specifies that mdsout has been obtained from proc mds. The labelvar command specifies the sample identifier to be used in the graph. The graphical command vtoh specifies the plotting scale. Figure 2 shows the results of MDS in two dimensions. The axes Dimension1 and Dimension2 specify the distances dˆij, and are calculated by proc mds to minimize the stress S(d, dˆ) in step 2.
Three-dimensional plots can be obtained using the proc g3d procedure in SAS in the following manner:

data mdsout;
   set mdsout;
   if _type_='CONFIG';
   length colorval $8 shapeval $8;
   if sampletype="P" then do; shapeval='club'; colorval='blue'; end;
   if sampletype="M" then do; shapeval='diamond'; colorval='green'; end;
run;
proc g3d data=mdsout;
   scatter dim1*dim3=dim2 / grid color=colorval shape=shapeval;
   scatter dim1*dim3=dim2 / color=colorval shape=shapeval;
run;
First, a data step procedure is used to add shape and colour values (shapeval and colorval, respectively) to the MDS output. Here we have specified a blue club to identify primary cancer samples, and a green diamond to identify metastatic samples using the two if statements in the data step procedure. In the presence of more than two sample groups additional shape and colour values must be specified. The proc g3d procedure then uses the resulting augmented MDS output to create a three-dimensional plot. The two scatter statements in proc g3d specify the three axes to plot the grid and the data points. Figure 3 shows the results of MDS in three dimensions. As before, the axes Dimension1, Dimension2 and Dimension3 specify the distances dˆij, such that the stress S(d, dˆ) is minimized in three dimensions.
From Figure 2 one can conclude that, in general, the primary samples are grouped more closely together as a cluster. The metastatic samples are not grouped with the primary samples. Further, the metastatic samples are more scattered and hence appear to be heterogeneous. Recall from Section 3 that these tissues are obtained from the various metastatic sites of the nine samples, which could contribute to the heterogeneity. However, consistent with results from hierarchical and K-means clustering, the primary samples P7 down to P23 appear to be the farthest from the metastatic samples. The three-dimensional plot of Figure 3 provides additional evidence that the primary samples are grouped together and are separated from the metastatic samples.
3.2. Gene identification
The methods described under 'Class discovery' provide exploratory evidence regarding the complex genetic relationship between the samples. The 'gene identification' methods discussed in this section provide a statistical approach to identify gene subsets (for example, a few hundred) that are differentially expressed in various disease groups. Several graphical displays to identify genes with differential expression specific to data obtained from cDNA arrays have been discussed by Dudoit et al. [20]. We refer to this further in the Discussion section. Unlike the class discovery methods, the class membership of every sample is assumed to be known, and is utilized in these analyses.
To determine whether there is an association between the expression level of a single gene and disease group, a test statistic such as a two-sample t-test or Wilcoxon Mann–Whitney test could be calculated in the presence of two groups. This test statistic will be calculated for every gene, and genes ranked based on the observed test statistic or the corresponding p-value. A subset of the top ranked or significantly differentially expressed genes will be identified for further study. Owing to multiple testing, appropriate p-value adjustments are essential to obtain the statistical significance of every test. Several methods to correct for multiple testing are available; however, approaches such as the Bonferroni correction may be too conservative for data from these studies. Since it is likely that genes are correlated with each other, resampling-based methods are more appropriate to obtain corrected p-values. Several resampling-based methods to correct for multiple testing have been discussed by Westfall and Young [21] and Westfall and Wolfinger [22].
Figure 3. Three-dimensional plot from MDS. The 23 primary samples are represented by clubs, and the nine metastatic samples by diamonds.
Gene identification and ranking could be based on these corrected p-values. The following is SAS code to obtain adjusted p-values using this method:

proc multtest perm data=analysis out=adjp;
   class group;
   test mean(gene1-gene12626);
   contrast "P vs M" -1 1;
run;

This code uses a data set called analysis with one record per patient. It comprises the following variables: gene1 through gene12626 consisting of the actual gene expressions, and group referring to the disease group P or M.
Table II. Top 20 differentially expressed genes of primary versus metastatic cases. The first column gives the gene identifier. The corresponding t-statistics, unadjusted, and adjusted p-values are given in columns 2, 3 and 4, respectively.

Gene identifier    t-statistic    Unadjusted p-value    Adjusted p-value
Gene 5546          −8.85          7.2 × 10^−10          2.5 × 10^−4
Gene 10692         −8.39          2.3 × 10^−9           2.5 × 10^−4
Gene 3845           8.15          4.3 × 10^−9           2.5 × 10^−4
Gene 3437          −7.97          6.8 × 10^−9           3.5 × 10^−4
Gene 4743           7.87          8.8 × 10^−9           3.5 × 10^−4
Gene 12561         −7.53          2.1 × 10^−8           7.5 × 10^−4
Gene 6530          −7.37          3.3 × 10^−8           7.5 × 10^−4
Gene 4734          −7.36          3.4 × 10^−8           7.5 × 10^−4
Gene 7621          −7.12          6.5 × 10^−8           1.4 × 10^−3
Gene 10867         −6.98          9.3 × 10^−8           1.8 × 10^−3
Gene 3000          −6.97          9.7 × 10^−8           1.9 × 10^−3
Gene 7700           6.96          1.0 × 10^−7           1.9 × 10^−3
Gene 1739          −6.94          1.1 × 10^−7           2.0 × 10^−3
Gene 10314         −6.86          1.3 × 10^−7           2.2 × 10^−3
Gene 11130         −6.82          1.5 × 10^−7           2.3 × 10^−3
Gene 8923          −6.78          1.6 × 10^−7           2.5 × 10^−3
Gene 5327          −6.76          1.7 × 10^−7           2.5 × 10^−3
Gene 7015          −6.75          1.7 × 10^−7           2.6 × 10^−3
Gene 8656           6.73          1.8 × 10^−7           2.7 × 10^−3
Gene 4297          −6.56          3.0 × 10^−7           4.6 × 10^−3
An output data set called adjp is created that contains the unadjusted (from two-sample t-test) and adjusted p-values. The perm option indicates that the resampling is based on permutation; the class statement identifies the group variable; mean identifies the continuous gene expression variables; and the contrast statement specifies the comparison of interest.
Table II provides the top 20 genes, their t-statistics and the corresponding unadjusted and adjusted p-values based on the resampling method of Westfall and Wolfinger for the comparison of P versus M. It is important to note that the ranking of genes is invariant to p-value adjustments, but the number of significant genes obtained after adjustment will be smaller. Further, the genes selected will likely be correlated. For these data, a total of 631 (12 626 × 0.05) and 126 (12 626 × 0.01) genes are expected to be significant at the 5 per cent and 1 per cent levels, respectively. A total of 3009 (1530) genes were significant without adjustment to the p-value, as opposed to only 83 (35) at the 5 per cent (1 per cent) level after adjustment.
Identifying differentially expressed genes for more than two groups can be performed in the same manner using ANOVA or the Kruskal–Wallis test. However, the proc multtest procedure given above is not applicable for this scenario. SAS programming guidelines for multiple comparison correction in ANOVA can be found in reference [23].
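For readers working in S-plus rather than SAS, the per-gene two-sample t-statistics underlying Table II could be computed along the following lines. This is a sketch only: it produces unadjusted p-values from an equal-variance t-test, and it assumes the 32 × 12 626 matrix cancer.data used earlier together with a vector group of length 32 holding the labels "P" and "M".

# Two-sample t-statistic and unadjusted p-value for every gene.
n.genes <- ncol(cancer.data)
t.stat  <- rep(0, n.genes)
p.val   <- rep(0, n.genes)
for(g in 1:n.genes) {
   test      <- t.test(cancer.data[group == "P", g],
                       cancer.data[group == "M", g], var.equal = TRUE)
   t.stat[g] <- test$statistic
   p.val[g]  <- test$p.value
}
ranked <- rev(order(abs(t.stat)))   # gene indices, most significant first
top20  <- ranked[1:20]              # compare with Table II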
3.3. Class prediction
Class prediction methods are used to derive a classification rule that can discriminate between groups of observations. Thus class prediction is, in essence, discriminant analysis, applied in the context of a very large number of variables relative to samples. The classification rule is based on differentially expressed genes identified by applying the methods of Section 3.2, and is used to determine the class of new samples. The classification rule is derived from a 'training' data set and is then used to classify a 'test' data set. Ideally two independent and comparable data sets should be used for class prediction. Different methods can be used to develop classification rules, such as Fisher's linear discriminant function [19], the compound covariate predictor function [5], or quadratic classification functions. In addition, regardless of the specific classification method, any subset of differentially expressed genes can be used to obtain a rule. For example, the best differentially expressed gene (for example, from Table II, gene 5546 only), or the top T genes for some pre-specified T can be used (for example, the top 5 genes from Table II). Here we describe Fisher's linear discriminant function and the compound covariate predictor function for class prediction.
Fisher's linear discriminant function uses the predetermined subset of T gene expressions from the training data set to form a linear statistic that best discriminates between two groups. The optimal classification function is defined as the largest standardized difference between the linear combination of gene expressions from any two separations of the training data set. This method does not assume that the gene expressions are normally distributed. However, the two underlying groups are assumed to come from populations with a common variance, and hence a pooled variance is used for standardization. A classification rule using the training data set is defined as follows.
Suppose we have selected a total of T candidate genes. Let git = (1/ni) Σj=1..ni gijt denote the mean expression level of the tth gene (t = 1, ..., T) in the ith group (i = 1, 2) consisting of ni samples (n1 + n2 = N). For a subset containing T genes, let giT = (gi1, gi2, ..., giT)′ denote the vector of T mean gene expressions for the ith group and let ST be the estimated pooled covariance matrix. Then

   dT = (1/2) (g1T − g2T)′ ST⁻¹ (g1T + g2T)          (1)

is the standardized midpoint between the two population means based on T genes. A sample from the test data set, x1, comprising expression levels from the T genes, is classified to group 1 if

   (g1T − g2T)′ ST⁻¹ x1 > dT          (2)

and classified to group 2 otherwise. The rank of the matrix ST is min(N − 2, T). Therefore, the number of candidate genes that can be used to derive the discriminant function is at most N − 2. The following is S-plus code to perform class prediction when T = 5:

linear.function <- discrim(group.id ~ gene1 + gene2 + gene3 + gene4 + gene5,
                           data=train.dat)
predict.group   <- predict.discrim(linear.function, newdata=test.dat)
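For intuition, equations (1) and (2) can also be evaluated directly with elementary matrix operations rather than through the built-in discriminant functions. The sketch below assumes an N × T matrix expr of the selected gene expressions, a vector group of labels "P" (group 1) and "M" (group 2), and a length-T expression vector x1 for a new sample; all of these object names are illustrative.

# Direct evaluation of equations (1) and (2) for two groups.
g1 <- apply(expr[group == "P", ], 2, mean)            # mean vector, group 1
g2 <- apply(expr[group == "M", ], 2, mean)            # mean vector, group 2
n1 <- sum(group == "P")
n2 <- sum(group == "M")
S1 <- var(expr[group == "P", ])                        # within-group covariance matrices
S2 <- var(expr[group == "M", ])
ST <- ((n1 - 1)*S1 + (n2 - 1)*S2) / (n1 + n2 - 2)      # pooled covariance matrix

dT    <- 0.5 * t(g1 - g2) %*% solve(ST) %*% (g1 + g2)  # equation (1)
score <- t(g1 - g2) %*% solve(ST) %*% x1               # left-hand side of equation (2)
classify.to.group1 <- (score[1, 1] > dT[1, 1])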
The function discrim uses a data set called train.dat which is an N × (T + 1) matrix with the rows corresponding to the N samples, and the first T columns, named gene1, gene2, ..., geneT, corresponding to the expression levels of the predetermined T genes. The last column is the known group identifier (P or M), and is called group.id. The output object, denoted linear.function, contains Fisher's linear discriminant function. The function predict.discrim calls this object, and classifies the test data set, denoted test.dat. The data set test.dat is an N* × T matrix containing the expression values of the same T genes for an independent set of N* test samples. For example, when T = 5, the columns of train.dat are gene1, gene2, gene3, gene4, gene5 and group.id. The columns of test.dat are the same, excluding the last group.id column. The component group of the output object predict.group specifies the group allocation predicted from the linear discriminant function. Since we have assumed T = 5 in this example, we have used the command gene1 + gene2 + gene3 + gene4 + gene5 in the discrim function to derive the linear discriminant function. For example, these five genes would be the top genes selected using the methods described in Section 3.2 (Table II), that is, genes numbered 5546, 10692, 3845, 3437 and 4743 in the data set containing all 12 626 genes.
The accuracy of the classification rule, and the precision in classification using a prespecified number of T genes, can be evaluated based on the misclassification rate, defined as the probability of incorrect assignment of the test samples. As stated earlier, independent and comparable training and test data sets are highly desirable for class prediction. In the absence of an independent test set, the techniques of cross-validation or jack-knife can be applied. Cross-validation would be appropriate in the presence of a reasonably large data set. A random sample of approximately two-thirds of the observations can be set to be the training data set. The remaining one-third of the observations can then be used as the test set. Alternatively, a jack-knife (or leave-one-out) method can be used with a small to moderately sized data set. Since our example contains only 32 samples, we illustrate class prediction using a jack-knife approach.
The jack-knife approach involves leaving out one sample at a time, setting this aside as the test sample. Applying the methods of Section 3.2, differentially expressed genes are identified and ranked using the remaining 31 samples. A classification rule is derived from these 31 samples using the best T genes, and the test sample is then classified as belonging to the P group or the M group. The classification error for this test sample is recorded as 1 if the sample is incorrectly classified (that is, a P sample is incorrectly classified as M, and vice versa), and 0 if it is correctly classified. This process is repeated by leaving out each of the subsequent 31 samples one at a time. The misclassification rate is then calculated as the proportion of the 32 'test' samples that have been incorrectly classified (proportion of 1s) using the gene subsets selected from the corresponding leave-one-out step. For illustrative purposes, we selected the top T = 1, 5, 8, 10, 15 and 20 differentially expressed genes, and used the discrim and predict.discrim functions for classification for every choice of T. Table III shows the misclassification rates using the jack-knife approach and Fisher's linear discriminant function.
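A skeleton of the leave-one-out calculation just described is given below for a fixed choice of T = 5 genes. It reuses the discrim and predict.discrim functions introduced above and the cancer.data matrix and group vector assumed earlier; the gene-selection step is abbreviated to a call to a hypothetical helper select.top.genes(), taken to return the column indices of the T most significant genes (for example, by applying the t-test loop of Section 3.2 to the 31 retained samples).

n.top <- 5                               # T in the notation of the text
N     <- nrow(cancer.data)
error <- rep(0, N)
for(i in 1:N) {
   # hold out sample i and select genes from the remaining samples only
   train.expr  <- cancer.data[-i, ]
   train.group <- group[-i]
   top <- select.top.genes(train.expr, train.group, n.top)   # hypothetical helper
   train.dat <- data.frame(train.expr[, top])
   names(train.dat) <- paste("gene", 1:n.top, sep = "")
   train.dat$group.id <- train.group
   test.dat <- data.frame(t(cancer.data[i, top]))
   names(test.dat) <- paste("gene", 1:n.top, sep = "")
   fit  <- discrim(group.id ~ gene1 + gene2 + gene3 + gene4 + gene5, data = train.dat)
   pred <- predict.discrim(fit, newdata = test.dat)
   error[i] <- as.numeric(pred$group != group[i])             # 1 if misclassified
}
misclassification.rate <- sum(error)/N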
Using a subset of T = 1 gene resulted in a high misclassification rate. In other words, using the top ranked gene alone is not sufficient for classifying a new sample. Using the top T = 8 genes resulted in the lowest misclassification rates among those considered.
As mentioned earlier, several other classification rules can be constructed. The compound covariate predictor function is similar to the Fisher's linear discriminant function. A weighted
Table III. Misclassification rates of Fisher's linear discriminant function for various prespecified values of T. Misclassification rates are calculated using the jack-knife method using 23 primary and nine metastatic samples.

T     Misclassification rate (per cent)
1     6/32 (19%)
5     3/32 (9%)
8     1/32 (3%)
10    2/32 (6%)
15    2/32 (6%)
20    2/32 (6%)
linear combination of the T genes is obtained for every sample, where the weights are based on a test statistic that compares the expression level of the genes across groups. For example, in the presence of two groups the expression levels of the T genes can be weighted by their corresponding t-statistic values. This results in a vector of length N (sample size of the training set), where each value is the weighted sum of the expression levels of the T genes for the training set sample. The compound covariate predictor function is a linear discriminant function using these values, and is analogous to Fisher's linear discriminant function with T = 1. Since weighted linear combinations of the T genes are considered for every sample, there is no restriction on the total number of genes T that can be used to derive a compound covariate predictor function. This, however, is not the case with Fisher's linear discriminant function, where only T ≤ N − 2 genes can be used. Built-in functions for the compound covariate predictor method are not available in S-plus or SAS. However, an S-plus function can be written as follows:

weighted.sum.expr <- rep(0, N)
for(i in 1:N){
   weighted.sum.expr[i] <- sum(genes[i,] * t.values)
}
train.dat <- data.frame(gene = weighted.sum.expr, group.id = class.id)

The rep function is used to initialize the vector weighted.sum.expr of length N. This vector is then created as a weighted sum of the gene expression values of the N samples, weighted by the t-statistic values of the T genes. This is calculated using the N × T matrix of gene expressions called genes, and a vector t.values of length T containing the t-statistics corresponding to the T genes. The rows and columns of the matrix genes correspond to the N samples and the T genes, respectively. The data set train.dat is created using the data.frame function. The first column of this data set is named gene, and contains the newly created vector weighted.sum.expr. The second column, named group.id, contains the already existing class identifiers class.id. The discrim function can be used to create the linear discriminant function. After creating a corresponding test data set, the predict.discrim function can be used to classify it. Misclassification rates can be calculated as described earlier. For the oncologic example, compound covariate functions were calculated for the top T = 1, 5, 8, 10, 15 and 20 genes selected using the methods of Section 3.2 (and given in Table II). The misclassification rates are given in Table IV.
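A corresponding test data set can be formed in the same way. The sketch below fits the one-dimensional discriminant on the compound covariate and then scores a matrix test.genes holding the expression values of the same T genes for a set of new samples; cc.function and test.genes are illustrative names and are not objects defined in the text.

# Fit the discriminant on the compound covariate of the training data,
# then score test samples with the same T genes and t-statistic weights.
cc.function <- discrim(group.id ~ gene, data = train.dat)

weighted.sum.test <- rep(0, nrow(test.genes))
for(i in 1:nrow(test.genes)) {
   weighted.sum.test[i] <- sum(test.genes[i, ] * t.values)
}
test.dat      <- data.frame(gene = weighted.sum.test)
predict.group <- predict.discrim(cc.function, newdata = test.dat)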
Table IV. Misclassification rates based on compound covariate predictor function for various prespecified values of T. Misclassification rates are calculated using the jack-knife method using 23 primary and nine metastatic samples.

T     Misclassification rate (per cent)
1     6/32 (19%)
5     4/32 (12.5%)
8     3/32 (9%)
10    3/32 (9%)
15    3/32 (9%)
20    3/32 (9%)
As expected, we see an overlap of these results with those using Fisher's linear discriminant function. Classification using the top gene alone (T = 1) resulted in the highest misclassification rate of 19 per cent. Further, the minimum number of genes required to obtain the lowest misclassification rate is T = 8. In this example, misclassification rates tended to be higher with the compound covariate predictor than with Fisher's linear discriminant function.
4. DISCUSSION
The purpose of this paper is to familiarize statisticians with the basic premise of gene expression studies, and to review some of the commonly used statistical techniques to analyse these data, providing S-plus (version 6.0) and SAS (version 8.0) codes where appropriate. The methods discussed here are useful for addressing general hypotheses such as those mentioned in Section 3. Specific hypotheses and endpoints could vary for different studies and hence careful consideration must be given to the statistical methods applied to individual studies. In addition, for these studies attention must be given to sample ascertainment, experimental design, and the underlying assumptions prior to analyses since: (a) gene evaluation on a large number of individuals is frequently unfeasible due to array costs; and (b) the number of genes evaluated is substantially larger than the number of samples, thus violating standard statistical modelling assumptions. As a result of this, the use of gene expression arrays has triggered much methodological research in the areas of experimental design, expression value estimation and multiple comparison methods.
Since oligonucleotide and spotted cDNA arrays are fundamentally different (Section 1.2), experimental design and methods to estimate gene expressions are specific to each of these array types. Complete block designs and loop designs have been proposed by Kerr and Churchill [24, 25] to identify variations due to several factors (for example, pin effect, dye effect and chip effect) in cDNA arrays. Related software written in Matlab can be obtained from http://www.jax.org/research/churchill/software/anova/index.html. The need for replicate arrays to identify differentially expressed genes has been discussed by Pan et al. [26]. Kerr et al. [27] discuss an analysis of variance approach to identify differentially expressed genes from data with replicate samples.
Valid measurement of gene expressions is becoming a topic of intense research for both spotted cDNA and oligonucleotide array experiments. Chen et al. [28] used the logarithm
of the red to green intensity ratio as a measure of expression for cDNA array experiments. Newton et al. [29] used shrinkage estimators with a gamma prior on the intensity distributions to derive expression levels for cDNA array experiments. Recall that expression levels from oligonucleotide arrays are measured as the average of the difference between the PM and MM probe intensities, where 16 to 20 PM/MM probe pairs or oligos are used to represent a single gene. Adjustments to this estimate due to cross-hybridizing probes and contaminated array regions were discussed by Li and Wong [30]. Software for these computations is available at http://biosun1.harvard.edu/cli/dchip request.htm. Efron et al. [31] used an alternative estimate and measured gene expressions as the logarithm of the PM values minus one-half of the logarithm of the MM values, averaged over the 16 or 20 probe pairs. Applying this estimate to cell line data, they showed that this estimate of gene expression results in a more accurate detection of differentially expressed genes than the average difference estimate. Irizarry et al. (http://biosun01.biostat.jhsph.edu/ririzarr/papers/) proposed an estimate based on the probe level PM data. This was compared to the average difference estimator and Li and Wong's estimator and found to be a valid approach for measuring gene expression. An R package with the functions to perform these calculations is available at http://biosun01.biostat.jhsph.edu/ririzarr/Ray.
Specific data visualization and analytic methods for cDNA arrays have been discussed by Dudoit et al. [20]. Graphical display of the red and green intensities can be useful to visually identify genes with differential expression. They suggest several exploratory diagnostic plots to identify variations in the gene expressions, such as plotting the log intensity ratio versus the mean log intensity of all the genes. The log intensity ratio is defined as the base 2 logarithm of the test to reference expression ratio. The mean log intensity is defined as the base 2 logarithm of the square root of the product of the test and reference expressions. Other relevant graphical displays have been discussed by Sapir and Churchill [32] and Newton et al. [29].
All arrays are subject to background contamination of the intensity measurements, making comparison of gene expressions between different arrays unreliable and difficult to interpret. Hence, every array must be standardized to ensure that the average observed background contamination intensity level is comparable across all the arrays under study. This process of standardization is commonly referred to as scaling and normalization. Richmond et al. [33] used a two-step procedure to normalize cDNA array expressions: adjusting for background and adjusting for differences in average intensities between arrays. Scaling and normalization methods for oligonucleotide data are detailed in reference [34]. Alternative normalization methods for oligo arrays have been proposed by Irizarry et al.
As we have illustrated, both S-plus and SAS can be used to address general hypotheses related to gene expression studies. In addition, Eisen et al. [35] provide software for analysis and visualization based on hierarchical clustering algorithms (http://rana.lbl.gov). In Section 3.1 we illustrated several clustering algorithms. The interpretation of results from clustering algorithms must be treated with caution.
Since a set of clusters can be obtained for any set of randomly generated numbers using either hierarchical clustering or K-means clustering methods, these methods should not be utilized as an analytic tool for conclusive analyses. More objective statistical tests must be used to address specific hypotheses.
In Section 3.2 we illustrated gene identification using a two-sample t-test for two disease groups. This can easily be extended to more than two disease groups using ANOVA. Kerr and Churchill [24, 25] used ANOVA methods for cDNA arrays to model several sources of
variation. Pan et al. [36] applied mixture models to identify differentially expressed genes. Multiple comparison issues arising from these data can be addressed by either fixing the familywise error rate [20] or by fixing the false discovery rate [37]. Dudoit et al. provide a collection of R functions to identify differentially expressed genes from a cDNA array (http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html). The SAS code provided in Section 3.2 calculates permutation-based adjusted p-values by fixing the familywise error rate.
In Section 3.3 we illustrated class prediction methods using linear discriminant functions. Other more computationally intense approaches, such as nearest neighbour methods, CART, quadratic discrimination, and machine learning methods such as artificial neural networks, boosting and bagging, are illustrated and compared by Dudoit et al. [38]. Cautious interpretation of results from any algorithm is necessary due to sample size limitations.
ACKNOWLEDGEMENTS
The authors would like to thank William Gerald for providing the oncology data, Alex Smith for valuable discussions about the data, Mithat Gonen and Elyn Riedel for advice on the SAS macros, Alan Houghton for advice on genetics terminologies, and Colin Begg and an external reviewer for providing several insightful comments. This research was supported in part by National Institutes of Health grants GM60457 and CA84499.
REFERENCES
1. Lipshutz RJ, Fodor SPA, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nature Genetics Supplement 1999; 21:20–24.
2. Eisen MB, Brown PO. DNA arrays for analysis of gene expression. Methods in Enzymology 1999; 303:179–205.
3. Ramaswamy S, Golub TR. DNA microarrays in clinical oncology. Journal of Clinical Oncology 2002; 20(7):1932–1941.
4. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286(5439):531–537.
5. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Wilfond B, Sauter G, Kallioniemi OP, Borg A, Trent J. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine 2001; 344(8):539–548.
6. Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, Lashkari D, Shalon D, Brown PO, Botstein D. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Sciences 1999; 96(16):9212–9217.
7. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403(6769):503–511.
8. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V, Hayward N, Trent J. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000; 406(6795):536–540.
9. Clark EA, Golub TR, Lander ES, Hynes RO. Genomic analysis of metastasis reveals an essential role for RhoC. Nature 2000; 406(6795):532–535.
10. Coller HA, Grandori C, Tamayo P, Colbert T, Lander ES, Eisenman RN, Golub TR. Expression analysis with oligonucleotide microarrays reveals that MYC regulates genes involved in growth, cell cycle, signaling, and adhesion. Proceedings of the National Academy of Sciences 2000; 97(7):3260–3265.
11. Welsh JB, Zarrinkar PP, Sapinoso LM, Kern SG, Behling CA, Monk BJ, Lockhart DJ, Burger RA, Hampton GM. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proceedings of the National Academy of Sciences 2001; 98(3):1176–1181.
12. Luo J, Duggan DJ, Chen Y, Sauvageot J, Ewing CM, Bittner ML, Trent JM, Isaacs WB. Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Research 2001; 61(12):4683–4688.
13. Lee C-K, Weindruch R, Prolla TA. Gene-expression profile of the ageing brain in mice. Nature Genetics 2000; 25(3):294–297.
14. Nadler ST, Stoehr JP, Schueler KP, Tanimoto G, Yandell BS, Attie AD. The expression of adipogenic genes is decreased in obesity and diabetes mellitus. Proceedings of the National Academy of Sciences 2000; 97(21):11371–11376.
15. Clement K, Viguerie N, Diehn M, Alizadeh A, Barbe P, Thalamas C, Storey JD, Brown PO, Barsh GS, Langin D. In vivo regulation of human skeletal muscle gene expression by thyroid hormone. Genome Research 2002; 12(2):281–291.
16. Frueh FW, Hayashibara KC, Brown PO, Whitlock JP Jr. Use of cDNA microarrays to analyze dioxin-induced changes in human liver gene expression. Toxicology Letters 2001; 122(3):189–203.
17. Protchenko O, Ferea T, Rashford J, Tiedeman J, Brown PO, Botstein D, Philpott CC. Three cell wall mannoproteins facilitate the uptake of iron in Saccharomyces cerevisiae. Journal of Biological Chemistry 2001; 276(52):49244–49250.
18. Hartigan JA. Clustering Algorithms. Wiley: New York, 1975.
19. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Prentice-Hall: New Jersey, 1992.
20. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report 578, 2000, Department of Statistics, University of California, March 2002 (http://www.stat.Berkeley.EDU/webmastr/users/terry/zarray/Html/papersindex.html).
21. Westfall PH, Young SS. Resampling-Based Multiple Testing. Wiley: New York, 1993.
22. Westfall PH, Wolfinger RD. Multiple tests with discrete distributions. American Statistician 1997; 51(1):3–8.
23. Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests using the SAS System. SAS Institute: North Carolina, 1999.
24. Kerr MK, Churchill GA. Experimental design for gene expression microarrays. Biostatistics 2001; 2(2):183–201.
25. Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarrays. Genetical Research 2001; 77(2):123–128.
26. Pan W, Lin J, Le C. How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Technical report 2001-012, Division of Biostatistics, University of Minnesota, March 2002 (http://www.biostat.umn.edu/weip/paperSubj.html).
27. Kerr MK, Afshari CA, Bennett L, Bushel P, Martinez J, Walker NJ, Churchill GA. Statistical analysis of a gene expression microarray experiment with replication. Statistica Sinica 2002; 12:203–218.
28. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 1997; 2:364–374.
29. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 2001; 8(1):37–52.
30. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences 2001; 98(1):31–36.
31. Efron B, Tibshirani R, Goss V, Chu G. Microarrays and their use in a comparative experiment. Technical report 213, 2000, Department of Statistics, Stanford University, March 2002 (http://www-stat.stanford.edu/tibs/research.html).
32. Sapir M, Churchill GA. Estimating the posterior probability of differential gene expression from microarray data. Statistical Genetics Group, The Jackson Laboratory, 2001 (http://www.jax.org/research/churchill).
33. Richmond CS, Glasner JD, Mau R, Jin H, Blattner FR. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Research 1999; 27(19):3821–3835.
34. Affymetrix Microarray Suite User Guide. Version 4.0. 2000; Appendix A2, A3.
35. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 1998; 95(25):14863–14868.
36. Pan W, Lin J, Le C. A mixture model approach to detecting differentially expressed genes with microarray data. Technical report 2001-011, Division of Biostatistics, University of Minnesota, March 2002 (http://www.biostat.umn.edu/weip/paperSubj.html).
37. Efron B, Storey JD, Tibshirani R. Microarrays, empirical Bayes methods, and false discovery rates. Technical report 217, 2001, Department of Statistics, Stanford University, March 2002 (http://www-stat.stanford.edu/tibs/research.html).
38. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002; 97(457):77–87.
Part IV DATA REDUCTION OF COMPLEX DATA SETS
Statistical Approaches to Human Brain Mapping by fMRI. N. Lange
TUTORIAL IN BIOSTATISTICS
Disease map reconstruction
Andrew B. Lawson∗
Department of Mathematical Sciences, University of Aberdeen, Aberdeen, U.K.
SUMMARY
The analysis of the geographical distribution of disease incidence or prevalence is now of considerable importance for public health workers and epidemiologists alike. Important disease variations often have a spatial expression and so spatial analysis methods are an important additional tool in this connection. In this tutorial I have aimed to highlight the main issues relating to the analysis of disease where the goal is the reduction in noise in a disease map. This area is sometimes simply called disease mapping. A number of modelling approaches to disease mapping are considered and a case study highlighting the methods advocated is also included. Copyright © 2001 John Wiley & Sons, Ltd.
1. INTRODUCTION
The representation and analysis of maps of disease-incidence data is now established as a basic tool in the analysis of regional public health. The development of methods for mapping disease incidence has progressed considerably in recent years. One of the earliest examples of disease mapping is the map of the addresses of cholera victims related to the locations of water supplies, by John Snow in 1854 [1]. In that case, the street addresses of victims were recorded and their proximity to putative pollution sources (water supply pumps) was assessed. The subject area of disease mapping has developed considerably in recent years. This growth in interest has led to a greater use of geographical or spatial statistical tools in the analysis of data both routinely collected for public health purposes and in the analysis of data found within ecological studies of disease relating to explanatory variables. The study of the geographical distribution of disease can have a variety of uses. The main areas of application can be conveniently broken down into the following classes: (i) disease mapping; (ii) disease clustering; (iii) ecological analysis. In the first class, usually the object of the analysis is to provide (estimate) the true relative risk of a disease of interest across a geographical study area (map); a focus similar to the processing of pixel images to remove noise. Applications for such methods lie in health services resource allocation, and in disease atlas construction (see, for example, reference [2]).
∗ Correspondence to: Andrew B. Lawson, Department of Mathematical Sciences, University of Aberdeen, Aberdeen, U.K.
The second class, that of disease clustering, has particular importance in public health surveillance, where it may be important to be able to assess whether a disease map is clustered and where the clusters are located. This may lead to examination of potential environmental hazards. A particular special case arises when a known location is thought to be a potential pollution hazard. The analysis of disease incidence around a putative source of hazard is a special case of cluster detection. The third class, that of ecological analysis, is of great relevance within epidemiological research, as its focus is the analysis of the geographical distribution of disease in relation to explanatory covariates, usually at an aggregated spatial level. Many issues relating to disease mapping are also found in this area, in addition to issues relating specifically to the incorporation of covariates. In the following, the issues surrounding the first class of problems, namely disease mapping, are the focus of attention. While the focus here is on statistical methods and issues in disease mapping, it should be noted that the results of such statistical procedures are often represented visually in mapped form. Hence, some consideration must be given to the purely cartographic issues that affect the representation of geographical information. The method chosen to represent disease intensity on the map, be it colour scheme or symbolic representation, can dramatically affect the resulting interpretation of disease distribution. It is not the purpose of this tutorial to detail such cognitive aspects of disease mapping, but the reader is directed to some recent discussions of these issues [3–6].
2. DISEASE MAPPING AND MAP RECONSTRUCTION
To begin, we consider two different mapping situations which clearly demarcate approaches to this area. These situations are defined by the form of the mapped data which arises in such studies. First, the lowest level of aggregation of data observable in disease incidence studies is the case itself. Its geographical reference (georeference), usually the residential address of the case, is the basic mapping unit. This type of data is often referred to as 'case event' data. We usually define a fixed study area, denoted as W, the study window, within which occur m case events. We term this a realization of events within W. The locations of the residences of the cases are denoted as {x_i}, i = 1, …, m. The second type of data commonly found in such studies is a count of disease cases within arbitrarily defined administrative regions (tracts), such as census tracts, electoral districts or health authority areas. Essentially the count is an aggregation of all the cases within the tract. Therefore the georeference of the count is related to the tract 'location', where the individual case spatial references (locations) are lost. Denote the counts of disease within p tracts as {n_i}, i = 1, …, p. This latter form of data is often more readily available from routine data sources, such as government agencies, than the first form. Confidentiality can limit access to the case event realization. Figures 1 and 2 display examples of such data formats. The first example is the address locations of cancer of the larynx cases in southern Lancashire, England, for the period 1974–1983. The second example is of 26 enumeration districts (census tracts) in central Falkirk, Scotland, in which were collected the respiratory cancer disease counts for the period 1978–1983. Figure 3 highlights the correspondence between case event and count data.
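To make the two data formats concrete, the short sketch below builds a toy version of each: a set of case-event coordinates {x_i} and the corresponding tract counts {n_i} obtained by aggregation over an arbitrary grid of tracts. The data, the window and the grid of tracts are all hypothetical and serve only to illustrate the two structures; they are not the Lancashire or Falkirk data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Case-event data: m residential locations {x_i} inside a square study window W.
m = 50
case_xy = rng.uniform(0.0, 10.0, size=(m, 2))       # hypothetical coordinates (km)

# Tract-count data: the same cases aggregated to p administrative 'tracts'.
# Here the tracts are simply a 5 x 5 grid of 2 km squares covering W.
col = np.floor(case_xy[:, 0] / 2.0).astype(int)
row = np.floor(case_xy[:, 1] / 2.0).astype(int)
tract_index = row * 5 + col
p = 25
tract_counts = np.bincount(tract_index, minlength=p)   # {n_i}, i = 1, ..., p

print("m =", m, "case events")
print("tract counts n_i:", tract_counts)
```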
Figure 1. Larynx cancer case event realization: Lancashire, England.
Figure 2. Respiratory cancer: 26 census tract (enumeration district) counts, Falkirk, central Scotland.
3. DISEASE MAP RESTORATION
3.1. Simple statistical representations
The representation of disease-incidence data can vary from simple point object maps for cases and pictorial representation of counts within tracts, to the mapping of estimates from complex models purporting to describe the structure of the disease events. In this section, and Section 3.2, we describe the range of mapping methods from simple representations to model-based forms. The geographical incidence of disease has as its fundamental unit of observation the address location of cases of disease. The residential address (or possibly the employment address) of cases of disease contains important information relating to the type of exposure to environmental risks. Often, however, the exact address locations of cases are not directly available, and one must use instead counts of disease in arbitrary administrative regions, such as census tracts or postal districts. This lack of precise spatial information may be due to
Figure 3. Case event map with arbitrary regions superimposed. Reprinted from Lawson and Cressie, Spatial Statistical Methods for Environmental Epidemiology, ch. 11, in Handbook of Statistics (vol. 18), 2000, pp. 357– 396, with permission from Elsevier Science.
confidentiality constraints relating to the identification of case addresses or may be due to the scale of information gathering.
3.1.1. Crude representation of disease distribution. The simplest possible mapping form is the depiction of disease rates at specific sets of locations. For case events, this is a map of case-event locations. For counts within tracts, this is a pictorial representation of the number of events in the tracts plotted at a suitable set of locations (for example, tract centroids). The locations of case events within a spatially heterogeneous population can display a small amount of information concerning the overall pattern of disease events within a window. Ross and Davis [7] provide an example of such an analysis of leukaemia cluster data. However, any interpretation of the structure of these events is severely limited by the lack of information concerning the spatial distribution of the background population which might be 'at risk' from the disease of concern and which gave rise to the cases of disease. This population also has a spatial distribution and failure to take account of this spatial variation severely limits the ability to interpret the resulting case-event map. In essence, areas of high density of 'at risk' population would tend to yield high incidence of case events and so, without taking account of this distribution, areas of high disease intensity could be spuriously attributed to excess disease risk. Figure 4 displays an example of a case address map for Humberside, U.K., for cases of childhood leukaemia and lymphoma in a fixed time period. In the case of counts of cases of disease within tracts, similar considerations apply when crude count maps are constructed. Here, variation in population density also affects the spatial incidence of disease. It is also important to consider how a count of cases could be depicted in a mapped representation. Counts within tracts are totals of events from the whole tract region. If tracts are irregular, then a decision must be made either to 'locate' the count at some tract location (for example, tract centroid, however defined) with suitable symbolization, or to represent the count as a fill colour or shade over the whole tract (choropleth thematic map). In the former case, the choice of location will affect interpretation. In the latter case, symbolization choice (shade and/or colour) could also distort interpretation, although an attempt to represent the whole tract may be attractive.
Figure 4. Case address locations of childhood leukaemia and lymphoma in the Humberside region of the U.K. within a fixed time period. Reprinted from Lawson and Cressie, Spatial Statistical Methods for Environmental Epidemiology, ch. 11, in Handbook of Statistics (vol. 18), 2000, pp. 357–396, with permission from Elsevier Science.
In general, methods that attempt to incorporate the effect of background 'at risk' population are to be preferred. These are discussed in the next section.
3.1.2. Standardized mortality/morbidity ratios and standardization. To assess the status of an area with respect to disease incidence, it is convenient to attempt to first assess what disease incidence should be locally 'expected' in the tract area and then to compare the observed incidence with the 'expected' incidence. This approach has been traditionally used for the analysis of counts within tracts and can also be applied to case event maps.
Case events. Case events can be mapped as a map of point event locations. For the purposes of assessment of differences in local disease risk it is appropriate to convert these locations into a continuous surface describing the spatial variation in intensity of the cases. Once this surface is computed, then a measure of local variation is available at any spatial location within the observation window. Denote the intensity surface as λ(x), where x is a spatial location. This surface can be formally defined as the first-order intensity of a point process on ℝ² [8] and can be estimated by a variety of methods including density estimation [9]. To provide an estimate of the 'at risk' population at spatial locations, it is necessary first to choose a measure that will represent the intensity of cases 'expected' at such locations. Define this measure as g(x). Two possibilities can be explored. First, it is possible to obtain rates for the case disease from either the whole study window or a larger enclosing region. Often, these rates are available only at an aggregated level (for example, census tracts). The rates are obtained for a range of subpopulation categories which are thought to affect the case disease incidence. For example, the age and sex structure of the population or the deprivation status of the area (see, for example, reference [10]) could affect the amount of population 'at risk' from the case disease. The use of such external rates is often called external standardization [11]. It should be noted that rates computed from aggregated data will be less variable than
those based on density estimation of case events. An alternative method of assessing the 'at risk' population structure is to use a case event map of another disease, which represents the background population but is not affected by the aetiological processes of interest in the case disease. For example, the spatial distribution of coronary heart disease (CHD: ICD code, list A 410–414) could provide a control representation for respiratory cancer (ICD code, list A 162) when the latter is the case disease in a study of air pollution effects, as CHD is less closely related to air-pollution insult. Other examples of the cited use of a control disease would be: larynx cancer (case) and lung cancer (control) [12] (however, this control is complicated by the fact that lung cancer is also related to air pollution risk); lower body cancers (control) and gastric cancer (case), where lower body organs may only be affected by specific pollutants (for example, nickel [13]); and birth defects (case) and live births (control). While exact matching of diseases in this way will always be difficult, there is an advantage in the use of control diseases in case event examples. If a realization of the control disease is available in the form of a point-event map, then it is possible also to compute an estimate of the first-order intensity of the control disease. This estimate can then be used directly to compare case intensity with background intensity. Note that g(x) can be estimated, equally, from census-tract standardized rates (see, for example, reference [14]). The comparison of estimates of λ(x) and g(x) can be made in a variety of ways. First, it is possible to map the ratio form

$$\hat R(x) = \frac{\hat\lambda(x)}{\hat g(x)} \qquad (1)$$

This was suggested by Bithell [15]. Modifications to this procedure have been proposed by Lawson and Williams [16] and Kelsall and Diggle [17]. Care must be taken to consider the effects of study/observation window edges on the interpretation of the ratio. Some edge-effect compensation should be considered when there is a considerable influence of window edges in the final interpretation of the map. A detailed discussion of edge effects can be found elsewhere [18]. Apart from ratio forms, it is also possible to map transformations of ratios (for example, log R̂(x)) or to map

$$\hat D(x) = \hat\lambda(x) - \hat g(x) \qquad (2)$$

The choice of (1) or (2) will depend on the underlying model assumed for the excess risk. This is discussed in Section 3.2.1. In all the approaches above to the mapping of case event data, some smoothing or interpolation of the event or control data has to be made. The statistical properties of this operation depend on the method used for estimation of each component of the map. Optimal choices of the smoothing constant (that is, the bandwidth) are known for density estimation and kernel smoothing (see, for example, reference [9]) and these are discussed further in Section 3.1.3.
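As an illustration of the ratio form (1), the sketch below computes kernel density estimates of the case and control intensities on a grid and maps their ratio. The case and control locations, the window and the smoothing factor are all hypothetical; a common smoothing factor is used for numerator and denominator, in the spirit of the Kelsall and Diggle recommendation noted in Section 3.1.3, and no edge-effect correction is attempted.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical case and control (background) event locations in a unit window.
cases = rng.normal(loc=[0.6, 0.5], scale=0.12, size=(80, 2)).T      # shape (2, m)
controls = rng.uniform(0.0, 1.0, size=(400, 2)).T                   # shape (2, n)

# Kernel estimates of the case intensity lambda(x) and the background g(x),
# using the same smoothing factor for both surfaces.
bw = 0.15
lam_hat = gaussian_kde(cases, bw_method=bw)
g_hat = gaussian_kde(controls, bw_method=bw)

# Evaluate the ratio R_hat(x) = lambda_hat(x) / g_hat(x) over a regular grid.
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.vstack([xx.ravel(), yy.ravel()])
ratio = lam_hat(grid) / g_hat(grid)

print("R_hat(x) over the grid: min %.2f, max %.2f" % (ratio.min(), ratio.max()))
```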
Figure 5. Falkirk: respiratory cancer SMR thematic map.
Tract counts. As in the analysis of case events, it is usual to assess maps of count data by comparison of the observed counts to those counts 'expected' to arise given the 'at risk' population structure of the tract. Traditionally, the ratio of observed to expected counts within tracts is called a standardized mortality/morbidity ratio (SMR), and this ratio is an estimate of relative risk within each tract (that is, the ratio describes the odds of being in the disease group rather than the background group). The justification for the use of SMRs can be supported by the analysis of likelihood models with multiplicative expected risk (see, for example, reference [19]). In Section 3.2.1, we explore further the connection between likelihood models and tract-based estimators of risk. Figure 5 displays the SMR thematic map for the Falkirk example, based on expected rates calculated from the local male/female population counts and the Scottish respiratory cancer rate for the period. Define n_i as the observed count of the case disease in the ith tract, and e_i as the expected count within the same tract. Then the SMR is defined as

$$\hat R_i = \frac{n_i}{e_i} \qquad (3)$$

The alternative measure of the relation between observed and expected counts, which is related to an additive risk model, is the difference

$$\hat D_i = n_i - e_i \qquad (4)$$
In both cases, the comments made in Sections 3.1.1 and 3.2.1 about mapping counts within tracts apply. In this case it must be decided whether to express the R̂_i or D̂_i as fill patterns in each region, or to locate the result at some specified tract location, such as the centroid. If it is decided that these measures should be regarded as continuous across regions then some further interpolation of R̂_i or D̂_i must be made (see, for example, reference [19], pp. 198–199). This is discussed briefly in the next section. SMRs are commonly used in disease map presentation, but have many drawbacks. First, they are based on ratio estimators and hence can yield large changes in estimate with relatively small changes in expected value. In the extreme, when a (close to) zero expectation is found, the SMR will be very large for any positive count. Also, zero SMRs do not distinguish variation in expected counts, and the SMR variance is proportional to 1/e_i. The SMR is essentially a saturated estimate of relative risk and hence is not parsimonious.
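The sketch below computes the SMR (3) and the difference measure (4) for a handful of hypothetical tracts, together with the usual Poisson-based variance approximation var(R̂_i) ≈ n_i/e_i², which illustrates the instability of SMRs when the expected count is small. The counts and expected counts are invented for illustration only.

```python
import numpy as np

# Hypothetical observed counts n_i and externally standardized expected counts e_i.
n = np.array([4, 0, 7, 2, 11, 3], dtype=float)
e = np.array([3.1, 1.8, 5.2, 0.4, 6.9, 2.5])

smr = n / e          # equation (3): R_hat_i = n_i / e_i
diff = n - e         # equation (4): D_hat_i = n_i - e_i
smr_var = n / e**2   # approximate var(R_hat_i) under a Poisson assumption

for i in range(len(n)):
    print(f"tract {i + 1}: SMR = {smr[i]:5.2f}  difference = {diff[i]:+6.2f}  "
          f"var(SMR) ~ {smr_var[i]:5.2f}")
```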
3.1.3. Interpolation. In many of the mapping approaches mentioned above, use must be made of interpolation methods to provide estimates of a surface measure at locations where there are no observations. For example, we may wish to map contours of a set of tract counts if we believe the counts to represent a continuously varying risk surface. For the purposes of contouring, a grid of surface interpolant values must be provided. Smoothing of SMRs has been advocated by Breslow and Day [19]. Those authors employ kernel smoothing to interpolate the surface (in a temporal application). The advantage of such smoothing is that the method preserves the positivity condition of SMRs; that is, the method does not produce negative interpolants (which are invalid), unlike Kriging methods (see, for example, reference [20], for discussion of this issue). Other interpolation methods also suffer from this problem (see, for example, references [21–23]). Many mapping packages utilize interpolation methods to provide gridded data for further contour and perspective view plotting (for example, AXUM, S-plus). However, often the methods used are not clearly defined or they are based on mathematical rather than statistical interpolants (for example, the Akima or Delaunay interpolator [21]). Note that the comments above also apply directly to case-event density estimation. The use of kernel density estimation is recommended, with edge correction as appropriate. For ratio estimation, Kelsall and Diggle [17] recommend the joint estimation of a common smoothing parameter for the numerator and denominator of R(x) when a control-disease realization is available.
3.1.4. Exploratory methods. The discussion above, concerning the construction of disease maps, could be considered as exploratory analysis of spatial disease patterns. For example, the construction and mapping of ratios or differences of case and background measures is useful for highlighting areas of incidence requiring further consideration. Contour plots or surface views of such mapped data can be derived. Comments concerning the psychological interpretation of mapped patterns also apply here (see, for example, references [6] and [23]). However, inspection of maps of simple ratios or differences cannot provide accurate assessment of the statistical significance of, for example, areas of elevated disease risk. Proper inference requires statistical models, and that is the subject of the next section.
3.2. Basic models
In the previous section, we have discussed the use of primarily descriptive methods in the construction of disease maps. These methods do not introduce any particular model structure or constraint into the mapping process. This can be advantageous at an early or exploratory stage in the analysis of disease data but, when more substantive hypotheses and/or greater amounts of prior information are available concerning the problem, then it may be advantageous to consider a model-based approach to disease-map construction. Model-based approaches can also be used in an exploratory setting, and if sufficiently general models are employed then this can lead to better focusing of subsequent hypothesis generation. In what follows, we consider first likelihood models for case event data and then discuss the inclusion of extra information in the form of random effects.
3.2.1. Likelihood models. Denote a realization of m case events within a window W as {x_i}, i = 1, …, m. In addition, define the count of cases of disease within the ith tract W_i, i = 1, …, p, of an arbitrarily regionalized tract map as n_i.
Case-event data. Usually the basic model for case-event data is derived from the following assumptions:
(i) Individuals within the study population behave independently with respect to disease propensity, after allowance is made for observed or unobserved confounding variables.
(ii) The underlying 'at risk' population intensity has a continuous spatial distribution, within specified boundary vertices.
(iii) The case-events are unique, in that they occur as single spatially separate events.
Assumption (i) allows the events to be modelled via a likelihood approach, which is valid conditional on the outcomes of confounder variables. Further, assumption (ii), if valid, allows the likelihood to be constructed with a background continuous modulating intensity function g(x) representing the 'at risk' population. The uniqueness of case-event locations is a requirement of point process theory (the property called orderliness; see, for example, reference [24]), which allows the application of Poisson-process models in this analysis. Assumption (i) is generally valid for non-infectious diseases. It may also be valid for infectious diseases if the information about current infectives were known at given time points. Assumption (ii) will be valid at appropriate scales of analysis. It may not hold when large areas of a study window include zones of zero population (for example, harbours or industrial zones). Often models can be restricted to exclude these areas, however. Assumption (iii) will usually hold for relatively rare diseases but may be violated when households have multiple cases and these occur at coincident locations. This may not be important at more aggregate scales, but could be important at a fine spatial scale. Remedies for such non-orderliness are the use of declustering algorithms (which perturb the locations by small amounts), or analysis at a higher aggregation level. Note that it is also possible to use a conventional case-control approach to this problem [25]. Given the assumptions above, it is possible to specify that the case events arise as a realization of a Poisson point process, modulated by g(x), with first-order multiplicative intensity

$$\lambda(x) = \rho\, g(x) f(x; \theta) \qquad (5)$$

In this definition f(·) represents a function of confounder variables as well as location, θ is a parameter (vector) and ρ is the overall constant rate of the process. The confounder variables can be widely defined, however. For example, a number of random effects could be included to represent unobserved effects, as well as observed covariates, as could functions of other locations. The inclusion of random effects could be chosen if it is felt that unobserved heterogeneity is present in the disease process. This could represent the effect of known or unknown covariates which are unobserved. The likelihood associated with this is given by

$$L = \prod_{i=1}^{m} \lambda(x_i) \exp\left\{-\int_W \lambda(u)\, du\right\} \qquad (6)$$

For suitably specified f(·), a variety of models can be derived. In the case of disease mapping, where only the background intensity is to be accounted for, a reasonable approach to intensity parameterization is λ(x) = ρ g(x) f(x). The preceding definition can be used as an informal justification for the use of intensity ratios (λ̂(x)/ĝ(x)) in the mapping of case-event data; such ratios represent the local 'extraction' of the 'at risk' background, under a multiplicative hazard model. On the other hand, under a pure additive model, λ(x) = ρ[g(x) + f(x; θ)] say, differencing the two estimated rates would be supported.
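To make the likelihood (6) concrete, the sketch below evaluates the log-likelihood of a modulated Poisson process with intensity λ(x) = ρ g(x) f(x), approximating the integral over the window W by a Riemann sum on a regular grid. The window, the simulated case locations and the choice of a flat background g(x) and flat excess-risk surface f(x) are all hypothetical, and in practice g(x) would itself be estimated.

```python
import numpy as np

def poisson_process_loglik(cases_xy, rho, g_fun, f_fun,
                           window=(0.0, 1.0, 0.0, 1.0), ngrid=100):
    """Log of likelihood (6): sum of log lambda(x_i) minus the integral of
    lambda(u) over the window W, with lambda(x) = rho * g(x) * f(x)."""
    x0, x1, y0, y1 = window
    lam_at_cases = rho * g_fun(cases_xy) * f_fun(cases_xy)
    # Riemann-sum approximation of the window integral of lambda(u).
    gx, gy = np.meshgrid(np.linspace(x0, x1, ngrid), np.linspace(y0, y1, ngrid))
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    cell_area = ((x1 - x0) / ngrid) * ((y1 - y0) / ngrid)
    integral = np.sum(rho * g_fun(grid) * f_fun(grid)) * cell_area
    return np.sum(np.log(lam_at_cases)) - integral

rng = np.random.default_rng(2)
cases = rng.uniform(0.0, 1.0, size=(60, 2))           # hypothetical case locations
g = lambda xy: np.ones(len(xy))                       # flat 'at risk' background
f = lambda xy: np.ones(len(xy))                       # no excess risk
print("log-likelihood at rho = 60:",
      round(poisson_process_loglik(cases, 60.0, g, f), 2))
```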
Tract-count data. In the case of observed counts of disease within tracts, the Poisson-process assumptions given above mean that the counts are Poisson distributed with, for each tract, a different expectation, ∫_{W_i} λ(u) du, where W_i denotes the extent of the ith tract. Then the log-likelihood based on a Poisson distribution is, bar an additive constant only depending on the data, given by

$$l = \sum_{i=1}^{p}\left[ n_i \log \int_{W_i} \lambda(u)\, du - \int_{W_i} \lambda(u)\, du \right] \qquad (7)$$
where p is the number of tracts. Often a parameterization in (7) is assumed where, as in the case-event example, the intensity is defined as a simple multiplicative function of the background g(x). An assumption is often made at this point that the integration over the ith tract area can be regarded as a parameter within a model hierarchy, without considering the spatial continuity of the intensity. This ignores any conditioning on ∫_W λ(u) du, the total integral over the study region, and may lead to erroneous inference for spatially dependent covariates included within the specification of λ(u). This assumption leads to considerable simplifications, but at a cost. The effect of such an approximation should be considered in any application example, but is seldom considered in the existing literature (references [26, 27, 62] provide reviews). The mapping of 'extracted' intensities for case events, or modified SMRs for tract counts, is based on the view that once the 'at risk' background is extracted from the observed data, the resulting distribution of risk represents a 'clean' map of the ground truth. Of course, as the background function g(x) must usually be estimated, some variability in the resulting map will occur through the inclusion of different estimators of g(x). For example, for tract-count data, the use of external standardization alone to estimate the expected counts within tracts may provide a different map from that provided by a combination of external standardization and measures of tract-specific deprivation (for example, deprivation indices [10]). If any confounding variables are available and can be included within the estimate of the 'at risk' background, then these should be considered for inclusion within the g(x) function. Examples of confounding variables could be found from national census data, particularly relating to socio-economic measures. These measures are often defined as 'deprivation' indicators, or could relate to life-style choices. For example, the local rate of car ownership or the percentage unemployed within a census tract or other small area could provide a surrogate measure for increased risk, owing to correlations between these variables and poor housing, smoking lifestyles and ill health. Hence, if it is possible to include such variables in the estimation of g(x), then the resulting map will display a close representation of the 'true' underlying risk surface. When it is not possible to include such variables within g(x), it is sometimes possible to adapt a mapping method to include covariables of this type by inclusion within f(x) itself.
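Under the 'decoupling' assumption just described, the tract integral ∫_{W_i} λ(u) du is replaced by a parameter of the form e_i θ_i, and the log-likelihood (7) reduces to a sum of ordinary Poisson terms. The sketch below maximizes this simplified likelihood for a single common relative risk θ across a few hypothetical tracts; the data are invented, and the closed-form answer is simply Σn_i / Σe_i.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical tract data: observed counts n_i and expected counts e_i.
n = np.array([4, 0, 7, 2, 11, 3], dtype=float)
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])

def neg_loglik(theta):
    """Negative of the 'decoupled' version of log-likelihood (7):
    each tract integral is taken to be mu_i = e_i * theta, with a common
    relative risk theta (additive constants in the data are dropped)."""
    mu = e * theta
    return -np.sum(n * np.log(mu) - mu)

fit = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print("ML estimate of the common relative risk:", round(fit.x, 3))
print("closed-form check sum(n)/sum(e):        ", round(n.sum() / e.sum(), 3))
```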
3.2.2. Random effects and Bayesian models. In the sections above some simple approaches to mapping intensities and counts within tracts have been described. These methods assume that once all known and observable confounding variables are included within the g(x) estimation then the resulting map will be clean of all artefacts and hence depict the true excess risk surface. However, it is often the case that unobserved effects could be thought to exist within the observed data and that these effects should also be included within the analysis. These effects are often termed random effects, and their analysis has provided a large literature both in statistical methodology and in epidemiological applications (see, for example, references [28–34]). Within the literature on disease mapping, there has been considerable growth in recent years in the modelling of random effects of various kinds. In the mapping context, a random effect could take a variety of forms. In its simplest form, a random effect is an extra quantity of variation (or variance component) which is estimable within the map and which can be ascribed a defined probabilistic structure. This component can affect individuals or can be associated with tracts or covariables. For example, individuals vary in susceptibility to disease and hence individuals who become cases could have a random component relating to different susceptibility. This is sometimes known as frailty. Another example is the interpolation of a spatial covariable to the locations of case events or tract centroids. In that case, some error will be included in the interpolation process, and could be included within the resulting analysis of case or count events. Also, the locations of case-events might not be precisely known or might be subject to some random shift, which may be related to uncertain residential exposure. (However, this type of uncertainty may be better modelled by a more complex integrated intensity model, which no longer provides an independent observation model.) Finally, within any predefined spatial unit, such as tracts or regions, it may be expected that there could be components of variation attributable to these different spatial units. These components could have different forms depending on the degree of prior knowledge concerning the nature of this extra variation. For example, when observed counts, thought to be governed by a Poisson distribution, display greater variation than expected (that is, variance > mean), this is sometimes described as overdispersion. This overdispersion can occur for various reasons. Often it arises when clustering occurs in the counts at a particular scale. It can also occur when considerable numbers of cells have zero counts (sparseness), which can arise when rare diseases are mapped. In spatial applications, it is furthermore important to distinguish two basic forms of extra variation. First, as in the aspatial case, a form of independent and spatially uncorrelated extra variation can be assumed. This is often called uncorrelated heterogeneity (see, for example, reference [35]). Another form of random effect is that which arises from a model where it is thought that the spatial unit (such as case-events, tracts or regions) is correlated with neighbouring spatial units. This is often termed correlated heterogeneity. Essentially, this form of extra variation implies that there exists spatial autocorrelation between spatial units (see reference [36] for an accessible introduction to spatial autocorrelation). This autocorrelation could arise for a variety of reasons. First, the disease of concern could be naturally clustered in its spatial distribution at the scale of observation. Many infectious diseases display such spatial clustering, and a number of apparently non-infectious diseases also cluster (see, for example, references [37, 38]). Second, autocorrelation can be induced in spatial disease patterns by the existence of unobserved environmental or frailty effects. Hence the extra variation observed in any application could arise from confounding variables that have not been included in the analysis. In disease mapping examples, this could easily arise when simple mapping methods are used on SMRs with just basic age-sex standardization. In the discussion above on heterogeneity, it is assumed that a global measure of heterogeneity applies to a mapped pattern.
That is, any extra variation in the pattern can be captured by including a general heterogeneity term in the mapping model. However, often spatially-specific heterogeneity may arise, where it is important to consider local effects as well as, or instead of, general heterogeneity. To differentiate these two approaches, we use the terms specific and non-specific heterogeneity. Specific heterogeneity implies that spatial locations are to be modelled locally; for example, clusters of disease are to be detected on the map. In contrast, 'non-specific' describes a global approach to such modelling, which does not address the question of the location of effects. In this definition, it is tacitly assumed that the locations
of clusters of disease can be regarded as random effects themselves. Hence, there are strong parallels between image processing tasks and the tasks of disease mapping. Random effects can take a variety of forms and suitable methods must be employed to provide correctly estimated maps under models including these effects. In this section, we discuss simple approaches to this problem both from a frequentist and a Bayesian viewpoint.
A frequentist approach. In what follows, we use the term 'frequentist' to describe methods that seek to estimate parameters within a hierarchical model structure. The methods do assume that the random effects have mixing (or prior) distributions. For example, a common assumption made when examining tract counts is that n_i ∼ Pois(e_i θ_i) independently, and that θ_i ∼ G(α, β). This latter distribution is often assumed for the Poisson rate parameter and provides for a measure of overdispersion relative to the Poisson distribution itself, depending on the α, β values used. The joint distribution is now given by the product of a Poisson likelihood and a gamma distribution. At this stage a choice must be made concerning how the random intensities are to be estimated or otherwise handled. One approach to this problem is to average over the values of θ_i to yield what is often called the marginal likelihood. Having averaged over this density, it is then possible to apply standard methods such as maximum likelihood. This is usually known as marginal maximum likelihood (see, for example, references [39, 40]). In this approach, the parameters of the gamma distribution are estimated from the integrated likelihood. A further development of this approach is to replace the gamma density with a finite mixture. This approach is essentially non-parametric and does not require the complete specification of the parameter distribution (see, for example, reference [41]). Although the example specified here concerns tract counts, the method described above can equally be applied to case-event data, by inclusion of a random component in the intensity specification.
A Bayesian approach. It is natural to consider modelling random effects within a Bayesian framework. First, random effects naturally have prior distributions, and the joint density discussed above is proportional to the posterior distribution for the parameters of interest. Hence, full Bayes and empirical Bayes (posterior approximation) methods have developed naturally in the field of disease mapping. The prior distribution(s) for the parameters (θ, say) in the intensity specification ρ g(x) f(x; θ) have hyperparameters (in the Poisson–gamma example above, these were α, β). These hyperparameters can also have hyperprior distributions. The distributions chosen for these parameters depend on the application. In the full Bayesian approach, inference is based on the posterior distribution of θ given the data. However, as in the frequentist approach above, it is possible to adopt an intermediate approach where the posterior distribution is approximated in some way, and subsequent inference may be made via 'frequentist-style' estimation of parameters or by computing the approximated posterior distribution. In the tract-count example, approximation via intermediate prior-parameter estimation would involve the estimation of α and β, followed by inference on the estimated posterior distribution (see, for example, reference [42], pp. 67–68). Few examples exist of simple Bayesian approaches to the analysis of case-event data in the disease-mapping context.
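For the gamma–Poisson model just described, integrating out θ_i gives a negative binomial marginal distribution for each count, so the marginal maximum likelihood step can be carried out directly. The sketch below does this for a few hypothetical tracts; the parameterization (shape α, rate β) and the invented data are assumptions of the illustration, not part of the original analysis.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

# Hypothetical tract counts and expected counts.
n = np.array([4, 0, 7, 2, 11, 3], dtype=float)
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])

def neg_marginal_loglik(log_params):
    """Negative log marginal likelihood of the gamma-Poisson model:
    n_i | theta_i ~ Pois(e_i * theta_i), theta_i ~ Gamma(shape alpha, rate beta).
    Integrating out theta_i gives a negative binomial marginal for n_i."""
    alpha, beta = np.exp(log_params)          # work on the log scale for positivity
    ll = (gammaln(n + alpha) - gammaln(alpha) - gammaln(n + 1.0)
          + alpha * np.log(beta / (beta + e))
          + n * np.log(e / (beta + e)))
    return -np.sum(ll)

fit = minimize(neg_marginal_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(fit.x)
print("marginal ML estimates: alpha = %.3f, beta = %.3f" % (alpha_hat, beta_hat))
```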
One approach which has been described [32] can be used with simple prior distributions for parameters, and the authors provide approximate empirical Bayes estimators based on Dirichlet tile area integral approximations. For count data, a number of
Figure 6. Falkirk: empirical Bayes relative risk estimates (gamma–Poisson model).
examples exist where independent Poisson distributed counts (with constant within-tract rate θ_i) are associated with prior distributions of a variety of complexity. The earliest examples of such a Bayesian mapping approach can be found in Manton et al. [28] and Tsutakawa [30]. Also, Clayton and Kaldor [43] developed a Bayesian analysis of a Poisson likelihood model where n_i has expectation θ_i e_i, and found that, with a prior distribution given by θ_i ∼ G(α, β), the Bayes estimate of θ_i is the posterior expectation

$$\hat\theta_i = \frac{n_i + \alpha}{e_i + \beta} \qquad (8)$$

Hence one could map these Bayes estimates directly. Now, the distribution of θ_i conditional on n_i is G(n_i + α, e_i + β), and a Bayesian approach would require summarization of θ_i from this posterior distribution. In practice, this is often obtained by generation of realizations from this posterior, and the summarizations are then empirical (for example, MCMC methods). Figure 6 displays the Bayes estimates under the gamma–Poisson model with α and β estimated as in Clayton and Kaldor. Note that in contrast to the SMR map (Figure 5), Figure 6 presents a smoother relative risk surface. Other approaches and variants in the analysis of simple mapping models have been proposed by Tsutakawa [30], Marshall [31] and Devine and Louis [44]. In the next section, more sophisticated models for the prior structure of the parameters of the map are discussed.
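The posterior expectation (8) is a simple shrinkage of the SMR towards the prior mean, with tracts having small expected counts shrunk most. A minimal sketch follows; the values of α and β are placeholders standing in for estimates obtained, for example, from the marginal maximum likelihood sketch above or from the Clayton and Kaldor approach, and the counts are again invented.

```python
import numpy as np

n = np.array([4, 0, 7, 2, 11, 3], dtype=float)
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])

alpha_hat, beta_hat = 2.0, 2.0         # placeholder estimates of the gamma prior parameters

smr = n / e
eb = (n + alpha_hat) / (e + beta_hat)  # equation (8): posterior expectation of theta_i

for i in range(len(n)):
    print(f"tract {i + 1}: SMR = {smr[i]:5.2f}  ->  empirical Bayes estimate = {eb[i]:5.2f}")
```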
3.3. Advanced Bayesian models
Many of the models discussed above can be extended to include the specification of prior distributions for parameters and hence can be examined via Bayesian methods. In general, we distinguish here between empirical Bayes methods and full Bayes methods on the basis that any method which seeks to approximate the posterior distribution is regarded as empirical Bayes [45]. All other methods are regarded as full Bayes. This latter category includes maximum a posteriori estimation and estimation of posterior functionals, as well as posterior sampling.
3.3.1. Empirical Bayes methods. The methods encompassed under the definition above are wide ranging, and here we will only discuss a subset of relevant methods. The first method considered by the earliest workers was the evaluation of simplified (constrained) posterior distributions. Manton and co-workers [28] used a direct maximization of a constrained posterior distribution, Tsutakawa [30] used integral approximations for posterior expectations, while Marshall [31] used a method of moments estimator to derive shrinkage estimates. Devine and Louis [44] further extended this method by constraining the mean and variance of the collection of estimates to equal the posterior first and second moments. The second type of method which has been considered in the context of disease mapping is the use of likelihood approximations. Clayton and Kaldor [43] first suggested employing a quadratic normal approximation to a Poisson likelihood, with a gamma prior distribution for the intensity parameter of the Poisson distribution and a spatial correlation prior. This leads to estimates via the EM algorithm. Extensions to this approach lead to simple GLS estimators for a range of likelihoods [32, 46, 47]. A third type is the Laplace asymptotic integral approximation, which has been applied by Breslow and Clayton [33] to a generalized linear modelling framework in a disease-mapping example. This integral approximation method allows the estimation of posterior moments and normalizing integrals (see, for example, reference [45], pp. 340–344). A further, but different, integral approximation method is where the posterior distribution is integrated across the parameter space; that is, the nuisance parameters are 'integrated out' of the model. In that case the method of non-parametric maximum likelihood (NPML) can be employed. This approximates the integral of the posterior distribution by a weighted sum, and leads to consideration of a finite mixture sum that can be estimated via the EM algorithm [39, 40, 43]. Another possibility is to employ linear Bayes methods, and Marshall [31] has proposed a variety of estimators of this type. These estimators are based on global or local shrinkage. Linear Bayes methods have the disadvantage that they violate the likelihood principle [48], and no estimate of variability is readily available.
3.3.2. Full Bayes methods. Full posterior inference for Bayesian models has only recently become feasible, largely because of the increased use of MCMC methods of posterior sampling. The first full sampler reported for a disease mapping example was a Gibbs sampler applied to a general model for intrinsic autoregression and uncorrelated heterogeneity by Besag et al. [35]. Subsequently, Clayton et al. [33, 49, 50] have adapted this approach to mapping, ecological analysis and space-time problems. This has been facilitated by the availability of general Gibbs sampling packages such as BEAM and BUGS (GeoBugs and WinBugs). Such Gibbs sampling methods can be applied to putative source problems as well as mapping/ecological studies. However, specific variations in model components, for example variation in the spatial correlation structure, cannot be easily accommodated in this general Bayesian package. Alternative, and more general, posterior sampling methods, such as the Metropolis–Hastings algorithm, are currently not available in a packaged form, although these methods can accommodate considerable variation in model specification.
More recently, Metropolis–Hastings algorithms have been applied in comparison to approximate MAP estimation by Lawson et al. [32] and Diggle et al. [51], and hybrid Gibbs–Metropolis samplers have been applied to space-time problems by Waller et al. [52]. In addition, diagnostic methods for Bayesian MCMC sample output have been discussed for disease mapping examples by Zia et al. [53]. Developments in this area have been reviewed recently [54, 62].
4. RESIDUALS AND GOODNESS-OF-FIT
The analysis of residuals and summary functions of residuals forms a fundamental part of the assessment of model goodness-of-fit in any area of statistical application. Spatial epidemiology is no exception, although full residual analysis is seldom presented in published work in the area. Often goodness-of-fit measures are aggregate functions of piecewise residuals, while measures relating to individual residuals are also available. A variety of methods are available when full residual analysis is to be undertaken. We define a piecewise residual as the standardized difference between the observed value and the fitted model value. Usually the standardization will be based on a measure of the variability of the difference between the two values. Within a frequentist paradigm, it is common practice to specify a residual as

$$r_{1i} = y_i - \hat y_i \quad \text{or} \quad r_{2i} = \frac{y_i - \hat y_i}{\sqrt{\operatorname{var}(y_i - \hat y_i)}} \qquad (9)$$

where ŷ_i is a fitted value under a given model. When complex spatial models are considered, it is often easier to examine residuals, such as {r_1i}, using Monte Carlo methods. In fact it is straightforward to implement a parametric bootstrap (PB) approach to residual diagnostics for likelihood models. The simplest case is that of tract count data, where for each tract an observed count can be compared to a fitted count. In general, when Poisson likelihood models are assumed, with n_i ∼ Pois{∫_{W_i} ρ g(u) f(u; θ) du}, it is straightforward to employ a PB by generating a set of simulated counts {n^s_{ij}}, j = 1, …, J, from a Poisson distribution with mean ∫_{W_i} ρ̂ ĝ(u) f(u; θ̂) du. In this way, a tract-wise ranking, and hence p-value, can be computed by assessing the rank of the residual within the pooled set

$$\left\{\, n_i - \int_{W_i} \hat\rho\, \hat g(u) f(u; \hat\theta)\, du;\;\; n^s_{ij} - \int_{W_i} \hat\rho\, \hat g(u) f(u; \hat\theta)\, du,\; j = 1, \ldots, J \,\right\}$$
Denote the observed standardized residual as r_{2i} and the simulated residuals as {r^s_{2ij}}. Note that it is now possible to compare functions of the residuals as well as making direct comparisons. For example, in a spatial context it may be appropriate to examine the spatial autocorrelation of the observed residuals. Hence, a Monte Carlo assessment of the degree of residual autocorrelation could be made by comparing Moran's I statistic for the observed residuals, say M({r_{2i}}), to that found for the simulated count residuals, M({r^s_{2ij}}).
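A minimal sketch of this parametric bootstrap for tract counts is given below. The fitted model is simply a constant relative risk, so the fitted tract expectation is e_i multiplied by the overall SMR; the data, the number of simulations and the one-sided form of the ranking are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n = np.array([4, 0, 7, 2, 11, 3])
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])

theta_hat = n.sum() / e.sum()        # fitted (constant) relative risk
mu_hat = e * theta_hat               # fitted tract expectations

J = 999
sims = rng.poisson(mu_hat, size=(J, len(n)))   # simulated counts n_ij^s from the fitted model

obs_resid = n - mu_hat
sim_resid = sims - mu_hat

# Tract-wise Monte Carlo p-value: rank of the observed residual within the
# pooled set of observed and simulated residuals (one-sided, large residuals extreme).
p_values = (1 + np.sum(sim_resid >= obs_resid, axis=0)) / (J + 1)
print("tract-wise bootstrap p-values:", np.round(p_values, 3))
```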
In the situation where case events are available, it is not straightforward to define a residual. As the data are in the form of locations, it is not possible to compare observed and fitted values directly. However, by a suitable transformation, it is possible to compare measures which describe the spatial distribution of the cases. A model which fits the data well should provide a good fit to the spatial distribution of the cases. It is possible to examine the difference between a local estimate of the case density, λ̂(x_i), and that predicted from a fitted model, λ̂*(x_i), that is, at the ith location

$$r_i = \hat\lambda_i - \hat\lambda_i^{*} \qquad (10)$$

where λ_i ≡ λ(x_i).
This approach has been proposed in the derivation of a deviance residual for modulated heterogeneous Poisson process models [55]. Essentially, a saturated estimate of λ̂_i based on the Dirichlet tile area of the ith event can be employed, while a model-based estimate of λ̂*_i is used in the comparison. This residual can incorporate estimated expected rates. It is possible to simulate J realizations from the fitted model, {x^s_{kj}}, k = 1, …, m, j = 1, …, J, and the local density of these realizations could be compared pointwise with λ̂*_i. Of course, these proposals rely on a series of smoothing operations. More complex alternative procedures could be pursued. In a Bayesian setting it is natural to consider the appropriate version of (9). Carlin and Louis [42] describe a Bayesian residual as

$$r_i = y_i - \frac{1}{G}\sum_{g=1}^{G} E\big(y_i \mid \theta_i^{(g)}\big) \qquad (11)$$

where E(y_i | θ_i) is the expected value from the posterior predictive distribution, and (in the context of MCMC sampling) {θ_i^{(g)}} is a set of parameter values sampled from the posterior distribution. In the tract count modelling case, this residual can therefore be approximated, when a constant tract rate is assumed, by

$$r_i = n_i - \frac{1}{G}\sum_{g=1}^{G} e_i\,\theta_i^{(g)} \qquad (12)$$
This residual averages over the posterior sample. An alternative possibility is to average the {θ_i^{(g)}} sample, to give θ̄_i say, yielding a posterior expected value of n_i, say n̂_i = e_i θ̄_i, and to form r_i = n_i − n̂_i. A further possibility is simply to form r_{2i} at each iteration of a posterior sampler and to average these over the converged sample [56]. These residuals can provide pointwise goodness-of-fit (GOF) measures as well as global GOF measures, and can be assessed using Monte Carlo methods. Deletion residuals and residuals based on conditional predictive ordinates (CPOs) can also be defined for tract counts [57]. To further assess the distribution of residuals, it would be advantageous to be able to apply the equivalent of the PB in the Bayesian setting. With convergence of an MCMC sampler, it is possible to make subsamples of the converged output. If these samples are separated by a distance (h) which will guarantee approximate independence [58], then a set of J such samples could be used to generate {n^s_{ij}}, j = 1, …, J, with n^s_{ij} ∼ Pois(e_i θ̂_{ij}), and the residual computed from the data, r_i, can be compared to the set of J residuals computed from n^s_{ij} − E(n_i), where E(n_i) is the predictive expected value of n_i. In turn, these residuals can be used to assess functions of the residuals and GOF measures. The choice of J will usually be 99 or 999, depending on the level of accuracy required. When the constant tract rate (decoupling) approximation is not appropriate, an integral Poisson expectation must be evaluated.
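The sketch below computes the Bayesian residual (12) from a matrix of posterior samples of the tract relative risks. Real output from a converged MCMC sampler would be used in practice; here the 'samples' are drawn from the conjugate gamma posterior of the gamma–Poisson model with placeholder prior parameters, purely so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(4)

n = np.array([4, 0, 7, 2, 11, 3])
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])

# Stand-in for MCMC output: G posterior samples of theta_i for each tract,
# drawn here from the conjugate gamma posterior with alpha = beta = 2 (placeholders).
G, alpha, beta = 1000, 2.0, 2.0
theta = rng.gamma(shape=n + alpha, scale=1.0 / (e + beta), size=(G, len(n)))

# Equation (12): residual formed by averaging the fitted value over the posterior sample.
r = n - np.mean(e * theta, axis=0)
print("posterior-averaged residuals:", np.round(r, 2))
```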
In the situation where case events are examined, it is also possible to derive a Bayesian residual, since E{λ(x | θ^{(g)})} can be evaluated from the {θ_i^{(g)}} posterior samples. Hence it is possible to examine

$$r_i = \hat\lambda_i - \frac{1}{G}\sum_{g=1}^{G} \hat\lambda_i^{*(g)}$$
where λ̂_i^{*(g)} is the fitted model estimate of intensity corresponding to the gth posterior sample. Further, it is also possible, with subsampling for approximate independence, to use a PB approach to residual significance testing.
5. A CASE STUDY
As a demonstration of some aspects of modelling applied to the disease mapping situation, a variety of models have been fitted to the Falkirk example. While this example only includes 26 tracts, it serves well to exemplify the differences in approach which each model provides. These models consist of a range of ingredients, from the simple SMR, through empirical Bayes (EB) and linear Bayes (the Marshall local estimator), to full Bayes based on the approach of Besag et al. (BYM) [35]. The empirical Bayes method used here consists of estimating the relative risks from the gamma–Poisson model using the method of Clayton and Kaldor. The linear Bayes estimator is a linear function of the SMR and is computed as the (local) relative risk

$$\hat\theta_i = \hat c_i + \frac{\hat b_i}{\hat b_i + \hat c_i/e_i}\,\{\varphi_i - \hat c_i\} \qquad (13)$$

with

$$\hat c_i = \frac{\sum_{j\sim i} n_j}{\sum_{j\sim i} e_j}, \qquad \hat b_i = \frac{\sum_{j\sim i} e_j\{\varphi_j - \hat c_j\}^2}{\sum_{j\sim i} e_j} - \frac{\hat c_i}{\sum_{j\sim i} e_j}$$
where φ_i is the SMR for the ith area and j ∼ i denotes the (first) neighbourhood of the ith area. The first neighbourhood consists of tracts/areas which have a common boundary with the ith tract/area. The BYM model assumed in this example is that defined by the original authors, log θ_i = u_i + v_i, where u_i and v_i are tract-specific correlated and uncorrelated heterogeneity terms, and no trend component is included. The prior and hyperprior distributions for these components are defined as follows:

$$p(u \mid r) \propto \frac{1}{r^{m/2}} \exp\left\{-\frac{1}{2r}\sum_i\sum_{j\in\partial i}(u_i - u_j)^2\right\}$$

$$p(v \mid \kappa) \propto \kappa^{-m/2} \exp\left\{-\frac{1}{2\kappa}\sum_{i=1}^{p} v_i^2\right\}$$

$$\mathrm{prior}(r, \kappa) \propto e^{-\varepsilon/2r}\, e^{-\varepsilon/2\kappa}, \qquad \kappa, r > 0$$

where ∂i is the (first) neighbourhood of the ith area. The posterior sampling for this model was achieved by Metropolis–Hastings updates. Convergence was achieved within 1000 iterations. Figures 6, 7 and 8 display thematic maps of the resulting relative risk estimates. The posterior expected relative risks are displayed for the BYM model.
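A minimal computational sketch of the linear Bayes (Marshall local) estimator follows. It uses invented counts, expected counts and a made-up first-neighbourhood structure rather than the 26 Falkirk tracts, and the weights follow the reconstruction of equation (13) above (each SMR shrunk towards its neighbourhood mean ĉ_i, with the local variance estimate b̂_i truncated at zero); it is an illustration of the idea, not a reproduction of the case-study computation.

```python
import numpy as np

n = np.array([4, 0, 7, 2, 11, 3], dtype=float)
e = np.array([3.1, 1.8, 5.2, 4.0, 6.9, 2.5])
# Hypothetical first neighbourhoods (indices of tracts sharing a boundary).
nbrs = [[1, 2], [0, 2, 3], [0, 1, 4], [1, 4, 5], [2, 3, 5], [3, 4]]

phi = n / e                                    # SMRs
c_hat = np.empty(len(n))
b_hat = np.empty(len(n))
for i, J_i in enumerate(nbrs):
    c_hat[i] = n[J_i].sum() / e[J_i].sum()     # local (neighbourhood) mean risk

for i, J_i in enumerate(nbrs):
    e_sum = e[J_i].sum()
    b_hat[i] = (np.sum(e[J_i] * (phi[J_i] - c_hat[J_i]) ** 2) / e_sum
                - c_hat[i] / e_sum)            # local between-area variance estimate
b_hat = np.maximum(b_hat, 0.0)                 # truncate negative variance estimates at zero

# Equation (13): shrink each SMR towards its local mean.
theta_lb = c_hat + (b_hat / (b_hat + c_hat / e)) * (phi - c_hat)
print("SMRs:              ", np.round(phi, 2))
print("linear Bayes (LB): ", np.round(theta_lb, 2))
```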
Figure 7. Linear Bayes relative risk estimators: Falkirk respiratory cancer mortality.
Figure 8. Posterior expected relative risk estimates under full Bayesian model (BYM): Falkirk respiratory cancer mortality.
Comparison of these model results can be made. In general, the EB and BYM methods provide overall shrinkage of the relative risks, but maintain the ranking of risk areas. The BYM model produces the smoothest map, as would be expected given its inclusion of spatially-correlated heterogeneity. The overall shrinkage in range found, in order of increasing severity, is SMR (0.30–2.040), EB (0.390–1.880), BYM (0.773–1.236). The linear Bayes model (LB) appears to yield similar shrinkage to the EB model, but to a higher mid-range. However, the LB model leads to changes in the ranking of risk areas compared to the other methods. Figures 9 and 10 display the Bayesian residual (12) for the BYM model with correlated and uncorrelated heterogeneity, and the
Figure 9. Crude Bayesian residual surface for BYM model based on (12): Falkirk respiratory cancer mortality.
Figure 10. P-value surface computed from the ranking of the observed crude residual amongst 99 residuals generated via a parametric bootstrap based on pseudo-independent realizations of the fitted model.
p-value surface for the ranking of the residual, based on a parametric bootstrap described in Section 4. The p-value surface displays no unusual rankings (range 0.11–0.77). These results suggest that the BYM model provides a reasonable overall fit, although the shrinkage of all relative risks to close to 1.0 may be of concern. Marginal residual analysis could also be performed here. Note that marginal analyses may suggest a feasible model fit, but there may still be considerable undetected structure (autocorrelation) in the residuals, which may suggest the possibility of achieving better model
fits by inclusion of extra variates. The examination of the spatial distribution of residuals is an important aspect of residual analysis, often ignored in published accounts of Bayesian disease map modelling (see, for example, reference [59]).
6. DISCUSSION AND CONCLUSIONS
This paper has focused on the issue of disease mapping, that is, the production of 'clean' maps of the geographical distribution of disease (incidence or prevalence). An example of the application of a variety of methods, ranging from likelihood to full Bayes methods, has been examined. The results of this application tend to highlight a variety of issues which commonly arise in such applications. First, crude SMR relative risk estimates often contain artefacts which relate to the method of estimation as well as to unobserved heterogeneity in the data. These artefacts can be partially removed by the inclusion of random effects. Under Bayesian models, these random effects can be posterior-sampled and the associated relative risk estimates can be averaged to produce a posterior-average relative risk. Approximate methods which seek to estimate the prior distribution parameters directly in these set-ups lead to empirical Bayes methods. Variants of these methods lead to linear Bayes methods. Finally, the incorporation of both uncorrelated and correlated heterogeneity can be modelled within the full Bayesian posterior sampling framework. It is clear that, while empirical Bayes methods lead to relatively simple estimators of relative risk, other methods provide greater flexibility in the choice of model and hence possibilities for analysis. While EB and LB methods can be implemented more easily, the use of full Bayesian methods has many advantages, not least of which is the ability to specify a variety of components and prior distributions in the model set-up. Sampling of posterior distributions is now relatively easily achieved with software such as WinBugs, which provides Gibbs sampling for a variety of hierarchical Bayesian models. In public health applications, it may be important to provide some guidance as to the use of such methods and to recommend relatively simple methods to be used by practitioners in such areas. From a variety of studies, it would appear that the simplest approach to the problem lies in the use of empirical Bayes estimators (for example, the gamma–Poisson model), as these can be computed relatively easily and appear to be reasonably robust. However, a full Bayesian approach has many benefits, although it may not be easily implemented by non-statisticians. The use of crude SMR maps cannot be recommended and, if employed, as is commonly the case in practice, they should be accompanied by maps of variability. Indeed, it is to be recommended that all relative risk mapping exercises should include small area variability estimates as standard output. Of course, even when good methods of estimation are adopted, there remains the issue of visual interpretation of mapped results, which adds a further stage of processing to the spatial information. A number of issues have not been addressed in this review, for the sake of brevity. Developments in spatio-temporal modelling have recently accelerated [50, 52, 53, 60–62], while the issue of edge effects or censoring at map boundaries is also of concern [18, 62]. In addition, the emphasis has largely been placed on the analysis of tract count data, while in many situations the residential addresses of cases may be known. Hopefully this tutorial and case study will provide an entry point to the literature in this area, and interested readers are recommended to pursue relevant topics in the cited literature.
REFERENCES
1. Snow J. On the Mode of Communication of Cholera. 2nd edn. Churchill Livingstone: London, 1854.
2. Pickle L, Mungiole M, Jones G, White A. Exploring spatial patterns of mortality: the new atlas of United States mortality. Statistics in Medicine 1999; 18:3211–3220.
3. MacEachren AM. How Maps Work: Representation, Visualisation, and Design. Guilford Press: New York, 1995.
4. Monmonier M. How to Lie with Maps. 2nd edn. University of Chicago Press: London, 1996.
5. Pickle LW, Hermann DJ. Cognitive aspects of statistical mapping. Technical Report 18, NCHS Office of Research and Methodology, Washington, D.C., 1995.
6. Walter SD. Visual and statistical assessment of spatial clustering in mapped data. Statistics in Medicine 1993; 12:1275–1291.
7. Ross A, Davis S. Point pattern analysis of the spatial proximity of residences prior to diagnosis of persons with Hodgkin's disease. American Journal of Epidemiology 1990; 132 supplement:53–62.
8. Lawson AB, Waller L. A review of point pattern methods for spatial modelling of events around sources of pollution. Environmetrics 1996; 7:471–488.
9. Härdle W. Smoothing Techniques: with Implementation in S. Springer-Verlag: New York, 1991.
10. Carstairs V. Small area analysis and health service research. Community Medicine 1981; 3:131–139.
11. Inskip H, Beral V, Fraser P, Haskey P. Methods for age-adjustment of rates. Statistics in Medicine 1983; 2:483–493.
12. Diggle PJ. A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. Journal of the Royal Statistical Society, Series A 1990; 153:349–362.
13. Lawson AB, Williams FLR. Spatial competing risk models in disease mapping. Statistics in Medicine 2000; 19:2451–2468.
14. Lawson AB, Williams FLR. Armadale: a case study in environmental epidemiology. Journal of the Royal Statistical Society, Series A 1994; 157:285–298.
15. Bithell J. An application of density estimation to geographical epidemiology. Statistics in Medicine 1990; 9:691–701.
16. Lawson AB, Williams FLR. Applications of extraction mapping in environmental epidemiology. Statistics in Medicine 1993; 12:1249–1258.
17. Kelsall J, Diggle P. Non-parametric estimation of spatial variation in relative risk. Statistics in Medicine 1995; 14:2335–2342.
18. Lawson AB, Biggeri A, Dreassi E. Edge effects in disease mapping. In Disease Mapping and Risk Assessment for Public Health, Lawson AB, Böhning D, Lesaffre E, Biggeri A, Viel J-F, Bertollini R (eds). Wiley, 1999; chapter 6, 85–97.
19. Breslow N, Day N. Statistical Methods in Cancer Research, Volume 2: The Design and Analysis of Cohort Studies. International Agency for Research on Cancer: Lyon, 1987.
20. Lawson AB, Cressie N. Spatial statistical methods for environmental epidemiology. In Handbook of Statistics: Bio-Environmental and Public Health Statistics, Rao CR, Sen PK (eds). Elsevier, 2000; volume 18, 357–396.
21. Lancaster P, Salkauskas K. Curve and Surface Fitting: an Introduction. Academic Press: London, 1986.
22. Green PJ, Silverman BW. Nonparametric Regression and Generalised Linear Models. Chapman and Hall: London, 1994.
23. Ripley BD. Spatial Statistics. Wiley: New York, 1981.
24. Daley DJ, Vere-Jones D. An Introduction to the Theory of Point Processes. Springer-Verlag: New York, 1988.
25. Diggle PJ, Morris S, Wakefield J. Point-source modelling using matched case-control data. Biostatistics 2000; 1:1–17.
26. Marshall RJ. A review of methods for the statistical analysis of spatial patterns of disease. Journal of the Royal Statistical Society, Series A 1991; 154:421–441.
27. Lawson AB, Böhning D, Biggeri A, Lesaffre E, Viel J-F. Disease mapping and its uses. In Disease Mapping and Risk Assessment for Public Health. Wiley, 1999; chapter 1.
28. Manton KG, Woodbury MA, Stallard E. A variance components approach to categorical data models with heterogeneous mortality rates in North Carolina counties. Biometrics 1981; 37:259–269.
29. Searle SR, Casella G, McCulloch CE. Variance Components. Wiley: New York, 1992.
30. Tsutakawa RK. Mixed model for analysing geographic variability in mortality rates. Journal of the American Statistical Association 1988; 83:37–42.
31. Marshall R. Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics 1991; 40:283–294.
32. Lawson AB, Biggeri A, Lagazio C. Modelling heterogeneity in discrete spatial data models via MAP and MCMC methods. In Proceedings of the 11th International Workshop on Statistical Modelling, Forcina A, Marchetti G, Hatzinger R, Galmacci G (eds). Graphos: Citta di Castello, 1996; 240–250.
33. Breslow N, Clayton DG. Approximate inference in generalised linear mixed models. Journal of the American Statistical Association 1993; 88:9–25.
34. Clayton DG. A Monte Carlo method for Bayesian inference in frailty models. Biometrics 1991; 47:467–485.
35. Besag J, York J, Mollié A. Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 1991; 43:1–59.
36. Cliff AD, Ord JK. Spatial Processes: Models and Applications. Pion: London, 1981.
37. Cuzick J, Mills M. Clustering and clusters: summary. In Geographical Epidemiology of Childhood Leukaemia and Non-Hodgkin Lymphomas in Great Britain 1966–1983, Draper G (ed.). HMSO: London, 1991; 123–125.
38. Glick BJ. The spatial autocorrelation of cancer mortality. Social Science and Medicine 1979; 13D:123–130.
39. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: an application of an EM algorithm. Psychometrika 1981; 46:443–459.
40. Aitkin M. A general maximum likelihood analysis of overdispersion in generalised linear models. Statistics and Computing 1996; 6:251–262.
41. Aitkin M. Empirical Bayes shrinkage using posterior random effect means from nonparametric maximum likelihood estimation in general random effect models. In Proceedings of the 11th International Workshop on Statistical Modelling, Forcina A, Marchetti GM, Hatzinger R, Galmacci G (eds). Graphos: Citta di Castello, 1996; 87–92.
42. Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall: London, 1996.
43. Clayton DG, Kaldor J. Empirical Bayes estimates of age-standardised relative risks for use in disease mapping. Biometrics 1987; 43:671–691.
44. Devine O, Louis T. A constrained empirical Bayes estimator for incidence rates in areas with small populations. Statistics in Medicine 1994; 13:1119–1133.
45. Bernardo JM, Smith AFM. Bayesian Theory. Wiley: New York, 1994.
46. Lawson AB. On using spatial Gaussian priors to model heterogeneity in environmental epidemiology. Statistician 1994; 43:69–76.
47. Lawson AB. Some spatial statistical tools for pattern recognition. In Quantitative Approaches in Systems Analysis, volume 7, Stein A, Penning de Vries FWT, Schut J (eds). C. T. de Wit Graduate School for Production Ecology, 1997; 43–58.
48. Lawson AB, Biggeri A, Böhning D, Lesaffre E, Viel J-F, Clark A, Schlattmann P, Divino F. Disease mapping models: an empirical evaluation. Statistics in Medicine 2000; 19:2217–2242.
49. Clayton D, Bernardinelli L. Bayesian methods for mapping disease risk. In Geographical and Environmental Epidemiology: Methods for Small-Area Studies, Elliot P, Cuzick J, English D, Stern R (eds). Oxford University Press, 1992.
50. Bernardinelli L, Clayton D, Pascutto C, Montomoli C, Ghislandi M, Songini M. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine 1995; 14:2433–2443.
51. Diggle P, Tawn J, Moyeed R. Model-based geostatistics. Journal of the Royal Statistical Society, Series C 1998; 47:299–350.
52. Waller L, Carlin B, Xia H, Gelfand A. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association 1997; 92:607–617.
53. Xia H, Carlin BP, Waller LA. Hierarchical models for mapping Ohio lung cancer rates. Environmetrics 1997; 8:107–120.
54. Lawson AB, Böhning D, Lesaffre E, Biggeri A, Viel J-F, Bertollini R (eds). Disease Mapping and Risk Assessment for Public Health. Wiley, 1999.
55. Lawson AB. A deviance residual for heterogeneous spatial Poisson processes. Biometrics 1993; 49:889–897.
56. Spiegelhalter D, Best N, Gilks W, Inskip H. Hepatitis B: a case study in MCMC methods. In Markov Chain Monte Carlo in Practice, Gilks W, Richardson S, Spiegelhalter D (eds). Chapman and Hall, 1996.
57. Stern H, Cressie N. Posterior predictive model checks for disease mapping models. Statistics in Medicine 2000; 19: special issue on disease mapping with emphasis on evaluation of methods.
58. Robert CP, Casella G. Monte Carlo Statistical Methods. Springer-Verlag: New York, 1999.
59. Ghosh M, Natarajan K, Stroud T, Carlin B. Generalized linear models for small-area estimation. Journal of the American Statistical Association 1998; 93:273–282.
60. Xia H, Carlin B. Spatio-temporal models with errors in covariates: mapping Ohio lung cancer mortality. Statistics in Medicine 1998; 17:2025–2043.
61. Knorr-Held L, Besag J. Modelling risk from disease in time and space. Statistics in Medicine 1998; 17:2045–2060.
62. Lawson AB. Statistical Methods in Spatial Epidemiology. Wiley: New York, 2001.
Part V SIMPLIFIED PRESENTATION OF MULTIVARIATE DATA
Presentation of Multivariate Data. L. M. Sullivan, J. M. Massaro and R. B. D'Agostino Sr.
Index acalculia 388 adaptive rejection sampling 88 adjusted likelihood estimator 265 aected sib pair (ASP) 346, 348 lod score 354–5 parametric modelling 347 aggregated information 307 agraphia 388 AIDS 138 Akaike’s information criterion (AIC) 52, 102, 174–5 all-cause mortality curve 228 allele frequencies 340 alleles 328–9, 332, 340, 342, 345 American Board of Internal Medicine (ABIM), Patient Satisfaction Questionnaire 49 American Thoracic Society (ATS) 7 analysis of covariance (ANCOVA) 10, 15–16 with random eects 56–9 analysis of variance (ANOVA) 254, 372, 377 angosia 388 approximate likelihood 290–1, 314 AR(1) structures 148, 166, 170, 197–8 AR(1)+RE 166–7, 179 AR(1)@UN 206 at-risk population 427 autocorrelation function (ACF) 90 autoregressive correlation structure 31 autoregressive with random eect for patient 170–1 autosomal chromosomes 340 autosomal loci 328 autosomal trait 328 autosomes 328
best linear unbiased estimates (BLUE) 160, 184 best linear unbiased prediction (BLUP) 44, 59, 138 beta-binomial likelihood 239–43 beta-binomial model 228 binomial likelihood 237 bivariate approach 299–307 bivariate data, more precise analysis 314 bivariate meta-analysis with covariates 311–14 bivariate random eects model 301–6, 315 bivariate regression model 311 block diagonal structure 72 blocking factor 87 blood types 329 BMDP 145, 154–6 BMDP 5V 144, 160, 188 BMI 95–126 analysis of 96 BMI growth curves based on cross-sectional analysis 121 clinical implication of development 103–4 databases 104–6 discussion of results 117–22 model construction 106–8 models considered 109–13 pooled data 106 practical implication 121 results 113 smoking data 117 statistical analyses 106–13 BMI prediction over time 115 body mass index see BMI Bonferroni correction 404 bootstrapping 91–2, 441 brain, in action 398 brain damage 385–7 brain function 385 localization 385 brain structure 386 brain tissue 391 Brief Psychiatric Rating Scale (BPRS) 131, 142–4, 146–7 Broca’s area 385 Brodmann’s areas 385 BUGS 264, 268, 272–3, 279, 321, 436 output for random-eects lidocaine meta-analysis 285–6 bump function 222–3 BYM model 439–41
bandwidth limited function 396 base pairs 340 Bayesian analysis 234 Bayesian approach 209–10, 216, 264, 434–5 Bayesian Inference Using Gibbs Sampling see BUGS Bayesian information criterion (BIC) 53, 102 Bayesian models 88–91, 432–5 advanced 435–6 empirical 321 hierarchical 321 Bayesian residual 438 Bayesian statistics in meta-analysis 321–2 BCG vaccine against tuberculosis 293 BEAM 436
478 cardiovascular death distribution by clinic 243 distribution of probability 244 probability 227, 231 relative risk 232–3 relative risk of estimated baseline risk 231 cardiovascular risk factors 230 distribution of baseline 244 carriers 329 case events, mapping 427 CD4 count 138 Centers for Disease Control and Prevention (CDC) 104 cerebrospinal uid 391 chi-square distribution 345 chi-statistic 13 childhood, leukaemia and lymphoma 427 children pulmonary function development 133 model selection and checking 149–54 cholesterol in CHD risk estimation 449 chromosomes 328, 340 class discovery 364 class prediction 373–6 classication error 374 classication rules 374 cluster analysis 364–5 clustered data 3–33, 129 COMP model 206–8 complementary DNA (cDNA) 362–3, 378 complex traits, genetic mapping 339–59 composite covariance models 199–200 composite hypotheses 216–17 transforming into simple ones 234 compound symmetric (CS) covariance structure 169–70 compound symmetric (CS) structure 166, 178, 198 compound symmetry 197 conditional distributions 89 conditional likelihood for relative risk 235–9 conditional predictive ordinates (CPOs) 438 condence interval 216, 291–2, 296–9 continuous risk factors 463 contrast-to-noise ratio 401 convolution, Fourier transform in 396–7 coronary heart disease (CHD) multivariable models 447–76 risk estimates 447 risk factors 447–8 spatial distribution 428 correlated data
regression analysis 127 use of term 129 correlated heterogeneity 433 correlated responses 3–33 correlation, Fourier transform in 396–7 correlation coecients unstructured 32 user xed 32 correlation matrix 140 correlogram 173 cosegreagation 341 covariance, Fourier transform in 396–7 covariance components estimating techniques 43–4 HLM/2L 53–5 hypothesis tests 45 SAS Proc Mixed 53–5 covariance matrix 312–13 sampling distribution 184 3 x 3 89 covariance models 197–206 for nested repeated measures data 187–208 covariance parameters 197 covariance structure 72 comparison of ts 173–5 eects on tests of xed eects, estimates of xed eects and standard errors of estimates 175–80 example data set 160–2 for repeated measures 165–7 modelling 159–85 repeated measures data analysis 183 Cox proportional hazards model 464–74 Cox proportional hazards regression models 449 Cox regression models 133 Cronbach’s alpha internal consistency reliability coecient 49 cross-classication 82–3 cross-covariance function 397 cross-over design 134 cross-validation 374 crude Bayesian residual surface 441 data resolution 4, 8, 12–13, 28 Data Safety Monitoring Committee (DSMC) 233–4 delta method 301 dendrogram from average linkage hierarchical clustering 366 diagnosis of adult-onset diabetes 225 diagnostic test for disease 212–13, 218 diallelic loci 329, 343, 351 dichotomous traits 329 dierence measures, more precise analysis 316 diuse prior distributions 89 Diggle model 141, 149, 151 discrete Fourier transform 397
discrete genotypes 332–3 discrete phenotypes 332–3 disease cases, count of 424 disease clustering 424 disease counts within tracts 426, 428–9, 431–2 disease distribution, crude representation 426 disease incidence, status of area 427 disease incidence data, representation 425 disease mapping applications 423 basic models 430–2 basic situations 424 case-event data 430–1 case study 439–42 discussion 442 exploratory methods 430 likelihood models 430 overview 423–4 reconstruction 423–44 restoration 425–36 statistical methods and issues 424 DNA 328 overview 362 DNA sequence 329, 361 dominant trait 329 dose-response curve data 128 double heterozygote 332 doubly repeated measures 188 dummy variables 461 echo-planar imaging (EPI) 392, 397–8 ecological analysis 424 eect size 259 eective population size 344 Elston–Stewart (peeling) algorithm 342 EM algorithm 301 empirical Bayes estimates 44, 138, 141, 151, 154, 315 empirical Bayes methods 116, 129, 434–6, 442 empirical Bayes model 321 empirical Bayes relative risk estimates 435 empirical power spectrum 398 empirical spectral density 398 ERIC (Educational Resources Information Center) 254 estimated standard error 45 estimating techniques covariance components 43–4 xed eects 43 hierarchical linear modelling 42–4 random eects 44 estimation HLM/2L 39–45 SAS Proc Mixed 39–45 Euler characteristic 406 evidential paradigm 210–17, 219
F tests 176 F-statistics 368 familial distributions of medical traits 327 family studies 327 fast Fourier transform algorithm 397 FEVU1u 133–4, 138, 142, 149–52, 154, 161–3, 165, 172, 180, 184 Fieller’s method 301 Fisher’s linear discriminant function 373–6 xed coecients 100 xed eects 10, 159 design matrix 137 estimates of 183 estimating techniques 43 HLM/2L 53–5 hypothesis tests 45 SAS Proc Mixed 53–5 xed eects model 253, 260–1, 263, 265–6, 293–6 xed parameter estimates 304 xed parameters 71 fMRI 383–422 aim of 387 basic principals 388–402 binary mask 413 case study 411–17 complots 414 computing summary images 414 design of experiments 418 discussion 417–19 echo-planar imaging (EPI) 392, 397–8 eective spatial resolution 400, 414–15 example 388 eld gradients 393 gradient history space 393 homogeneous stochastic processes 407 image collection and data structures 411–12 image pre-processing 412–14 image quality 398–402 image quality measures 400 k-space 393 low-variance screen 413–14 magnetic elds 392 non-parametric methods 404–5 orthogonal gradient elds 392 outline of tutorial 384 parametric methods 405–11 phase-encode direction 395 practical measures of image quality 401–2 primary imaging parameters 399–400 results of case study 415–17 results of experiments 418 signal generation 389–93 signicance by spatial extent 410–11 spatial and temporal smoothness 407–8 spatial resolution 398 spatial time series analysis 402–11
3D localization and measurement 389–93 trade–os 401 two-dimensional 395 see also human brain mapping founders/non-founders 330 Fourier coecients 394 Fourier frequencies 394 Fourier methods for image reconstruction 393–7 Framingham Heart Study (FHS) 104, 447–76 frequentist approach 209–10, 434 full Bayes methods 436 full Bayesian model (BYM) 440 full information maximum likelihood estimation (FIML) 100 functional equation approach 306 functional magnetic resonance imaging see fMRI funnel plots 267 gametes 328, 341 Gaussian autocovariance function 408 Gaussian density 399 Gaussian distribution 399 Gaussian random elds 405–6 gene expression, overview 361–3 gene expression data analysis 361–79 discussion 376–8 example 364–76 statistical questions 363–4 gene identication 370–2 genealogy, specication 331 general heterogeneity 433 general linear mixed model (GLMM) 127–58, 184, 314, 320 case studies 130–3, 144–54 design considerations 134–5 development 134–43 equation dening 137 estimation and testing 137–8 extension 141 missing data 142–3 model checking 141–2 model development 135–7 model selection 138–41 modelling the mean 138–40 modelling the variance 140–1 generalized estimating equations (GEEs) 3–33, 120, 206 correlation structures 31–2 models 9, 13, 71 with exchangeable correlation structure 16–17 generalized least squares (GLS) 43, 177 estimator 137 xed eects estimates 160 theory 184 generalized linear models 8, 85–6
generic prior distributions 89 genes 328, 340, 362 genetic background 330 genetic data set 353 analysis 354 genetic drift 344 genetic epidemiology 327–38 aim of 327–8 sampling design and strategy in 335–6 genetic mapping of complex traits 339–59 example 352–5 genetic marker 341 genetic models 340 components of 330, 340 specication 330–3 terminology 328–30 genetic terminology 327, 340–4 genetic traits, distribution of 336 genotypes 328, 330–4, 340, 344–6 Gibbs–Metropolis samplers 436 Gibbs sampling 88–9, 436 goodness-of-t (GOF) 108–9, 310, 437–9 gradient pulses 392 Gram–Schmidt orthonormalization 406–7 grey matter 383–4, 391 growth curve data 128 growth curves development of 95–126 graphical representation 98 model tting 101 see also BMI growth curves growth function, individual and group 99 haemodynamic response functions 408–9 haplotype 340, 351 haplotype frequencies 343 Haseman–Elston regression model 349 hazard function 87 hazard rates versus relative risk 229 Hedges’ g 259 Henry A. Murray Research Center 104 heterogeneity between studies 307 global measure of 433 heterogeneity tests 292, 346 heterozygotes 329 heterozygous genotype 340 hierarchical clustering 365 hierarchical data modelling 69 hierarchical data structures 35–6 three-level 36 two-level 36–7 hierarchical linear modelling 35–68, 95–126 analysis of simulated data 60 applications concerning continuous dependent variables 38 assumptions 39
combined model 38–9 estimation techniques 42–4 HLM/2L 40–1, 61–2 hypothesis testing 44–5 level 1 37–9 level 2 37–9 meta-analysis 120 model specications 37–9 motivation 36 notation 37 objective 36 overview 35–6 results 110–11, 118–19 SAS Proc Mixed 41, 61–2 hierarchical models versus marginal models 71–2 high-density lipoprotein (HDL) 449 HLM 74 overview 96 pedagogical exposition 96–103 statistical software 102–3 versus OLS 100–2 HLM/2L 35–6, 39, 46–8 comparison of estimates 62–6 covariance components 52–5 estimation 39–45 xed eects 53–5 hierarchical linear modelling 40–1, 61–2 hypothesis testing 39–45 inputting data 46–7 output 47–8 program specications 47 random coecients regression model 56 homogeneity test 263 homologous chromosomes 341 homozygotes 329 homozygous genotype 340 human brain mapping 383–422 outline of tutorial 384 overview 383–8 see also fMRI hyperparameters 260, 434 hyperprior distribution 439 hypothesis testing covariance components 45 xed eects 45 hierarchical linear modelling 44–5 HLM/2L 39–45 random eects 45 SAS Proc Mixed 39–45 type I and II errors 218–21 identity-by-descent (IBD) 342–3, 346–9, 354 identity link 8 incomplete data 29–30 independence, correlation structure 31 independent variables 101 informative for linkage 341
informative prior information 89 inner repeat factor (IRF) 188, 206 Institute for Social Research 104 intercept time 463 International Registry of Vision Trials 254–5 interpolation methods 429–30 intra-class correlation 4 intra-class correlation coecient 31 iterative generalized least squares (IGLS) 11, 73, 81, 108 jack-knife method 374, 376 K-means clustering 365–8 Kruskal–Wallis test 372 L’Abb e plot 300 larynx cancer case event realization 425 Law of Likelihood 210–11, 214, 216, 219, 221, 233–4 left hemisphere 386–7 length of stay (LOS) 250 leukaemia 363 lidocaine 257, 261, 274 diagnostics 276 meta-analysis 273 prophylactic use 249–51 summary of results 276–7 use in heart attacks 275–7 xed-eects results 275 homogeneity of study means 275 random-eects results 275–6 likelihood-based linkage disequilibrium approaches 352 likelihood based methods 210, 232 likelihood functions 213–15, 224, 226, 228, 335 for relative risk 236 S-plus computer code 236–43 likelihood methods for measuring statistical evidence 209–45 likelihood ratio tests 138, 142 likelihood ratios 209–12, 216, 218 Neyman–Pearson hypothesis testing theoryhll 219 likelihood support interval 215 likelihood theory 335 likelihoods of models 333–5 linear Bayes relative risk estimators 440 linear mixed models 159 for repeated measures 163–5 MIXED procedure 167–73 link functions 85 linkage 340–1 complex diseases 348 disequilibrium mapping 350–2 equilibrium/disequilibrium 343–4 informative for 341
linkage analysis model-based 344–6, 350 model-free 346–50 log-likelihood 77, 292 log-likelihood ratio 335 log-odds 304–5 versus latitude 313 log-odds ratio 290, 295, 297–300, 308 logit link 85 longitudinal data analysis 127–58 use of term 128 longitudinal studies, examples 134 Luria–Delbrck method for estimating mutation rates in exponentially growing bacteria 351–2 mapping case events 427 see also disease mapping; genetic mapping; human brain mapping marginal likelihood 434 marginal models 9, 301 versus hierarchical models 71–2 marginal residual analysis 441 Markov chain Monte Carlo see MCMC maximum likelihood (ML) analysis of general linear mixed model, statistical software for 143–4 maximum likelihood (ML) estimates 43, 73, 188, 224, 297, 299, 301, 304, 335 under Normality 74 MCMC algorithms 88 methods 86, 88–91, 120, 321–2, 352, 436, 438 sampling 90 summary screen 91 medical traits, familial distributions of 327 Mendelian laws of inheritance 340 Mendelian laws of segregation 328–9, 342 Mendelian models 328 meta-analysis 83–5, 249–87 advanced methods 289–324 analysis under homogeneity 291 Bayesian statistics in 321–2 beginning 252–4 bivariate 289, 299–307 dening the study outcome 257–9 denition 249 designs 252 diagnostics 265–8 evaluating retrieved literature 256–7 formulation 252–6 hierarchical linear modelling 120 inference 261–5 literature search 254–6
means and eect sizes 258–9 modelling variation 260–5 objectives 249 odds ratios 257–8 one-dimensional treatment eects 291–9 operational denitions 252 quantitative aspects of literature search 255 relative risks 257–8 reporting 273–5 risk dierences 257–8 software 268–73 sources of information 254 study objectives 252 summary measures 257 summary statistics 257 test statistics 259 use to estimate average eect 250 with multiple outcomes 317–18 meta-regression 307–14 method of moments (MOM) 265, 276–7, 279 Metropolis-Hastings (MH) algorithm 436 Metropolis-Hastings (MH) sampling 88 Metropolis-Hastings (MH) update 439 microarrays 362–3 Mini-Wright portable meter 7 MINITAB 144 misclassication rate 374–6 misleading evidence 217–25 denition and implications 217–18 probability of observing 218, 221 missing at random (MAR) 142, 192 missing completely at random (MCAR) 192 missing data, general linear mixed model 142–3 mixed models 8, 71, 301, 333 BMDP code for 154–6 general linear 142–3 implementation 160 SAS code for 156–7 MIXOR 74 ML3, applications 144 MLA code and output 122–3 mLn 5, 321 MLwin 74–5, 321 equation screen with estimates 76 equation screen with model display 75 equation screen with model unspecied 74 growth data example 75–80 output 124 predicted growth curves 77 model-based standard error 17 monogenic trait 345 Monte Carlo 5 per cent monitoring bounds 225 Morgan (unit) 341 multi-dimensional scaling (MDS) 368–70 multidisciplinary care for stroke inpatients 278–82 diagnostics 280–1 xed-eects results 279
homogeniety of study means 279 random-eects results 279 summary of results 281–2 multilevel linear modelling 96 multilevel mixed modelling 3–33 multilevel modelling 5, 8, 10–11, 17–22 nature of 70–1 of medical data 69–93 with exchangeable correlation structure 21 Multilevel Models Project 74 multilocus models 348 multi-parameter models 224 multiple ascertainment of a family 336 multiple logistic regression 244 multiple logistic regression model 453–61 estimates of regression coecients 454 multiple membership structures 82–3 multiple outcomes 317–18 Multiple Risk Factor Intervention Trial (MRFIT) 104 multivariable models coronary heart disease 447–76 interpretation issues 474–5 parameters 461, 464 risk estimates 474–5 multivariate data, use of term 129 multivariate maximum likelihood (MML) estimate 320 multivariate normal mode, estimation 72–4 multivariate response data 80–2 naive pooled analysis, PEF 8, 13–14 naive pooling 4 naive regression model 9 Nanostat 11 National Center for Health Statistics (NCHS) 104 National Cholesterol Education Program (NCEP) Adult Treatment Panel III (ATP III) 449–53 National Health and Nutrition Examination Survey I (NHANES-I) 106 National Health and Nutrition Examination Survey III (NHANES-III) 103 National Health Epidemiologic Follow-up (NHEFS) 106 National Heart Lung and Blood Institute 448 nested modelling 96 nested repeated measures data 188 covariance models for 187–208 neuroanatomy 386 neurons 383 Neurosurgery Clinical Trials Registry 254 Neyman-Pearson hypothesis testing theory, likelihood ratios 219 NHLBI guidelines 103 non-normal mixtures 317 non-specic heterogeneity 433 Normal standardized deviate 13
Normality, maximum likelihood estimates under 74 normalized cross-covariance maps 409–10 NTIS (National Technical Information Service) bibliographic database 254 nuclear families 328 nuisance parameters 224–5, 228 null hypothesis 345–7 Nyquist (or folding) frequency 396 obesity see overweight/obese adults observation level covariate 134 oligonucleotides 362–3, 376 optimization algorithm 143 ordinary least squares (OLS) 38, 97, 117, 177 estimate 59 methods 96 modelling 129 procedure 99 versus HLM 100–2 outcome measures 320 outcomes, multiple 317–18 outer repeat factor (ORF) 188, 206 ovarian hormone values 196 ovarian steroid secretion data 187–208 ovarian steroid secretion study 206 overweight/obese adults 95–126 p-values 201, 206, 296, 355, 370–2, 403, 441 pairwise reasoning 217 parametric modelling of aected sib pairs 347 partial autocorrelation function (PACF) 90 peak expiratory ow (PEF) clinical characteristics 14 clinical conclusions 28–9 computing statistics of principal interest 22–4 correlation structure 8 correlation structure modelling 30–1 correlations arising from varying intercept 8–13, 15–21 data preparation 7 extending the correlation structure 21–9 varying intercept gradient 11 gold standard 6–7 interpreting the model 24–6 model criticism 26–7 naive pooled analysis 8, 13–14 notation 7–8 overview 5–6 results 13–29 statistical analysis 7–13 statistics of interest 6–7 pedigree 330, 334, 352 analysis 328 founders 344–6
penetrance 340 parameters 333 probabilities 332–3 periodogram 398 phenocopies 329 phenotypes 328–9, 332–3, 340 phenotypic data 331 piecewise survival model 87 pneumotachograph spirometer 6–7 point event location 427 point spread function 398 points system 452–3, 474–5 algorithm for 461–3 Poisson point process 431 polygenic model 330, 335 polygenic value 330 polynomial trends over time 180–2 population parameters 331 PORT study data analysis 49–59 description of physician and patient samples 50 distribution of patient satisfaction scores 51 intercept and slopes as outcomes model 52–5 HLM/2L 53 OLS estimates of intercepts and slopes 51 relationship between OLS intercepts and best linear unbiased predictions 59 relationship between OLS intercepts and slopes 52 results 49–62 portable peak ow meters 5–29 posterior distribution 260, 264 posterior expected relative risk estimates 440 precision in search process 255 primary sampling units (PSUs) 37 prior distribution 439 priors 90 prole likelihood 224–5 for a mean 234–5 prole likelihood function 297–8 publication bias 254, 267–8 pulmonary function development in children 133 model selection and checking 149–54 qualitative trait 346–8 quantitative trait 348–50 random coecients 71, 100–1, 136 estimates of 57 modelling 112 regression 121 regression model 55, 96 HLM/2L 56 SAS Proc Mixed 57 random eects 5, 8, 10–11, 136, 138, 141, 159, 432–5
analysis of covariance model (ANCOVA) with 56–9 estimating techniques 44 hypothesis tests 45 models 200, 208, 254, 260, 262–6, 296 multiple outcomes 320 random parameters 71 raw residual 12 raw total residual 12 recall in search process 255 recombination 340–1 recombination fraction 341, 345 red blood cells 329 referent risk factor prole 455, 462, 465–6 regression for dierence measure 307–8 Fourier transform in 396–7 on allocation 310 on latitude 308, 310 on year 310 regression analysis, of correlated data 127 regression coecients 9, 71, 310 interpreting 29–30 regression inferences 10 regression parameters 136 relative risk 228–9 conditional likelihood for 235–9 estimation 428 likelihood function for 236 of disease 423 of disease to siblings and ospring 347 versus hazard rates 229 repeated measures covariance structures for 165–7 linear mixed model for 163–5 repeated measures data 75 repeated measures data analysis 159–85 covariance structure 183 primary distinguishing features 183 residuals analysis of 437–9 spatial distribution 442 summary functions of 437–9 respiratory cancer 441 counts 425 SMR thematic map 429 restricted maximum likelihood (REML) estimation 43–4, 73, 77–9, 137, 142, 161, 188, 264, 276–7, 279, 298 variance, covariance and correlation estimates 172 right hemisphere 386–7 risk estimates 452, 474–5 in coronary heart disease 447 multivariable models 474–5 points system 474–5
risk factors cardiovascular 230, 244 categories and reference values 454–5, 461–2, 465 in coronary heart disease 447–8 RNA, overview 362 robust standard error 13, 17, 21 sampling bias 335 sampling design and strategy, in genetic epidemiology 335–6 sampling distribution, covariance matrix of 184 SAS 5, 17, 74, 320, 376–8 code and output 123–4 code for mixed models 156–7 GENMOD 206 output for random-eects stroke meta-analysis 283 Proc GLM, comparison of estimates 62–6 Proc IML 318 Proc Mixed 27, 35–6, 39, 48–9, 143, 160, 163, 165, 167, 183, 188, 206–8, 268–72, 290, 292, 298, 301, 311, 314–15, 318–20 comparison of estimates 62–6 covariance components 53–5 estimation 39–45 xed eects 53–5 hierarchical linear modelling specications 41 hierarchical linear models 61–2 hypothesis testing 39–45 inputting data 48 intercept and slopes as outcomes model 54 output 48–9 program specications 48 random coecients regression model 57 Proc NLMIXED 184 prod mds 369 schizophrenia clinical trial 130–3, 144–8 covariance matrix 147 design considerations 144–5 estimation and eect of missingness 146–8 model selection and testing 145–6 regression coecients and standard errors for three covariance moels 148 schizophrenia clinical trial, case study 139 Schwarz’s Bayesian criterion (SBC) 174–5 SCIL-Image 412 secondary sampling units (SSUs) 37 seemingly unrelated regression 96 segregation analysis 340 segregation variance 330 sensitivity analysis 265–6, 277, 280 serum oestrogen levels 204 urine oestradiol levels 202 serum oestrogen data 200–1 serum oestrogen levels 192, 194
analysis results 205 sensitivity analysis 204 serum oestrogen observations 198 sex chromosomes 340 shrinkage 59, 138 shrinkage estimates 270–2 shrinkage factors 12 signal-to-noise ratio 400, 402 simple linear regression model 9 extension 3–33 simulation strategy 61–2 Six Cities Study of Air Pollution and Health 133 smoking eect on urinary oestradiol levels 201 smoking status in CHD risk estimation 449 software codes and their outputs 122–4 SP model 206–8 spatial covariance models 198–9 specic heterogeneity 433 S-plus 5, 9, 16, 74, 268, 320, 376–7, 412 code for xed-eects lidocaine meta-analysis 282–3 code for xed-eects stroke meta-analysis 283 code for likelihood functions 236–43 S-plus function 228 spotted cDNA array 363 spotted cDNA arrays 376–7 SPSS 320 standard regression model 12 standardized mean dierence 259 standardized mortality/morbidity ratios (SMRs) 439, 442 and standardization 427–9 STATA 74, 320 statistical evidence likelihood methods for measuring 209–45 strength of 211–12 statistical inference 13 statistical software 46–9, 102–3, 320 for maximum likelihood analysis of general linear mixed model 143–4 see also specic packages stroke specialist care 252 see also multidisciplinary care for stroke inpatients structural equation approach 306 subject level covariate 134 summary statistics 4–5, 250–1 support intervals 215–16 survival models 86–8 syntenic chromosomes 340 systolic blood pressure 449 t-distribution 296 Tecumseh Mortality Follow-up Study (TMFS) 106 tepee function 222–3
time trends, modelling 135 Toeplitz structure (TOEP) 167, 172, 178–9 Tolbutamide 225–9, 231, 243–4 transmission/disequilibrium tests (TDTs) 351, 355 transmission probability 340 treatment eect and baseline risk 306 tubal ligation study 195 tuberculosis, BCG vaccine against 293 Tucson Epidemiological Study of Airways Obstructive Disease 128 type I and II errors of hypothesis testing 218–21 Type II Diabetes Patient Outcomes Research Team (PORT) study 35–6 UN@AR(1) model 206, 208 unbalanced repeated measures data analysis 127–58 uncorrelated heterogeneity 433 universal bound 221 University Group Diabetes Program (UGDP) 209–45 analyses within subgroups 232–3 background 225–6 checking for imbalance 229–31 early termination of Tolbutamide arm 226–9 UNIVRES (Directory of Federally Supported Research in Universities) 254 unstructured covariance 164 Index compiled by Georey C. Jones
unstructured structure 167, 172 urinary oestradiol data 200 urinary oestradiol levels 189–91, 193, 195 analysis results 203 sensitivity analysis 202 smoking eect on 201 vaccination eect 300, 304–5, 312–13 variance-covariance matrix 197 varying regression intercept 8 Wald test 13, 138 weighted least squares (WLS) 43 regression 309 Welsh–Allen (PneumoCheck) pneumotachograph spirometer 6 white matter 391 WinBUGS 74 Woolf’s formula 294 X chromosome 328–9 X-linked traits 328 Y chromosome 328–9 Z-Image 412 zygote 328