Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136

MISSING DATA

PAUL D. ALLISON
University of Pennsylvania

SAGE PUBLICATIONS
International Educational and Professional Publisher
Thousand Oaks   London   New Delhi
Copyright © 2002 by Sage Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

For information:

Sage Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: order@sagepub.com
Sage Publications Ltd.
6 Bonhill Street
London EC2A 4PU
United Kingdom

Sage Publications India Pvt. Ltd.
M-32 Market
Greater Kailash I
New Delhi 110 048 India

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data
Allison, Paul David.
Missing data / Paul D. Allison.
p. cm. (Sage university papers series. Quantitative applications in the social sciences; no. 07-136)
Includes bibliographical references and index.
ISBN 0-7619-1672-5 (pbk.)
1. Mathematical statistics. 2. Missing observations (Statistics) I. Title. II. Series: Sage university papers series. Quantitative applications in the social sciences; no. 07-136.
QA276 .A55 2001
001.4'22 dc21     2001001295
This book is printed on acid-free paper.

05 06 07 10 9 8 7 6
Acquiring Editor:      C. Deborah Laughton
Editorial Assistant:   Eileen Carr
Production Editor:     Denise Santoyo
Production Assistant:  Kathryn Journey
Typesetter:            Technical Typesetting Inc.
When citing a university paper, please use the proper form. Remember to cite the Sage University Paper series title and include paper number. One of the following formats can be adapted (depending on the style manual used):

(1) ALLISON, P. D. (2001) Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks, CA: Sage.

OR

(2) Allison, P. D. (2001). Missing Data (Sage University Papers Series on Quantitative Applications in the Social Sciences, series no. 07-136). Thousand Oaks, CA: Sage.
CONTENTS

Series Editor's Introduction

1. Introduction

2. Assumptions
   Missing Completely at Random
   Missing at Random
   Ignorable
   Nonignorable

3. Conventional Methods
   Listwise Deletion
   Pairwise Deletion
   Dummy Variable Adjustment
   Imputation
   Summary

4. Maximum Likelihood
   Review of Maximum Likelihood
   ML With Missing Data
   Contingency Table Data
   Linear Models With Normally Distributed Data
   The EM Algorithm
   EM Example
   Direct ML
   Direct ML Example
   Conclusion

5. Multiple Imputation: Basics
   Single Random Imputation
   Multiple Random Imputation
   Allowing for Random Variation in the Parameter Estimates
   Multiple Imputation Under the Multivariate Normal Model
   Data Augmentation for the Multivariate Normal Model
   Convergence in Data Augmentation
   Sequential Versus Parallel Chains of Data Augmentation
   Using the Normal Model for Nonnormal or Categorical Data
   Exploratory Analysis
   MI Example 1

6. Multiple Imputation: Complications
   Interactions and Nonlinearities in MI
   Compatibility of the Imputation Model and the Analysis Model
   Role of the Dependent Variable in Imputation
   Using Additional Variables in the Imputation Process
   Other Parametric Approaches to Multiple Imputation
   Nonparametric and Partially Parametric Methods
   Sequential Generalized Regression Models
   Linear Hypothesis Tests and Likelihood Ratio Tests
   MI Example 2
   MI for Longitudinal and Other Clustered Data
   MI Example 3

7. Nonignorable Missing Data
   Two Classes of Models
   Heckman's Model for Sample Selection Bias
   ML Estimation With Pattern-Mixture Models
   Multiple Imputation With Pattern-Mixture Models

8. Summary and Conclusion

Notes

References

About the Author
SERIES EDITOR'S INTRODUCTION
Problems of missing data are pervasive in empirical social science research. The statistical results reported with most nonexperimental studies rest on sample sizes smaller, sometimes much smaller, than the initial number of selected cases. A relatively few absent observations on a handful of variables can quickly reduce the effective N. With an opinion survey, for example, it is not uncommon for a multivariate analysis to halve the original draw. Suppose Professor Mary Rose, of the Business School, is examining a probability sample of N = 1,000 respondents in a survey of consumer attitudes and behavior. She estimates a reasonably specified multiple regression model of spending, employing the usual computer option of listwise deletion (i.e., any respondent with data lacking on any model variable is excluded). As a result, the actual cases available fall to N = 499.

Serious questions arise. Do these 499 still "represent" the population? Are the coefficients possessing of any desirable properties? Is the sample too small for rejection of null hypotheses? In order to keep sample size, should pairwise deletion have been tried? Or, are there altogether new approaches worth considering? These questions, and others, are addressed in this splendid monograph by Paul Allison.

"Observations are randomly missing." That is the stock argument for going ahead in the face of data attrition, relying on the cases left. But the assumption is vague, and may not be saving. Suppose the observations are "missing completely at random" (labeled MCAR by Allison)? That means that none of the variables, dependent (Y) or independent (X), has missing scores related to the values of the variable itself. For example, with the spending variable above, nonresponse should be no more likely for big spenders than small spenders. Given the same condition held for the other model variables, then the subsample of 499 would represent a scientific draw, permitting valid inferences. In particular, it allows the regression estimates to be unbiased and consistent. This variety of randomness, MCAR, is the problem-free sort researchers may like to claim, but it makes very strong assumptions.
Somewhat more realistic is the assumption that observations are "missing at random" (MAR). Here missing data on a variable, say Y, are held to be random if, after controlling for other variables, the value of Y cannot predict the location of the missing scores. So in the foregoing illustration, occupational status (X) might be correlated with more missing data on spending, with high-status respondents likely to underreport spending. Once X is on the right-hand side, missing observations on Y behave randomly. Under the condition of MAR, the missing data-generating mechanism is ignorable, as Allison puts it. The focus of his monograph is on methods for dealing with improved estimation under the MAR condition, although he does address the difficult circumstance of nonignorable missing data mechanisms.

If the data are MAR, then the quality of estimation rests heavily on the location of the systematic error. Encouragingly, when missing data correlations are confined to the independent variables, then listwise deletion can still yield unbiased estimates. For instance, in the example, missing data on occupational status, X, might be related to missing data on another independent variable, age (Z); e.g., those not reporting age might be older and of higher status. Given that older age does not relate to the reporting of spending, then no bias is expected. Indeed, as Allison so artfully demonstrates, under a range of MAR conditions, the standard listwise deletion option can outperform the traditional missing data correction methods of pairwise deletion, dummy variable adjustment, or mean substitution.
New approaches to handling missing data problems take up the bulk of the monograph. After a review of maximum likelihood (ML) estimation given missing data, he explicates the EM algorithm for imputation, with a carefully selected data example on graduation rates in American colleges. The final chapters go beyond the ML approach, in an explication of multiple imputation (MI) methods, and a discussion of nonignorable missing data. The work presents a tour de force of the latest techniques for handling missing data, a topic poorly developed in almost all stat books. Paul Allison also wisely reminds us that the best solution to the missing data problem "is not to have any." But if you have it and seek a remedy, this is the book to buy.
-Michael S. Lewis-Beck
Series Editor
MISSING DATA

PAUL D. ALLISON
University of Pennsylvania
1. INTRODUCTION

Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data. In a typical data set, information is missing for some variables for some cases. In surveys that ask people to report their income, for example, a sizable fraction of the respondents typically refuse to answer. Outright refusals are only one cause of missing data. In self-administered surveys, people often overlook or forget to answer some of the questions. Even trained interviewers occasionally may neglect to ask some questions. Sometimes respondents say that they just do not know the answer or do not have the information available to them. Sometimes the question is inapplicable to some respondents, such as asking unmarried people to rate the quality of their marriage. In longitudinal studies, people who are interviewed in one wave may die or move away before the next wave. When data are collated from multiple administrative records, some records may have become inadvertently lost. For all these reasons and many others, missing data are a ubiquitous problem in both the social and health sciences. Why is it a problem?
Because nearly all standard statistical methods presume that every case has information on all the variables to be included in the analysis. Indeed, the vast majority of statistical textbooks have nothing whatsoever to say about missing data or how to deal with it. There is one simple solution that everyone knows and that is usually the default for statistical packages: If a case has any missing data for any of the variables in the analysis, then simply exclude that case from the analysis. The result is a data set that has no missing data and can be analyzed by any conventional method. This strategy is commonly known in the social sciences as listwise deletion or casewise deletion, but also goes by the name of complete case analysis.

In addition to its simplicity, listwise deletion has some attractive statistical properties, which will be discussed later. It also has a major disadvantage that is apparent to anyone who has used it: In many applications, listwise deletion can exclude a large fraction of the original sample. For example, suppose you have collected data on a sample of 1,000 people and you want to estimate a multiple regression model with 20 variables. Each of the variables has missing data on 5% of the cases, and the chance that data are missing for one variable is independent of the chance that it is missing on any other variable. You could then expect to have complete data for only about 360 of the cases, discarding the other 640. If you merely downloaded the data from a web site, you might not feel too bad about this, although you might wish you had a few more cases. On the other hand, if you had spent $200 per interview for each of the 1,000 people, you might have serious regrets about the $130,000 that was wasted (at least for this analysis). Surely there must be some way to salvage something from the 640 incomplete cases, many of which may lack data on only one of the 20 variables. Many alternative methods have been proposed, and several of them will be reviewed in this book. Unfortunately, most of these methods have little value, and many of them are inferior to listwise deletion.

That's the bad news. The good news is that statisticians have developed two novel approaches to handling missing data, maximum likelihood and multiple imputation, that offer substantial improvements over listwise deletion. Although the theory behind these methods has been known for at least a decade, it is only in the last few years that they have become computationally practical. Even now, multiple imputation or maximum likelihood can demand a substantial investment of time and energy, both in learning the methods and in carrying them out on a routine basis. But, if you want to do things right, you usually have to pay a price.

Both maximum likelihood and multiple imputation have statistical properties that are about as good as we can reasonably hope to achieve. Nevertheless, it is essential to keep in mind that these methods, like all the others, depend on certain easily violated assumptions for their validity. Not only that, there is no way to test whether or not the most crucial assumptions are satisfied. The upshot is that although some missing data methods are clearly better than others, none of them really can be described as good. The only really good solution to the missing data problem is not to have any. So in the design and execution of research projects, it is essential to put great effort into minimizing the occurrence of missing data. Statistical adjustments can never make up for sloppy research.
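As a check on the complete-case arithmetic in the example above, here is a minimal sketch in Python; the sample size, number of variables, and missingness rate are the hypothetical values from the text.

```python
# Expected number of complete cases when each of k variables is
# independently missing with probability p (values from the text).
n, k, p = 1000, 20, 0.05

p_complete = (1 - p) ** k          # P(a case is complete) = 0.95^20
expected_complete = n * p_complete

print(round(p_complete, 3))        # ~0.358
print(round(expected_complete))    # ~358, i.e., "about 360" of 1,000
```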
2. ASSUMPTIONS

Researchers often try to make the case that people who have missing values on a particular variable are no different from those with observed measurements. It is common, for example, to present evidence that people who do and do not report their income are not significantly different on a variety of other variables. More generally, researchers have often claimed or assumed that their data are "missing at random" without a clear understanding of what that means. Even statisticians were once vague or equivocal about this notion. However, Rubin (1976) put things on a solid foundation by rigorously defining different assumptions that might plausibly be made about missing data mechanisms. Although his definitions are rather technical, I will try to convey an informal understanding of what they mean.
Missing Completely at Random

Suppose there are missing data on a particular variable Y. The data on Y are said to be missing completely at random (MCAR) if the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variables in the data set. When this assumption is satisfied for all variables, the set of individuals with complete data can be regarded as a simple random subsample from the original set of observations. Note that MCAR does allow for the possibility that "missingness" on Y is related to "missingness" on some other variable X. For example, even if people who refuse to report their age invariably refuse to report their income, it is still possible that the data could be missing completely at random.

The MCAR assumption would be violated if people who did not report their income were younger, on average, than people who did report their income. It would be easy to test this implication by dividing the sample into those who did and did not report their income, and then testing for a difference in mean age. If there are, in fact, no systematic differences on the fully observed variables between those with data present and those with missing data, then the data are said to be observed at random.
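As an illustration of that kind of check, here is a minimal sketch comparing mean age between income reporters and nonreporters; the data and variable names are simulated stand-ins, not the book's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data: age is fully observed; income is missing for some cases.
age = rng.normal(45, 12, size=1000)
income_missing = rng.random(1000) < 0.25   # indicator of missingness

# Compare mean age for those with and without reported income.
t, p = stats.ttest_ind(age[income_missing], age[~income_missing])
print(f"t = {t:.2f}, p = {p:.3f}")  # a small p suggests the data are not MCAR
```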
On the other hand, just because the data pass this test does not mean that the MCAR assumption is satisfied. Still, there must be no relationship between missingness on a particular variable and the values of that variable.

Although MCAR is a rather strong assumption, there are times when it is reasonable, especially when data are missing as part of the research design. Such designs are often attractive when a particular variable is very expensive to measure. The strategy then is to measure the expensive variable only for a random subset of the larger sample, implying that data are missing completely at random for the remainder of the sample.
Missing at Random
A considerably weaker assumption is that the data are missing at random (MAR). Data on Y are said to be missing at random if the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables X and Y, where X always is observed and Y sometimes is missing. MAR means that

$$\Pr(Y \text{ missing} \mid Y, X) = \Pr(Y \text{ missing} \mid X).$$

In words, this expression means that the conditional probability of missing data on Y, given both Y and X, is equal to the probability of missing data on Y given X alone. For example, the MAR assumption would be satisfied if the probability of missing data on income depended on a person's marital status, but within each marital status category, the probability of missing income was unrelated to income. In general, data are not missing at random if those individuals with missing data on a particular variable tend to have lower (or higher) values on that variable than those with data present, controlling for other observed variables.

It is impossible to test whether the MAR condition is satisfied, and the reason should be intuitively clear. Because we do not know the values of the missing data, we cannot compare the values of those with and without missing data to see if they differ systematically on that variable.
Ignorable
The missing data mechanism is said to be ignorable if (a) the data are MAR and (b) the parameters that govern the missing data process are unrelated to the parameters to be estimated. Ignorability basically means that there is no need to model the missing data mechanism as part of the estimation process. However, special techniques certainly are needed to utilize the data in an efficient manner. Because it is hard to imagine real-world applications where condition (b) is not satisfied, I treat MAR and ignorability as equivalent conditions in this book. Even in the rare situation where condition (b) is not satisfied, methods that assume ignorability work just fine, but you could do even better by modeling the missing data mechanism.
Nonignorable
If the data are not MAR, we say that the missing data mechanism is nonignorable. In that case, usually the missing data mechanism must be modeled to get good estimates of the parameters of interest. One widely used method for nonignorable missing data is Heckman's (1976) two-stage estimator for regression models with selection bias on the dependent variable. Unfortunately, for effective estimation with nonignorable missing data, very good prior knowledge about the nature of the missing data process usually is needed, because the data contain no information about what models would be appropriate and the results typically will be very sensitive to the choice of model. For these reasons and because models for nonignorable missing data typically must be quite specialized for each application, this book puts the major emphasis on methods for ignorable missing data. In the last chapter, I briefly survey some approaches to handling nonignorable missing data. In Chapter 3, we will see that listwise deletion has some very attractive properties with respect to certain kinds of nonignorable missing data.
3. CONVENTIONAL METHODS

Although many different methods have been proposed for handling missing data, only a few have gained widespread popularity. Unfortunately, none of the widely used methods is clearly superior to listwise deletion. In this section, I briefly review some of these methods, starting with the simplest. In evaluating these methods, I will be particularly concerned with their performance in regression analysis (including logistic regression and Cox regression), but many of the comments also apply to other types of analysis as well.
Listwise Deletion
As already noted, listwise deletion is accomplished by deleting from the sample any observations that have missing data on any variables in the model of interest and then applying conventional methods of analysis for complete data sets. There are two obvious advantages to listwise deletion: (1) it can be used for any kind of statistical analysis, from structural equation modeling to log-linear analysis; (2) no special computational methods are required. Depending on the missing data mechanism, listwise deletion also can have some attractive statistical properties. Specifically, if the data are MCAR, then the reduced sample will be a random subsample of the original sample. This implies that, for any parameter of interest, if the estimates would be unbiased for the full data set (with no missing data), they also will be unbiased for the listwise deleted data set. Furthermore, the standard errors and test statistics obtained with the listwise deleted data set will be just as appropriate as they would have been in the full data set. Of course, the standard errors generally will be larger in the listwise deleted data set because less information is utilized. They also will tend to be larger than standard errors obtained from the optimal methods described later in this book, but at least you do not have to worry about making inferential errors because of the missing data, a big problem with most of the other commonly used methods.

On the other hand, if the data are not MCAR, but only MAR, listwise deletion can yield biased estimates. For example, if the probability of missing data on schooling depends on occupational status, regression of occupational status on schooling will produce a biased estimate of the regression coefficient. So, in general, it appears that listwise deletion is not robust to violations of the MCAR assumption. Surprisingly, however, listwise deletion is the method that is most robust to violations of MAR among independent variables in a regression analysis. Specifically, if the probability of missing data on any of the independent variables does not depend on the values of the dependent variable, then regression estimates using listwise deletion will be unbiased (if all the usual assumptions of the regression model are satisfied).

For example, suppose that we want to estimate a regression model to predict annual savings. One of the independent variables is income, for which 40% of the data are missing. Suppose further that the probability of missing data on income is highly dependent on both income and years of schooling, another independent variable in the model. As long as the probability of missing income does not depend on savings, the regression estimates will be unbiased (Little, 1992).

Why is this the case? Here is the essential idea. It is well-known that disproportionate stratified sampling on the independent variables in a regression model does not bias coefficient estimates. A missing data mechanism that depends only on the values of the independent variables is essentially equivalent to stratified sampling, that is, cases are being selected into the sample with a probability that depends on the values of those variables. This conclusion applies not only to linear regression models, but also to logistic regression, Cox regression, Poisson regression, and so on.

In fact, for logistic regression, listwise deletion gives valid inferences under even broader conditions. If the probability of missing data on any variable depends on the value of the dependent variable but does not depend on any of the independent variables, then logistic regression with listwise deletion yields consistent estimates of the slope coefficients and their standard errors (Vach, 1994). The intercept estimate will be biased, however. Logistic regression with listwise deletion is problematic only when the probability of any missing data depends on both the dependent and independent variables.2

To sum up, listwise deletion is not a bad method for handling missing data. Although it does not use all of the available information, at least it gives valid inferences when the data are MCAR. As we will see, that is more than can be said for nearly all the other commonplace methods for handling missing data. The methods of maximum likelihood and multiple imputation, discussed in later chapters, are potentially much better than listwise deletion in many situations, but for regression analysis, listwise deletion is even more robust than these sophisticated methods to violations of the MAR assumption. Specifically, whenever the probability of missing data on a particular independent variable depends on the value of that variable (and not the dependent variable), listwise deletion may do better than maximum likelihood or multiple imputation.
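To make this robustness property concrete, here is a minimal simulation sketch (hypothetical data and parameters, not from the book): missingness on the predictor depends on the predictor itself but not on the outcome, and the listwise-deletion slope stays close to the true value of 1.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# True model: y = 1.0 * x + noise.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Missingness on x depends on x itself (high x more likely missing),
# but not on y, so listwise deletion should stay (nearly) unbiased.
p_miss = 1 / (1 + np.exp(-x))          # logistic in x
observed = rng.random(n) > p_miss

slope = np.polyfit(x[observed], y[observed], 1)[0]
print(f"complete cases: {observed.sum()}, slope ~ {slope:.3f}")  # close to 1.0
```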
There is one important caveat to these claims about listwise deletion for regression analysis. The regression coefficients are assumed to be the same for all cases in the sample. If the regression coefficients vary across subsets of the population, then any nonrandom restriction of the sample (e.g., through listwise deletion) may weight the regression coefficients toward one subset or another. Of course, if such variation in the regression parameters is suspected, either separate regressions should be done in different subsamples or appropriate interactions should be included in the regression model (Winship & Radbill, 1994).
Pairwise Deletion
Also known as available case analysis, pairwise deletion is a simple alternative that can be used for many linear models, including linear regression, factor analysis, and more complex structural equation models. It is well known, for example, that a linear regression can be estimated using only the sample means and covariance matrix or, equivalently, the means, standard deviations, and correlation matrix. The idea of pairwise deletion is to compute each of these summary statistics using all the cases that are available. For example, to compute the covariance between two variables X and Z, all the cases that have data present for both X and Z are used. Once the summary measures have been computed, they can be used to calculate the parameters of interest, for example, regression coefficients.

There are ambiguities in how to implement this principle. When computing a covariance that requires the mean for each variable, do you compute the means using only cases with data on both variables or do you compute them from all the available cases on each variable? There is no point to dwelling on such questions because all the variations lead to estimators with similar properties. The general conclusion is that if the data are MCAR, pairwise deletion produces parameter estimates that are consistent (and, therefore, approximately unbiased in large samples). On the other hand, if the data are only MAR, but not observed at random, the estimates may be seriously biased.

If the data are indeed MCAR, pairwise deletion might be expected to be more efficient than listwise deletion because more information is utilized. By more efficient, I mean that the pairwise estimates would have less sampling variability (smaller true standard errors) than the listwise estimates. This is not always true, however. Both analytical and simulation studies of linear regression models indicate that pairwise deletion produces more efficient estimates when the correlations among the variables are generally low, whereas listwise deletion does better when the correlations are high (Glasser, 1964; Haitovsky, 1968; Kim & Curry, 1977).

The big problem with pairwise deletion is that the estimated standard errors and test statistics produced by conventional software are biased. Symptomatic of that problem is that when you input a covariance matrix to a regression program, you must also specify the sample size to calculate standard errors. Some programs for pairwise deletion use the number of cases on the variable with the most missing data, whereas others use the minimum of the number of cases used to compute each covariance. No single number is satisfactory, however. In principle, it is possible to get consistent estimates of the standard errors, but the formulas are complex and have not been implemented in any commercial software.3

A second problem that occasionally arises with pairwise deletion, especially in small samples, is that the constructed covariance or correlation matrix may not be "positive definite," which implies that the regression computations cannot be carried out at all. Because of these difficulties, as well as its relative sensitivity to departures from MCAR, pairwise deletion cannot be generally recommended as an alternative to listwise deletion.
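As an illustration of the pairwise principle (my own sketch, not code from the book), the following computes each covariance from exactly the cases available for that pair, using pair-specific means, one of the variants just described.

```python
import numpy as np

def pairwise_cov(data):
    """Covariance matrix using, for each pair of variables, every case
    that has both values present (NaN marks missing data)."""
    k = data.shape[1]
    cov = np.full((k, k), np.nan)
    for i in range(k):
        for j in range(i, k):
            ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            if ok.sum() > 1:
                # np.cov on the jointly observed cases for this pair
                cov[i, j] = cov[j, i] = np.cov(data[ok, i], data[ok, j])[0, 1]
    return cov

# Tiny example with one missing value
d = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
print(pairwise_cov(d))
```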
Dummy Variable Adjustment

There is another method for missing predictors in a regression analysis that is remarkably simple and intuitively appealing (Cohen & Cohen, 1985). Suppose that some data are missing on a variable X, which is one of several independent variables in a regression analysis. We create a dummy variable D that is equal to 1 if data are missing on X and equal to 0 otherwise. We also create a variable X* such that

$$X^* = \begin{cases} X & \text{when data are not missing,} \\ c & \text{when data are missing,} \end{cases}$$

where c can be any constant. We then regress the dependent variable Y on X*, D, and any other variables in the intended model. This technique, known as dummy variable adjustment or the missing indicator method, can be extended easily to the case of more than one independent variable with missing data.

The apparent virtue of the dummy variable adjustment method is that it uses all the information that is available about the missing data. Substitution of the value c for the missing data is not properly regarded as imputation because the coefficient of X* is invariant to the choice of c. Indeed, the only aspect of the model that depends on the choice of c is the coefficient of D, the missing value indicator. For ease of interpretation, a convenient choice of c is the mean of X for nonmissing cases. Then the coefficient of D can be interpreted as the predicted value of Y for individuals with missing data on X minus the predicted value of Y for individuals at the mean of X, controlling for other variables in the model. The coefficient for X* can be regarded as an estimate of the effect of X among the subgroup of those that have data on X.

Unfortunately, this method generally produces biased estimates of the coefficients, as proven by Jones (1996).4 A simple simulation illustrates the problem. I generated 10,000 cases on three variables, X, Y, and Z, by sampling from a trivariate normal distribution. For the regression of Y on X and Z, the true coefficients for each variable were 1.0. For the full sample of 10,000, the least-squares regression coefficients, shown in the first column of Table 3.1, are, not surprisingly, quite close to the true values. I then randomly made some of the Z values missing with a probability of 1/2. Because the probability of missing data is unrelated to any other variable, the data are MCAR. The second column in Table 3.1 shows that listwise deletion yields estimates that are very close to those obtained when no data are missing. On the other hand, the coefficients for the dummy variable adjustment method are

TABLE 3.1
Regression Coefficients for Three Methods (Simulated Data)

Coefficient of    Full Data    Listwise Deletion    Dummy Variable Adjustment
X                 0.98         0.96                 1.28
Z                 1.01         1.03                 0.87
D                 -            -                    0.02
clearly biased: too high for the X coefficient and too low for the Z coefficient.
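The bias is easy to reproduce. Below is a minimal sketch along the lines of the simulation just described; the correlation structure is my own assumption, so the numbers differ from Table 3.1, but the pattern (X too high, Z too low) appears whenever X and Z are positively correlated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Data with true model y = x + z + e; x and z positively correlated
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)
y = x + z + rng.normal(size=n)

miss = rng.random(n) < 0.5                    # z is MCAR with probability 1/2
d = miss.astype(float)                        # missing-data indicator
z_star = np.where(miss, z[~miss].mean(), z)   # c = mean of observed z

# OLS of y on x, z*, d
X = np.column_stack([np.ones(n), x, z_star, d])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(coefs[1:], 2))  # x inflated above 1, z* attenuated below 1
```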
A closely related method has been proposed for categorical independent variables in regression analysis. Such variables are typically handled by creating a set of dummy variables, one variable for each of the categories except for a reference category. The proposal is simply to create an additional category and an additional dummy variable for those individuals with missing data on the categorical variables. Again, however, we have an intuitively appealing method that is biased even when the data are MCAR (Jones, 1996; Vach & Blettner, 1991).

Imputation

Many missing data methods fall under the general heading of imputation. The basic idea is to substitute some reasonable guess (imputation) for each missing value and then proceed to do the analysis as
if there were no missing data. Of course, there are lots of different ways to impute missing values. Perhaps the simplest is marginal mean imputation: For each missing value on a given variable, substitute the mean for those cases with data present on that variable. This method is well known to produce biased estimates of variances and covariances (Haitovsky, 1968) and generally should be avoided.

A better approach is to use information on other variables by way of multiple regression, a method sometimes known as conditional mean imputation. Suppose we are estimating a multiple regression model with several independent variables. One of those variables, X, has missing data for some of the cases. For those cases with complete data, we regress X on all the other independent variables. Using the estimated equation, we generate predicted values for the cases with missing data on X. These are substituted for the missing data and the analysis proceeds as if there were no missing data.

The method gets more complicated when more than one independent variable has missing data, and there are several variations on the general theme. In general, if imputations are based solely on other independent variables (not the dependent variable) and if the data are MCAR, the least-squares coefficients are consistent, implying that they are approximately unbiased in large samples (Gourieroux & Monfort, 1981). However, they are not fully efficient. Improved estimators can be obtained using weighted least squares (Beale & Little, 1975) or generalized least squares (Gourieroux & Monfort, 1981).

Unfortunately, all of these imputation methods suffer from a fundamental problem: Analyzing imputed data as though it were complete data produces standard errors that are underestimated and test statistics that are overestimated. Conventional analytic methods simply do not adjust for the fact that the imputation process involves uncertainty about the missing values.5 In later chapters, an approach to imputation that overcomes these difficulties is examined.
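To fix ideas, here is a minimal sketch of the conditional mean imputation procedure described above (simulated data and hypothetical variable names; recall that the resulting standard errors would still be too low).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Simulated predictors; x2 has missing values, x1 is complete.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x2_obs = x2.copy()
x2_obs[rng.random(n) < 0.3] = np.nan             # 30% missing, MCAR

# Regress x2 on x1 among complete cases, then predict the missing values.
ok = ~np.isnan(x2_obs)
b1, b0 = np.polyfit(x1[ok], x2_obs[ok], 1)       # slope, intercept
x2_imp = np.where(ok, x2_obs, b0 + b1 * x1)

print(f"imputed {np.sum(~ok)} values; first imputation = {x2_imp[~ok][0]:.2f}")
```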
Summary
All the common methods for salvaging information from cases with missing data typically make things worse: They introduce substantial bias, make the analysis more sensitive to departures from MCAR, or yield standard error estimates that are incorrect (usually too low). In light of these shortcomings, listwise deletion does not look so bad. However, better methods are available. In the next chapter, maximum likelihood methods that are available for many common modeling objectives are examined. In Chapters 5 and 6, multiple imputation, which can be used in almost any setting, is considered. Both methods have very good properties if the data are MAR. In principle, these methods also can be used for nonignorable missing data, but that requires a correct model of the process by which data are missing, something that usually is difficult to come by.
4. MAXIMUM LIKELIHOOD

Maximum likelihood (ML) is a very general approach to statistical estimation that is widely used to handle many otherwise difficult estimation problems. Most readers will be familiar with ML as the preferred method for estimating the logistic regression model. Ordinary least squares linear regression is also an ML method when the error term is assumed to be normally distributed. It turns out that ML is particularly adept at handling missing data problems. In this chapter, I begin by reviewing some general properties of ML estimates. Then I present the basic principles of ML estimation under the assumption that the missing data mechanism is ignorable. These principles are illustrated with a simple contingency table example. The remainder of the chapter considers more complex examples where the goal is to estimate a linear model, based on the multivariate normal distribution.

Review of Maximum Likelihood
The basic principle of ML estimation is to choose as estimates those values that, if true, would maximize the probability of observing what has, in fact, been observed. To accomplish this, we first need a formula that expresses the probability of the data as a function of both the data and the unknown parameters. When observations are independent (the usual assumption), the overall likelihood (probability) for the sample is just the product of all the likelihoods for the individual observations. Suppose we are trying to estimate a parameter θ. If f(y; θ) is the probability (or probability density) of observing a single value of Y given some value of θ, the likelihood for a sample of n observations is

$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta),$$
where $\prod$ is the symbol for repeated multiplication. Of course, we still need to specify exactly what f(y; θ) is. For example, suppose Y is a dichotomous variable coded 1 or 0, and θ is the probability that Y = 1. Then

$$L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}.$$
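As a concrete check (my own sketch, not from the book), maximizing this Bernoulli likelihood numerically recovers the sample proportion, the familiar ML estimate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical 0/1 data

def neg_log_lik(theta):
    # Negative Bernoulli log-likelihood for the sample
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"ML estimate: {res.x:.3f}, sample mean: {y.mean():.3f}")  # both 0.7
```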
Once we have L(θ), which is called the likelihood function, there are a variety of techniques to find the value of θ that makes the likelihood as large as possible. ML estimators have a number of desirable properties. Under a fairly wide range of conditions, they are known to be consistent, asymptotically efficient, and asymptotically normal (Agresti & Finlay, 1997). Consistency implies that the estimates are approximately unbiased in large samples. Efficiency implies that the true standard errors are at least as small as the standard errors for any other consistent
estimators. The asymptotic part means that this statement is only approximately true, and the approximation gets better as the sample size gets larger. Finally, asymptotic normality means that in repeated sampling, the estimates have an approximately normal distribution (again, the approximation improves with increasing sample size). This justifies the use of a normal table to construct confidence intervals or compute p values.

ML With Missing Data

What happens when data are missing for some of the observations? When the missing data mechanism is ignorable (and hence MAR), we can obtain the likelihood simply by summing the usual likelihood over all possible values of the missing data. Suppose, for example, that we attempt to collect data on two variables, X and Y, for a sample of n independent observations. For the first m observations, we observe both X and Y, but for the remaining n - m observations, we are only able to measure Y. For a single observation with complete data, let us represent the likelihood by f(x, y; θ), where θ is a set of unknown parameters that govern the distribution of X and Y. Assuming that X is discrete, the likelihood for a case with missing data on X is just the "marginal" distribution of Y:

$$g(y; \theta) = \sum_{x} f(x, y; \theta).$$

(When X is continuous, the summation is replaced by an integral.) The likelihood for the entire sample is just

$$L(\theta) = \prod_{i=1}^{m} f(x_i, y_i; \theta) \prod_{i=m+1}^{n} g(y_i; \theta).$$
The problem then becomes one of finding values of θ to make this likelihood as large as possible. A variety of methods are available to solve this optimization problem, and a few of them will be considered later.

ML is particularly easy when the pattern of missing data is monotonic. In a monotonic pattern, the variables can be arranged in an order such that, for any observation in the sample, if data are missing on a particular variable, they also must be missing for all variables
that come later in the order. Here is an example with four variables, X1, X2, X3, and X4. There are no missing data on X1. Ten percent of the cases are missing on X2. Those cases that are missing on X2 also have missing data on X3 and X4. An additional 20% of the cases have missing data on both X3 and X4, but not on X2. A monotonic pattern often arises in panel studies, where people drop out at various points in time and never return.

If only one variable has missing data, the pattern is necessarily monotonic. Consider the two-variable case with data missing on X only. The joint distribution f(x, y) can be written as

$$f(x, y) = h(x \mid y)\, g(y),$$

where g(y) is the marginal distribution of Y (previously defined) and h(x | y) is the conditional distribution of X given Y. This enables us to rewrite the likelihood as
$$L(\lambda, \phi) = \prod_{i=1}^{m} h(x_i \mid y_i; \lambda) \prod_{i=1}^{n} g(y_i; \phi).$$

This expression differs from the earlier one in two important ways. First, the second product is over all the observations, not just those with missing data on X. Second, the parameters have been separated into two parts: λ describes the conditional distribution of X given Y and φ describes the marginal distribution of Y. These changes imply that we can maximize the two parts of the likelihood separately, typically using conventional estimation procedures for each part. Thus, if X and Y have a bivariate normal distribution, we can calculate the mean and variance of Y for the entire sample. Then, for those cases with data on X, we can calculate the regression of X on Y. The resulting parameter estimates can be combined to produce ML estimates of any other parameters we might be interested in, for example, the correlation coefficient.
Contingency Table Data

These features of ML estimation can be illustrated very concretely with contingency table data. Suppose for a simple random sample of 200 people, we attempt to measure two dichotomous variables, X and Y, with possible values of 1 and 2. For 150 cases, we observe both X and Y, and obtain the results shown in the following contingency table:

          Y = 1    Y = 2
X = 1     52       21
X = 2     34       43

For the other 50 cases, X is missing and we observe only Y; specifically, we have 19 cases with Y = 1 and 31 cases with Y = 2. In the population, the relationship between X and Y is described by

          Y = 1    Y = 2
X = 1     p11      p12
X = 2     p21      p22

where $p_{ij}$ is the probability that X = i and Y = j. If all we had were the 150 observations with complete data, the likelihood would be

$$L = p_{11}^{52}\, p_{12}^{21}\, p_{21}^{34}\, p_{22}^{43},$$

subject to the constraint that the four probabilities must sum to 1.
The ML estimates of the four probabilities would be the simple proportions in each cell, that is,

$$\hat{p}_{ij} = \frac{n_{ij}}{n},$$

where $n_{ij}$ is the number of cases that fall into cell (i, j). So we would get

$$\hat{p}_{11} = .346, \quad \hat{p}_{12} = .140, \quad \hat{p}_{21} = .227, \quad \hat{p}_{22} = .287.$$
However, this will not do because we have additional observations on Y alone that need to be incorporated into the likelihood. Assuming that the missing data mechanism is ignorable, the likelihood for cases with Y = 1 is just $p_{11} + p_{21}$, the marginal probability that Y = 1. Similarly, for cases with Y = 2, the likelihood is $p_{12} + p_{22}$. Thus, our likelihood for the entire sample is

$$L = p_{11}^{52}\, p_{12}^{21}\, p_{21}^{34}\, p_{22}^{43}\, (p_{11} + p_{21})^{19} (p_{12} + p_{22})^{31}.$$
How can we find values of $p_{ij}$ that maximize this expression? For most applications of ML to missing data problems, there is no explicit solution for the estimates. Instead, iterative methods are necessary. In this case, however, the pattern is necessarily monotonic (because there is only one variable with missing data), so we can estimate the conditional distribution of X given Y and the marginal distribution of Y separately. Then we combine the results to get the four cell probabilities. For the 2 × 2 table, the ML estimates have the general form

$$\hat{p}_{ij} = \hat{P}(X = i \mid Y = j)\, \hat{P}(Y = j).$$

The conditional probabilities on the right-hand side are estimated using only those cases with complete data. They are obtained in the usual way by dividing the cell frequencies in the 2 × 2 table by the column totals. The estimates of the marginal probabilities for Y are obtained by adding the column frequencies to the frequencies of Y for the cases with missing data on X and then dividing by the sample size. Thus, we have
$$\hat{p}_{11} = \frac{52}{86} \cdot \frac{86 + 19}{200} = .3174, \qquad \hat{p}_{21} = \frac{34}{86} \cdot \frac{86 + 19}{200} = .2076,$$

$$\hat{p}_{12} = \frac{21}{64} \cdot \frac{64 + 31}{200} = .1559, \qquad \hat{p}_{22} = \frac{43}{64} \cdot \frac{64 + 31}{200} = .3191.$$
Of course, these estimates are not the same as if we had used only the cases with complete information. On the other hand, the cross-product ratio, a commonly used measure of association for two dichotomous variables, is the same whether it is calculated from the ML estimates or the estimates based on complete cases only. In short, the observations with missing data on X give us no additional information about the cross-product ratio.
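As a quick numerical check (my own sketch, not from the book), the following computes the ML cell estimates and confirms that the cross-product ratio matches the complete-case value.

```python
# Observed 2x2 counts (complete cases) and Y-only counts from the text
n = {(1, 1): 52, (1, 2): 21, (2, 1): 34, (2, 2): 43}
y_only = {1: 19, 2: 31}
col = {j: n[(1, j)] + n[(2, j)] for j in (1, 2)}    # column totals: 86, 64
N = sum(n.values()) + sum(y_only.values())           # 200

# ML estimates: P(X=i | Y=j) from complete cases times P(Y=j) from all cases
p = {(i, j): (n[(i, j)] / col[j]) * ((col[j] + y_only[j]) / N)
     for i in (1, 2) for j in (1, 2)}

print({k: round(v, 4) for k, v in p.items()})        # .3174, .1559, .2076, .3191

# Cross-product ratio is unchanged by the extra Y-only cases
cpr_ml = (p[(1, 1)] * p[(2, 2)]) / (p[(1, 2)] * p[(2, 1)])
cpr_cc = (52 * 43) / (21 * 34)
print(round(cpr_ml, 3), round(cpr_cc, 3))            # both ~3.132
```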
This example was included to illustrate some of the general features of ML estimation with missing data. Few readers will want to work through the hand calculations for their particular applications, however. What is needed is general-purpose software that can handle a variety of data types and missing data patterns. Although ML estimation for the analysis of contingency tables is not computationally difficult (Fuchs, 1982; Schafer, 1997), there is virtually no commercial software to handle this case. Freeware is available on the Web, however:

• Jeroen K. Vermunt's lEM program for Windows (http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html) estimates a wide variety of categorical data models when some data are missing.
• Joseph Schafer's CAT program (http://www.stat.psu.edu/~jls) will estimate hierarchical log-linear models with missing data, but is currently available only as a library for the S-PLUS package.
• David Duffy's LOGLIN program will estimate a variety of log-linear models with missing data (http://www2.qimr.edu.au/davidD).
Linear Models With Normally Distributed Data
ML can be used to estimate a variety of linear models under the assumption that the data come from a multivariate normal distribution. Possible models include ordinary linear regression, factor analysis, simultaneous equations, and structural equations with latent variables. Although the assumption of multivariate normality is a strong one, it is completely innocuous for those variables with no missing data. Furthermore, even when some variables with missing data are known to have distributions that are not normal (e.g., dummy variables), ML estimates under the multivariate normal assumption often have good properties, especially if the data are MCAR.6

There are several approaches to ML estimation for multivariate normal data with an ignorable missing data mechanism. When missing data follow a monotonic pattern, use can be made of the approach described earlier of factoring the likelihood into conditional and marginal distributions that can be estimated by conventional software (Marini, Olsen, & Rubin, 1979). However, this approach is very restricted in terms of potential applications, and it is not easy to get good estimates of standard errors and test statistics.

General missing data patterns can be handled by a method called the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which can produce ML estimates of the means, standard deviations, and correlations (or, equivalently, the means and the covariance matrix). These summary statistics then can be input to standard linear modeling software to get consistent estimates of the parameters of interest. The virtues of the EM method are (1) it is easy to use and (2) there is a lot of software, both commercial and freeware, that will do it. The two disadvantages are (1) standard errors and test statistics reported by the linear modeling software will not be correct, and (2) the estimates will not be fully efficient for overidentified models (those that imply restrictions on the covariance matrix).

A better approach is direct maximization of the multivariate normal likelihood for the assumed linear model. Direct ML (sometimes called raw maximum likelihood) gives efficient estimates with correct standard errors, but requires specialized software that may have a steep learning curve. In the remainder of this chapter, we'll see how to use both the EM algorithm and direct ML.

The EM Algorithm
The EM algorithm is a very general method for obtaining ML estimates when some of the data are missing (Dempster et al., 1977; McLachlan & Krishnan, 1997). It is called EM because it consists of two steps: an expectation step and a maximization step. These two steps are repeated multiple times in an iterative process that eventually converges to the ML estimates.

Instead of explaining the two steps of the EM algorithm in general settings, I'm going to focus on its application to the multivariate normal distribution. Here the E step essentially reduces to regression imputation of the missing values. Suppose our data set contains four variables, X1 through X4, and there are some missing data on each variable, in no particular pattern. We begin by choosing starting values for the unknown parameters, that is, the means and the covariance matrix. These starting values can be obtained by the standard formulas for sample means and covariances, using either listwise deletion or pairwise deletion. Based on the starting values of the parameters, we can compute coefficients for the regression of any one of the Xs on any subset of the other three. For example, suppose that some of the cases have data present for X1 and X2, but not for X3 and X4. We use the starting values of the covariance matrix to get the regression of X3 on X1 and X2 and the regression of X4 on X1 and X2. We then use these regression coefficients to generate imputed values for X3 and X4 based on observed values of X1 and X2. For cases with missing data on only one variable, we use regression imputations based on all three of the other variables.

After all the missing data have been imputed, the M step consists of calculating new values for the means and the covariance matrix, using the imputed data along with the nonmissing data. For means, we just use the usual formula. For variances and covariances, modified formulas must be used for any terms that involve missing data. Specifically, terms must be added that correspond to the residual variances and residual covariances, based on the regression equations used in the imputation process. For example, suppose that for observation i, X3 was imputed using X1 and X2. Then, whenever $(x_{i3})^2$ would have been used in the conventional variance formula, we substitute $(x_{i3})^2 + s^2_{3.21}$, where $s^2_{3.21}$ is the residual variance from regressing X3 on X1 and X2. The addition of the residual terms corrects for the usual underestimation of variances that occurs in more conventional imputation schemes. Suppose X4 is also missing for observation i. Then, when computing the covariance between X3 and X4, wherever $x_{i3}x_{i4}$ would have been used in the conventional covariance formula, we substitute $x_{i3}x_{i4} + s_{34.21}$, where the last term is the residual covariance between X3 and X4, controlling for X1 and X2.

Once we have gotten new estimates for the means and covariance matrix, we start over with the E step. That is, we use the new estimates to produce new regression imputations for the missing values. We keep cycling through the E and M steps until the estimates converge, that is, they hardly change from one iteration to the next.

Note that the EM algorithm avoids one of the difficulties with conventional regression imputation: deciding which variables to use as predictors and coping with the fact that different missing data patterns have different sets of available predictors. Because EM always starts with the full covariance matrix, it is possible to get regression estimates for any set of predictors, no matter how few cases there may be in a particular missing data pattern. Hence, EM always uses all the available variables as predictors for imputing the missing data.
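To make the E and M steps concrete, here is a minimal sketch for a bivariate normal case with missing values on one variable (simulated data; this illustrates the logic just described, not the book's software).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated bivariate normal data; x2 is missing (MCAR) for 40% of cases.
x1 = rng.normal(0, 1, n)
x2 = 1 + 0.8 * x1 + rng.normal(0, 1, n)
miss = rng.random(n) < 0.4

# Starting values from the complete cases
mu = np.array([x1.mean(), x2[~miss].mean()])
cov = np.cov(x1[~miss], x2[~miss])

for _ in range(100):
    # E step: regression imputation of missing x2 from x1, with the
    # residual-variance correction for the second moment
    b = cov[0, 1] / cov[0, 0]
    a = mu[1] - b * mu[0]
    resid = cov[1, 1] - b * cov[0, 1]
    x2_hat = np.where(miss, a + b * x1, x2)
    x2_sq = np.where(miss, x2_hat**2 + resid, x2**2)

    # M step: recompute means and covariances from the expected moments
    # (x1 is fully observed, so E[x1*x2] needs no extra correction term)
    mu = np.array([x1.mean(), x2_hat.mean()])
    c12 = (x1 * x2_hat).mean() - mu[0] * mu[1]
    cov = np.array([[x1.var(), c12],
                    [c12, x2_sq.mean() - mu[1] ** 2]])

print(np.round(mu, 3))   # close to the true means (0, 1)
print(np.round(cov, 3))  # close to the true covariance matrix
```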
I
Data o n 1 ,302 American colleges and universities were reported in
he U. S. News and World Report Guide to America 's Best Col leges 1 994 .
' I h es e data can be found on the Web at http ://lib .stat . cmu .edu/datasets/ '
l '(
t
)lleges� We consider the following variables : GRADRAT
Ratio of graduating seniors to number that enrolled four years ear1 ier ( x 1 00) .
CSAT
Combined average scores on verbal and math secti ons of the
SAT.
I
LENROLL PRIVATE
Natural logarithm of the number of enrolling freshmen .
I
1 = private; O = public.
, , !
STUFAC
Ratio of students to faculty ( x 1 00) .
RMBRD
Total
annual
costs
for
room
II I
,
and
board
(thousands
dollars) . 7 ACT
',
of
Mean ACT scores .
( )ur goal is to est im a t e a linear regression of GRADRAT
on
the next
l i ve variables . Al though ACT will not be in the regre ssion model, i t
i s included in the EM estimation because of its
hi gh
correlation with
C:SAT, a variable that has substantial missing data, which allows us to get better imputations for the missing values. Table 4. 1 gives the number of no n m i ssi ng cases for each
vari
a ble, and the means and standard devi ations for those cases wi th data present. Only one variable,
PRIVATE,
has complete dat a . The
TABLE 4.1
Descriptive Statistics for College Data Based on Available Cases

Variable   Nonmissing Cases   Mean     Standard Deviation
GRADRAT    1,204              60.41    18.89
CSAT       779                967.98   123.58
LENROLL    1,297              6.17     1.00
PRIVATE    1,302              0.64     0.48
STUFAC     1,300              14.89    5.19
RMBRD      783                4.15     1.17
ACT        714                22.12    2.98
TABLE 4.2
Regression That Predicts GRADRAT Using Listwise Deletion

Variable    Coefficient   Standard Error   t Statistic   p Value
INTERCEP    -35.028       7.685            -4.56         0.0001
CSAT        0.067         0.006            10.47         0.0001
LENROLL     2.417         0.959            2.52          0.0121
PRIVATE     13.588        1.946            6.98          0.0001
STUFAC      -0.123        0.132            -0.93         0.3513
RMBRD       2.162         0.714            3.03          0.0026
The dependent variable GRADRAT has missing data on 8% of the colleges. CSAT and RMBRD are each missing 40% and ACT is missing 45% of the cases. Using listwise deletion on all variables except ACT yields a sample of only 455 cases, a clearly unacceptable reduction. Nevertheless, for the purposes of comparison, listwise deletion regression estimates are presented in Table 4.2.

Next we use the EM algorithm to get estimates of the means, standard deviations, and correlations. Among major commercial packages, the EM algorithm is available in BMDP, SPSS, SYSTAT, and SAS. However, with SPSS and SYSTAT, it is cumbersome to save the results for input to other linear modeling routines. For the college data, I used the SAS procedure MI to obtain the results shown in Tables 4.3 and 4.4. Like other EM software, this procedure automates all the steps described in the previous section. Comparison of the means in Table 4.3 with those in Table 4.1 indicates that the biggest differences are found, not surprisingly,
TABLE 4 . 3
Means and Standard Deviations From the EM Algorithm Varia ble
G RA.D RAT
CSAT
LENROLL
Mean
Standard [)eviation
59 .86
1 8.86
957.88
1 2 1 .43
6.17
0. 997 0 . 48
PRIVATE STUFAC
1 4 . 86
5.1 8
ACT
22.22
2.71
H.MBRD
0 . 64
4.07
1.15
I !
,
i
, ,
I
• ,
Correlations
GRADRAT
CSAT
TABLE 4 .4
From
the EM Algorithm
LEN R OLL PRIVATE STUFAC RMBRD ACT
1 . 0 00
GRADRAT CSAT
0.591
LENROLL
- 0 . 027
1 .000 0. 192
0. 398
0. 1 6 1
STUFAC
- 0 .3 1 8
-0.315
RMBRD
0 .478
0 . 479
ACT
0.598
0. 908
PRIVATE
1 .000
0. 61 9 0.267 -0.0 1 6 0 . 1 74
1 .000
-
-
1 . 000
0. 3 6 8 0 . 34 0 0 . 2 24
- 0 . 282
1 .000
- 0 . 293
0.484
1 . 0 00
among the variables with the most missing data: GRADRAT,
CSAT,
RMBRD,
and ACT. However, even
for
these variables, none of the
differences b etween listwise deletion and
Table 4.5
EM
exceed s
2%.
shows regression estimates that use the EM statistics as
coeffici e n ts are not markedly different from those i n Table 4.2, which used listwise d e l e t i on the reported standard errors are much lower, leading to higher t statistics and lower p values. Unfortunately, although the coefficients are true ML estim ates in this case, the standard error estimates are undoubtedly too low b ecause l nput. Although the
,
they assume that there are complete data for all the cases. To get correct standard error estimates, we will use the
di rect
which is described next. 8
ML method,
Direct ML
As we have just seen, most software for the EM algorithm produces
estimates of the
means and an unrestricted correlation
TABLE 4 .5
Regression That Predicts GRADRAT Based
Variable I NTERCEP
( �SAT
I ,ENROLL
the EM Algorithm
Coefficient
Standard Error
t Statistic
- 3 2 . 395
4.3 55
- 7 . 44
0.0 0 4
1 7. 1 5
0. 539
3 .86 1 1 .26
0 .067
STUFAC
2.083 1 2 .9 1 4 -0. 1 8 1
R MBRD
2.404
1 1 RIVATE
on
(or covariance)
1 . 1 47
0 .0 8 4 0 . 40 0
-2. 16
6.01
p
Value
0.0001 0.0001 0.000 1 0.000 1 0. 03 1 2
0.000 1
,, , I
I I I I
II I I
I ',
I I.' , I I ,
\
\ \: 1 J ( " I I I l le s e
P I ( ») ',L I I I I S,
s um m a ry
st atistics are input to other linear mod
t i le resulting standard error e stimates will be b i a se d
,
I I I v dow nward. To do better, we need to maximize the l i ke l ihoo d
I I II IC (
i t ) 1 1 for t he
mo
de l
of interest d i rectly This can be accomplished .
\ v i j i l a ny one of several software packages for est irna ti ng structural e q u a t i on models (SEMs )
with latent
va ri a bl e s
.
W hen the re are only a small number of missing data patterns, lin models can be estimated with any SEM p r ogr am that will ha n
ear
dle m ultiple grou p s (Allison, 1 987; Muthen, Kaplan,
&
Hollis , 1 987),
including LI SREL and EQS . For more general patterns of missing
data, there are c urrent ly four programs that pe rform d irect ML esti
m a tio n
of linear models :
Amos
A commercial program for SEM modeling that is available as a stand-alone package or as a module for SPSS . Information is available at http: //www . smallwaters .co m .
Mplus
A stand-alone commerci al program. Inform a tion is available a t
LINeS
A commercial module for Gauss. Information is available at
http ://www . statmode l . com.
h ttp://www. aptech .com#-3 party. h tml .
Mx
A freeware program available for download at http ://views .vcu. e d u/rnx.
Before proceeding to an example , let us cons id e r
a
bit of the under
lying theory. Let f(x l p , l: ) be the multivariate normal density for an observed vector comp lete
data
x,
mean vector J,L, and covariance
for a sample of i
==
1,
.
.
.
matrix I .
If we
had
, n observations from this
multivariate normal pop u l a t i o n , the likelihood funct i on would be
L(p., I)
n f (Xi IJi, I) .
==
I
Su pp ose we do not have complete data . If data are
variables fo r case i, we now let
Xi
d e letes the missing elements from
that
m i ss
�
on so m e
be a sm aller vector that simp ly
x.
Let
IL i be the subvector of J.L
deletes the corresponding elements that are m i s s i n g
let :1 j be a su bm a tr ix of
i ng
from X i
and
formed by deleting the rows and column
th at correspond to missing values of be comes ,
L(I-!, I)
==
x.
Our l i kel ihoo d
n f( Xi JL i ' Ii ) ' l
function then
25
, \ l though this function looks simple enough, it is considerably more d i fficult to work with than the likelihood function for complete data. Nevertheless, this likelihood function can be m axi m iz ed using co nven j i( )nal approaches to ML estimation. In particular, we can take the I ( 19arithm of the likelihood function, differentiate it with r e s p e c t to l i t e unknown parameters, and set the result equal to O . The resulting t·q uations can be solved by numerical algorithms like the Newton I�aphson method, which produces standard errors as a by-product . It I S also possible to impose a str"ucture on It and I by letting them I )c functions of a smaller set of parameters that correspond to some ; Issumed linear model, for example, the factor model s e t s
�
==
A
A'
+ '\}I ,
A is a matrix of factor loadings, ell is the cov ari a n ce m atrix o f the latent factors, and 'It is the covariance matrix of the error components. The estimation process can produce ML estimates of t hese parameters along with standard error estimates .
\v here
ilirect ML Example
I estimated the college regression model using Amos 3 .6, which has both a graphical user interface and a text interface. The graph ical interface allows the user to specify equations by drawing arrows among the variables. Because I can not show you a real-time demon stration, the equivalent text commands are shown in Figure 4 . 1 . The data were in a free-format text file call ed COLLEGE.DAT, with miss ing values denoted by -9. The $mstructure command tells Amos to estimate means for the specified variables, an essential part of esti mating models with missing data. The $structu re command specifies the equation to be estimated. The parentheses immediately after the equals sign indicate that an intercept is to be estimated. The (1 ) e rror at the end of the line tells Amos to include an error term with a coef ficient of 1 . 0 . The last line, act < > error, allows for a correl ation between ACT and the error term, which is possible because ACT has no direct effect on GRADRAT. Amos automatically allows for co r relations between ACT and the other independent variables in t h e regression equation. Results are shown in Table 4.6. A comparison of these re s u l t s w i 1 h the two-step EM estimates in Table 4.5 shows t hat the coc r l i c i c l l i
,
I ,
'1' � : ' \ I I J I , l c ' : : i z e
:1>111 : I;
I
I ; �; :
i I t t=;
r i pII L
=
-9
=
1302
v ar i ab 1 e s
g c adr at c s at
l enr o l l pr i vat e s t uf ac rmbrd act $ r awdat a $ i n c lud e = c : \ c o l l e ge . dat $m s t ru c t ur e c s at l enr o l l pr i v at e s t uf ac rmbr d act $ s t ru c t ur e gradrat
==
()
+ c s at + l enr o l l + pr i vat e + s t uf ac
+ rmbr d +
(1)
error
a c t < > e rr o r
Figure 4 . 1 . Amos Commands for the Regression Model That Predicts GRAD RAT
estimates are identical, but the Amos standard errors are notice ably larg er
which is j ust what we would expect.
bit smal l er than t hose in Tab le
4. 2
They
are sti ll quite a
th at were obtain ed with listwise
deletio n .
Conclusion
Maximum
likelihood
can
be
an effective and practical method for
at random . In this s i tu a t i o n ML esti nl ates are kn o wn to be opt im al i n large samples . Fo r linear m odels t h a t fan within the g e n e ral class of structural equation models esti
h an dling data that are
m i ssi n g
,
rn atc d by programs like LI S REL, ML esti mates are easily obtained
I
27 TABLE 4.6
I {egressions That Predict GRADRAT Using Direct ML With Amos , ;Iriable
Coefficient
Standard Error
- 3 2 . 3 95
4 .863
0.067 2 . 0 83
0.005 0.595 1 .277 0 09 2 0.548
I I"J ' I 'ERCEPT
t ' SAT I
I ': NROLL 1 ' lOVATE
1 2. 9 1 4
'-; I ' U FAC
-0. 181 2 4 04
I\ M BRD
.
.
t
p
Statistic - 6 .66 1
0.000000 0.000000 0.000467 0.000000 0.049068 0.000012
1 3 . 949 3
.
Value
4 99
] 0. 1 14
- 1 .968 4.386
software packages. Software is also avail ; I hle for M L estimation of log-linear models for categorical data, but j he implementation in th is setting is somewhat less straightforward. ( ) ne li m i t a t io n of the ML approach is that it requires a model for I he j oint distribution of all variables with m i s si n g data. The multivari ; I t c normal model is often convenient for this purpose, but may be u nrealistic for many applications . I )y several widely avail able
5. MULTIPLE IMPUTATION: BASIC S
advance over conventional ap p roache s to m iss in g data, i t has its limitations. As we have seen, ML theory and software are readily available for linear models and l og -li n ea r models, but b eyon d that, either theory or software is ge ne r a l ly l acking. For example, if you want to estimate a Cox proportional h azards model or an o r dered logistic regression model, you will h ave a tough time implementing ML methods for missing d at a Even if your mo d e l can be estimated with ML, you will need to use spec ia 1 ized software that may lack diagnostics or graphical output that you p art icu l ar ly w an t Fortunately, there is an alternative appro ac h multiple imputation that has the same o pti m a l properties as ML, but rem oves some of these limitations. More specifically, mUltiple i mp u ta t i on (MI) , wh e n used co rrec tly , produces estimates that are consistent, asymptotica l ly efficient, and asymptotically normal wh e n the data are MAR. Un l i k e ML, m u l t ip l e imputation can be us e d with virtually any kind of d � ' L ' an d any kind of model, and the a nalys i s can be done with u n ll10d i fled, conventional software. Of course MI has its own drawbacks_ I t A l though ML represents a m ajor
.
-
.
-
28
can be cumbersome to imp l e ment and it is easy to do it the wrong way. Both of these problems can be substantially alleviated by using good software to do the imputations A more fundamental drawback .
is t hat MI pr oduce s different estimates (hopefully, only slightly dif fer e nt) every time you use it. That can lead to awkward sit uat i on s in which different researchers get different numbers from the same data using the same methods.
Single Random Imputation
The reason that MI does not p rod uce
a u nique
set of numbers is
that random variation is deliberately introduced in the i m p utation
process. Without a random component,
deterministic imp utation
methods g e ne rally produc e underestimates of variances for variables with
m issing
data and, sometimes, covariances as well As we saw in .
the p r evi ou s chapter, the EM algorithm for the multivariate normal
model solves that problem by using residual variance and covariance estinlates to correct the conventional formulas. However, a good a l te r n ati ve
is to Inake random draws from the
r esidual
distribution
of each imputed variable and add those random numbers to the imputed values. Then, conventional formulas can be used to calculate variances and covariances. Here is a simp l e example. Suppose we want to estimate the corre lation between X and Y, but data are m i ssing on X for,
s ay ,
50% of
the cases. We can impute values for the missing X s by regressing X on Y for the cases with
co m p lete
data and then us ing the res ultin g
regression equation to generate predi cted values for the cases that are missing on X. I did this for a simulated sample of 10,000 cases, where X
a nd
Y were drawn from a standard, bivariate normal distribution
with a correlation of 0.30. Half of the X valu es were assigned to be missing (completely at random ) . U sing the predicted values from the regression of X on Y to substitute for the missing values, the corre lation between X and Y was estimated to be 0.42. Why the overestimate? The sample correlatio n is just the sam ple covariance of X and Y divided by the produc t of their
s am pl e
standard deviations. The regression i m putation method yields unbi ased estimates of the covariance. M oreover , the standard deviation of Y(with no missing data) was correctly estimated at abou t 1.0. How ever, the s ta n d ard deviation of X (including the i mpute d valu es ) was only 0.74, wh er eas the true sta n dard deviation was 1.0, resulting in
,,
2<) ,II'
( )verestimate
of the
d I e p rob l em is that
t
co rre la t i on An al e r native way to think about .
th e imp uted value
of X for t h e
5,000 cases with t h er e by i n fl a ting the
I l l I ssing data is a perfect linear fu nct i on of Y, I Tc lat ion between the two variables . We can correct this bias by taking random draws from t h e resid I L I I d i s t r ib u t i o n of X and then adding these random numbers to t h e I I I c d i c t ed values of X. In t his example, the residual d istr ib u t i o n of X ( regressed on Y) is n orm al with a mean of 0 and a standard devi a I i ( H1 ( e stim a ted from the listwise deleted least-squares r eg ressi o n ) of I ) , <)525 . For case i, let U i be a random draw fr om a standard normal I i s tr i b u tion and let x i be the predicted value from t h e reg res si o n of .\' on Y. OUf modified i m p u t e d value is t he n X i Xi + O . 9525 U i . I io r all observations in which X is mi s sing , we s ub s t i t u te x i and then t o m p u te the correlation . When I did this fo r t h e s i m u la t ed sam p l e t ) f 10,000 case s, the correlation between X (with modified i m p u t e d values ) and Y was 0.3 16, only a little higher than the true value
I
( )
;
, ,
'
.
'
t
==
'
o f 0.300.
i
, ,
I
"
Multiple Random Imputation
el i m ina te the biases that are endemic to de terministic i mpu ta t ion but a serious p roblem remains. If we u se i lnputed data (either random or de t e rm inis tic) as if it were r e a l data, the resultin g stand ard error estimates ge nerally will be too low and t est statistics will be too high . Conventional methods for standard error estimation cannot account a de qu ately for th e fact that the data are imp u te d The solution, at l ea st with rand om im pu tatio n , is to repeat the i mp ut at i on process more than onc e pr o d uc i ng multiple co m p let ed data sets. Because of the random component, the estimates of t h e parameters of in teres t will be sli gh tly different for e a ch imputed data set. This var iab il ity acr o s s i m puta tion s can be used to adju st the stan dard errors u pward . For the simulated s amp le of 10,000 cases, I rep eate d t h e ran dom i m p u tat io n process eigh t times, yielding th e estimates in Table 5. 1 . Although these estimates ar e app r ox im ate ly unbiased, the stan dard errors are downwardly biased because they do not take t h e i m p u tat io n9 into account . We can co mb i n e the e ig ht corre lation esti mates into a si ngl e estimate simply by taking their mean, which is 0.3125 . An improved estimate of the stan d a rd error takes three st e ps Random i m p u t at i o n can ,
.
,
"
,I
!' 11 j' j:
I
"
,, • ,
;
.
I
,I
, I
, ,,
I
;
I
TABLE 5 . 1
I
Correlations and Standard E r ro r s
for Randomly
Imputed Data
COlrelation
0 00 900 0.00903
0.3 1 59 0.3 1 08
.
0.3 135 0.3210
0. 00902
0.00897
0 . 00903
0.3 1 1 8 0.3022 0 . 3 1 89 0.3059
0.00909
0.00898 0 .0 090 6
1 . S quare the estimated standard errors results across the eight replicatio n s .
2 . Calculate the variance of the catio n s .
(to
correlation
ge t variances) and average the estimates across the eight repli
3 . Add the results of steps 1 a n d 2 togethe r (applying a small correction factor
To let
to
the variance in step 2) and take
the
s quare root.
put this i nt o one formul a, let M be the number of
rk
be the correlation in replication k , and l et
standard e r r or of
error in replication k . Then r ( th e m ean of t h e correlation
S . E . ( r)
==
1
Th i s formul a can i m p u ta ti o n , wi th
�
estimates) is 1
M- l be
rk
used
for any
interest (Rubin, 1 987) . App lying this formul a to th an the
mean of
L(rk r)2 .
[5. 1]
-
k
p arameter estimated by multiple
d enoting the k th estimate
ple, we get a standard error
be the estimated
the esti mate of the standard
1
J
Sk
replications,
of 0.0 1 123 ,
of the
the
param eter
of
corre l ation exam
wh i ch is about
24 %
the standard errors in the eight sam ples .
h ig h er
Al l owing for Random Variation in the Parameter Esti mates
is
Alth ough the method I j ust de scribed for imputing m i ssing values
pretty
good, it is
not ideal.
To g en e r ate the imputations
for
X,
I
31
egressed X on Y fo r the cases with complete I e g r e ss i o n equation I
data
to produce
cases with missing data on X, the imputed values were l a ted as 1 1'or
Uj
the
c al c
a random draw from a standard normal distribution and ,\'x .y is the estimated standard deviation of t h e error term (the root Inean squared error). For the simulated data set, we had a -0 .00 15, h 0.3 101, and sx.y 0.9525 . These values were used to produce i rnputations for each of the eight completed data sets . The problem with this approach is that it treats a , b, and sx .y as though t h ey were the true parameters, not sample estimates. Obvi o us ly, we cannot know what the true values are, b u t for "proper " m ul t i p l e imputations (Rubin, 1 987), each imputed data set should be based on a different set of values of a , b, and sx .y . These values should be random draws from the Bayesian posterior distrib ution of t h e parameters. Only in this way can multiple imputation con1pletely e mb ody our uncertainty with regard to the unknown p ar a me te rs This claim naturally raises several questions. What is the Bayesian posterior distribution of the parameters? How do we get random draws from the posterior distribution? Do we re ally n ee d this addi t i on al complication? The first question really requires another book and, fortunately, there is a good o ne in th e Quantitative Applications in the Social Sciences series (Iversen, 1985). As for the second ques tion, there are several different ap p ro a c he s to getting random dravJs from the posterior distribution, some of them embodied in easy-to use software. Later in this chapter, when we consider MI under the m ult ivari at e normal model, I will explain one method c a l l e d data aug mentation (S c hafer 1997) . Can you get by wi th out making random draws from the p o st e rior distribution of the parameters? It i s important to answer to this ques tion, because some random imp u ta ti on software l i ke the missing data module in SPSS does not randomly draw the parameter val u e s In many cases, I think the answer is ye s If the sample is large and the p rop o rt i o n of cases with m i s sin g data is small, MI without where
is
u
==
==
==
.
,
.
.
I I
I
TABLE 5 . 2 Co rrelations and Standard Errors fo r Ran d om ly
I m p u te d
D ata Using the Data Auglnent ation Cone Iat ion
this extra step
Method S. E.
0.30636
0 . 00906 1 4
0.3 ] 3 1 6
0 . 0090 1 9 3
0 . 3 1 837
0 . 00 8 9864
0 .3 1 1 42
0.00 90302
0 . 3 2086
0 . 0 08 9705
0 . 2 9 760
0.00 9 1 1 43
0.3 270 1
0 . 0089306
0 . 3 0826
0 . 0090498
typ ic a l ly
will yield results th at are very close to those
ob t ained with i t . On the other h and, if the s a mpl e is sm all or if the
p r o porti o n of cases with missing data is l a r ge , the add ition al vari ation can m ake a noticeable difference .
Continui n g our correlation ex a mp l e , I imputed eight new data sets using th e data augment ation method to generate random draws from
the
po st e rior distribution of the p aram eters . "Tab le
5.2 gives the corre
l ation betwee n X and Y, an d its s t andard error for e ach data set. The
mean of the correl ation e stimates is 0 . 3 1 28 8 . Using formul a e sti m ated standard e r r o r is
obtain e d with the cruder
0.01 329,
imputation
sli gh tl y large r t h a n the
5.1, the 0.01 1 23
method . In general, the stan
dard e rr o r s w i l l be somew h at larger when the p arameters used in the
imputation are drawn r a n domly
.
Multiple Imputation Under the Multivariate Normal Model
To do
multiple
impu tatio n, y o u need a model to gen erate the impu
t at i o n s Fo r the two-variable exa mpl e j ust co nsi d e re d , I employed a .
.
simpl e regression model with norm ally distributed err o rs Obviously,
lllore compl ic ated situations require more complicate d models. As we saw in Chapter
4,
m ax i m u m likel ihood also r eq u i r e s a model . How
eve r, MI is probab ly l ess sensitive than M L to the choice of model
because the model is used only to impute th e missing d ata, not to e s t i m a t e other parameters .
33 I deally,
the imp u tat io n model would be specially constructed to rep I l'sent the pa rticu l ar features of each data set. In practice, it is more convenient to w or k with "off-the-shelf" m od e ls that are e asy to use ; l l l d p rovi de re a s o nably good approximations for a wide range of data
sets.
The most popular mod e l for MI is the multivariate n or m a l mo d e l \v hich previously was used in Ch apt e r 4 as the b as i s for ML est i m ati o n t ) 1' l i n e a r models with m i ss i ng d ata. The m u l t iv a r i at e normal model i l n p l ies that ,
®
All vari ables have normal distributions
8
Each variable can be represented as a li nea r function of all the oth er
vari ables, together with a Donnal, homoscedastic error term
Although these are s tro ng conditions, in practice the multivariate nor I n al model s e e ms to do a good job of imputation even when some of t h e variables have distributions that are ma n ife s tly not normal ( Schafer, 1997) . It is a completely innocuous assumption for those var i ab les that have no missing d ata. For those variables that do have I n i ss in g data, n o rm ali z ing transformations can greatly improve the quality of the i mput atio n s In essence, MI under the multivariate normal model is a generaliza tion of the method used in the two-variable example of the previous section. For each variable w ith m i ss i ng data, we estimate the lincar regression of that variable on all other variables of i nterest. I d eally the regression parameters are random draws from the Bayesian pos terior d is tr ib ut i o n The estimated r egres s i o n e qu a t i o n s are then used to generate predicted val u e s for the cases wi t h m i ssi n g data. Final ly, to each predicted va l ue we add a random draw from the resi dual normal distribution for that v ari abl e The most complicated part of the imputation p ro cess is ge t t i ng random draws from the posterior distribution of the regression p ar am e t e r s As of this writing, two a lgo r i t h m s that accomplish t h i s have bee n i m ple m e n t ed in r ea dily available software: data a ugn1cn t ati on (Schafer, 1997) and s ampl i n g importance/resampling (SIR; Rubin, 1987). Here are some computer programs that im plement these methods: .
,
.
,
.
.
;
I
-
1 1 1 I · I I 1 { ' I I II I lIon ,
I - -J , P H 1\1
A freeware pa c k a g e developed by Sch afer and described in his
1 997 book. Availabl e in either a stand-alone Windows version
or as an S -PLUS library (http://www .stat .psu.edu/rvjls/) .
SOLAS
A stand-alone commercial p ackage that includes both data
augmentation (version 2 and l ater) a n d a propensity score
method. The l atter method is invalid for many applications (Allison , 2000) (http://\V\VW . statsolusa . com)
PROC MI
A SAS procedure av ai l a bl e in release 8 . 1 and later (http://
www .sas .com) .
i ,
I ,
i , ; , ,
,
Sampling ImportancelResampling AMELIA
A freeware p ackage developed by
King,
Honaker, Jos eph,
Scheve, and Singh ( 1 999) . Available either as a stand-alone
Windows program o r as a module for Gauss h arvard . e du/stats. shtml) .
(http://gking.
A SAS macro written by C . H. Brown and X. Ling ( http : //
SIRNORM
yate s . coph. usf. edu/re search/psmg/web . html) .
Both
a l g ori thm s have s om e theoretical j ustification. Propo n e n ts of
S IR (King, Hon aker, J o se ph ,
&
S ch eve ,
2001)
far less comput er time . However, the relative
c l aim that it req ui r e s s u peri o ri ty of these two
m e thods is far from se ttle d . Because I h ave much more experience '",
i th
d ata a ugm en t a t i o n I will fo c u s on this m ethod in the rem ainder ,
of this chap t e r
.
Data Augmentation for the Multivariate Normal Model
Data augmentation (DA) is a type of Markov chain Monte Carlo
(MCMC) algorithm,
a general method for
finding
t ions that h as become increasingly popular in Bayesian
this sec ti o n I wi ll describ e how ,
model.
Although th e
ations automatically,
it
distribu st at is t i c s In
posterior
.
works for the m u lt iv aria te normal
available software performs most of the oper
it
is helpful to have a general idea of wh at is
going on, espe cially whe n things go wrong. The g e n e r al structure of the iterative algor i t h m is m uch like the
EM algorithm for the multivariate normal model, described in the l ast ch ap t er , except that random draws are made at sub se quently.
Before
beginning DA, it
is
two
points, described
neces s a ry to choose a set of
35 variables for the imputation process. Obviously th i s should include all variables with missing data, as well as other variables in t he model to he estimated. It is also wor thwh i le to in c l ud e additional variables (not i l l the int e nded model) that are h i ghly correlated with the variables I h at h ave missing data or that are associated with the probability tha t I hose variables have missing data. Once the variables are chosen, DA consists of the fol lowing steps. , i, i i ,;, , ,
1 . Choose starting values for the parameters . For the multivariate norm al
model, the parameters are the means and the covariance matrix. Starting
values can be gotten from standard formulas using listwise deletion o r pairwise deletion. Even better are the estimates obtained with the EM algorithm described in the last chapter.
2 . Use th e current values of the means and covari ances to obtain
esti
mates of regression coefficients for equations in which each variahle
with missing data is regressed on all observed variables . This is d one for each p attern of missing data.
3 . Use the regression estimates to generate predicted values for all the
missing values . To each predicted value, add a random draw from the residual normal distribution for that variable .
4 . Using the " completed" data set, with both observed and imputed values, recalculate the means and covariance matrix using standard forIll ulas.
5 . B ased o n the newly calculated means and covariances, make a random draw from the posterior distribution of the means and covari ances.
6. Using the randomly drawn means and covariances, go back to step 2 and continue cycling through the subsequent steps until convergence i s achieved . The imputations that are produced during the final iteration
are used to form a completed data set.
Step 5 n e eds a little mor e exp l anation . To get the posterior di s tribu tion of the paramete rs we fir st need a " prior " distribution. Al th o ug h this can be based on prior b e liefs about those parameters, the usual practice is to use a "noninformative" p rior that is, a prior d is t ribu tion t h a t contains little or no information about the pa r amete rs . Here i s how it works in a simple situation. Suppose we have a sample of size n wi th measurements on a single normally distributed variable Y . The sample mean is y and t h e sample variance is S 2 . We want random d raws from the pos te r i or distribution of JL an d 0" 2 . With a noninformative prio r ] O we can get 0.2 , a random draw for the vari ance, by sampling from a chi-square distribution with n 1 degrees ,
,
,
,
-
( l r r "rcc d o ll1 , t aki ng th e reciprocal of the drawn value , and i I l g t h e result by
ns2 .
m ul t i p ly
We then get a random draw for the mean by
sa i l l p l i ng from a n o r m a l distribution with I f there were no missing
data, t h ese
me an y
variance (j2 jn .
and
would be random draws fro m
t h e true pos t erior distrib uti o n of the p arameters, but if we have
any mis sing d ata what we actu ally h ave are random draws the p o s t e ri or dis tribution t ha t w ou l d result if the i m p uted data the true d at a . S i m i la rl y wh en we randomly impute m issing data
imputed
fro m were
,
,
in step 3 , wh at we h ave a re random draws from th e post e r i or dis trib uti o n of the m iss i ng dat a, gi v en the current parameter values .
However, because the current va lues m ay not be t h e t ru e val u es , the imput e d data m ay not be random draws fro m the true poster i or dis
tribution . That is why t h e p r oce du r e m ust be i te r at ive By co ntinu ally .
m ov ing
back and forth be twe e n ran dom draws of p arameters (condi
tio nal on b oth observe d
and im p u t e d
data )
and
ra ndom d r aw s of th e
m i s sing d ata (cond itio nal o n the current pa ram eters ) , we eve ntu a lly ge t random draws from th e j o i n t p osterior d istrib u tion of both d a t a
and parame ters , c on d i tion i n g only
on
the
o b se rv e d
d at a .
Convergence in Data Augme ntation When you do data a u g m e n ta t i o n , you must specify the number of
i t e r at ions Howeve r, that rai s e s a tough .
are ne c e s s ary to get
co
n v erge n ce
que s t i o n :
How m any i te rations
to the j o i n t posterior d is tribution of
missing d at a an d param eters ? With i te rative es ti m a t i on for maximum l i ke lihood, as in the EM algorith m , the es tim ates conve rge to a singl e set of val ues. C o n ve rgen c e can the n b e e asily ass e s sed by s ee
c h e c k i ng
to
how much ch ange t he re i s i n the parame ter e s t i m ate s froIn one
i te rat ion t o
the
n ext For d a t a augmentation, on t h e o t her hand, the .
a lgo ri thm conv erg e s to a prob ability di stri bu tion, not a si ngl e se t of
va lu es. Th at makes it ra t h er difficult to deter m i n e wh ether conver gence has, in fact, bee n achi eved . Al though some diagnos tic st atis tics
are available for assessing converge nce (Sch afe r, 1 9 97) , they are far from defini tive . In most applicat ions, the choice of number of i t era t ion s will be a stab in th e dark . To give you som e idea of t he range of possibilities,
S c h a fe r
(1 997) used
an yw h e r e
between 50
and
1 ,000
iterations for
the better, b u t each iteration can be computat ionally inten sive , e s p e c i a lly fo r l arge samp l es with l ot s of t h e exampl es i n his book. The
m ore
I ,
I
\
37
variables. Specifying a large number of iterations can leave you staring at your monitor for painfully long periods of time. There are a couple of principles to keep in mind. First, the higher is
the proportion of missing data (actually, missing information, which is not quite the same thing), the more iterations will be needed to reach convergence. If only 5% of the cases have any missing data, you can probably get by with only a small number of iterations. Second, the rate of convergence of the EM algorithm is a useful indication of the rate of convergence for data augmentation. A good rule of thumb is that the number of iterations for DA should be at least as l a rge as the number of iterations requ i re d for EM. T h at is another reason for always running EM before data augmentation. (The first reason is that EM gives good starting values for data augmentation.) My own feeling about the iteration issue is that it is not all that critical in most applications. Moving from deterministic imputation to randomized imputation is a huge improvement, even if the parameters are not randomly drawn. Moving from randomized imputation with out random parameter draws to randomized imputation with random parameter draws is another substantial improvement, but not nearly as dramatic. Moving from a few iterations of data augmentation to many iterations improves things still further, but the marginal return is likely to be quite small in most applications. An additional complication stems from the fact that multiple impu
tation produces multiple data sets6 At least two are required, but the more the better. For a fixed amount of computing time, one can either produce more data sets or more iterations of data augmentation per data set. Unfortunately, data sets with lots of missing information need both more iterations and more data sets. Although little has been written about this issue, I tend to think that more p riority should be put on additional data sets.
Sequential Versus Parallel Chains of Data Augmentation We have just seen how to use data augmentation to produce a single, completed data set. For mUltiple imputation, we need s eve r a l data sets. Two methods have been proposed to do thi s : 1. Parallel. Run a separate chain of iterations for e ach of the desired data sets. These might be from the same set of starting estimates) or different starting values.
va
l u es (say, the EM
I, I
I:
Ii . I.
.
Ii
,
J , �. ,
•
S' -'II I I '/I /lt d. Run one long ch ain of data augmentation cycles. Take the 1 1 1 1 ] HI L i t iOBs j I I t'
Wl"
p r odu c ed in every k th i teration, where k is chosen to give
d l' s i rcd number of data sets . For example, if we want five data sets,
could first run 5 00 iterations and then use the imputations prod u ced
I Jy every l OOth iteration. Th e larger num b e r of 5 00 it erations before the Erst i mput atio n constitute a "burn -in" period th at allows the process to
converge t o the co rr ect distribution .
Both methods are acceptable . An advantage of the s e qu e n t ial approach is that convergence to the tru e post e ri o r distribution is m ore likely t o be achieved, especially for later data sets in the sequence. However, whenever you take multiple data sets from th e same chain of i ter atio n s i t is questionable whether t ho se data sets are statistically independent, a requirement for v al id inference. The closer t o ge t he r two d ata sets are in the s am e c ha in the more likely there is to be some dependence. That is why you cannot j us t run 200 iterations to reach convergence and then u se th e next five iterations to produce five data sets. The par allel method avoid s the p roble m of dependence, but makes it m or e questionable whether co n v e rgenc e has been achieved. Fur t her mor e both Rubin ( 1 987) and Schafer ( 1 997) sugge s t that instead of using the same set of starting values for each sequence, the start ing values should be drawn from an "overdispersed" prior distribution with th e EM estimates at the ce nt e r of that distribution, something that is not alw ays str aightforward to dO . 1 1 For the vast majority of applications, I d o not think it makes a sub stantial difference wh e ther t he s eq ue ntial or p ara llel method is cho sen. With an equal number of iterations, the two methods sh ou ld give approximately equivalent results. When u sing the parallel method, I believe that acce p t ab le results can be obtained, in most cases, by using the EM e st i ma t e s as starting values for each of the iteration chains. ,
,
,
Using the Normal .M odel for Nonnormal or Categorical Data
that can be clo se ly approximated by t h e multivariate nor mal distribution are rare indee d . Typically, th ere wi l l b e some vari ables with highly skewed distributions a nd other v a r i a bl es that are s t r i c t l y c a t e go r i c al I n s u ch cases, is th e r e any v alue to the norm al based methods that we h ave just been considering? As mentione d Data sets
.
I -a rtier, there is no problem whatever for variables that have no miss- I I I I!, data because nothing is being imputed for them. For variables with I I I issing data, there is a good deal of evidence that these imputation I l )�thods can work quite well, even when the distributions are clearly I l ot normal (Schafer, 1997) . Nevertheless, there are a few techniques t hat can improve the performance of the normal model for imputing I l( )nnormal variables. For quantitative variables that are highly skewed, it is usually help I t . I to transform the variables to reduce the skewness before doing t he imputation. Any transformation that does the job should be OK. I\fter the data have been imputed, the reverse transformation can be ;lpplied to bring the variable back to its original metric. For exam1 ,le, the log transformation greatly reduces the skewness for most i ncome data. After imputing, just take the antilog of income. This i s particularly helpful for variables that have a restricted r a n ge If you i nlpute the logarithm of income rather than income itself, it is impos si ble for the imputed values to yield incomes that are less than zero . Similarly , if the variable to be imputed is a proportion, the logit trans formation can prevent the imputation of values greater than 1 or less t han O. Some software can handle the restricted range problem in another 'Nay. If you specify a maximum or a minimum value for a particular variable, it will reject all random draws outside that range and sim ply take additional draws until it gets one within the sp ec ifie d range. Although this is a very useful option, it is still desirable to use trans formations that reduce skewness in the variables. For quantitative variables that are discrete, it is often desirable to round the imputed values to the discrete scale. Suppose, for example, that adults are asked to report how many children they have . This typically will be skewed, so one might begin by applying a logarithluic or square root transformation. After imputation, back transformation "vill yield numbers with non integer values. These can be rounded to conform to the original scale. Some software can perform such round ing automatically. What about variables that are strictly categorical? Although th e re are methods and computer programs designed strictly for data se t s with only categorical variables, as well as for data sets with m ixtu re s of categorical and normally distributed variables, these m e t h o d s ( l rc typically much more difficult to use and often break down con1plc tcly. Many users will do just as well by applying the normal meth o d s w i t i l .
r
40 SO Ill C minor a l tera ti ons Dichotomous vari ables, such as sex, are .
a l l y represented b y d ummy (in dicator) vari ables with val ues of
us u 0 or
1 . Any transformation of a dichotomy will sti l l yield a d i c h o tomy , so
there is
no
value in t rying to r edu c e skewness . Instead, we can silnply
variable just like any other variab l e . 'Then round the to 0 or 1 , a c c ord i ng to whe ther the imputed value is
in1pute the 0- 1
imputed value s
above or b e l ow . S . While most i m put a tions wi ll be in the (0, 1 ) inter
va l , they
w ill
sometim e s b e outside that range . This presents no prob
lem in this case because we s i mpl y assign a value of
0
o n which is closer to the imputed value.
or 1 , dep e nding
Variables with more than two categories are usu a lly represented
with sets of dummy variables. The re is no need to do a n ything spe cial at the data augmen t at i on s t age, but care m ust be taken when as s i gn i ng the final values . The problem is that we need to assign indi viduals to one and on ly one category, with appropriate cod ing for th e dummy va r iab l e s . Suppose the variable to b e i mp u ted is m arital sta tus, which h as three
c
a t egories :
never
m ar r
i ed
,
c urrently m arrie d, and
form erly married . Le t N be a dummy variable for never marrie d and
let IV! be a dum my variable for currently
married.
The i m p ut a t i on is
done with these two variables and the imputed values are used to pro duce final codings. Here are some possible imputations and resulting eodings : Final Values
Imputed Values N
M
. /
.2
,...
.3
.5
.2
.2
.6
.8
- . 'J
.2
....,
The essential
r
u le
N
M
.1
1
0
.2
0
1
.6
0
0
- .4
0
1
1 - N - M
1
0
0
i s this. In add i t i o n to the two imputed values, also
calculate 1 minus the sum of the two imputed valu e s , which can be regarded as the imp uted v a lue for the reference category. Then deter In ine
which c a tegory h as the h ighest
i mp ut ed
value. If that value cor
re sponds to a category with an explicit du mmy variable, assign a value of 1 to that vari able . If the highest value corresponds to the refer
ence category, assign a
0
to both dummy variables. Again, although
negative values may appear awkward in t h is context , the rule still can
he app lie d . The exte nsion to four or more categories sho u l d be
s t raigh tfolWard .
I
'
,
,
'
4J Exploratory Analys is
Typ ically much of data analysis consists of exp l o r at ory work in w hi ch the an a lys t exper i m ent s with different methods and models. For anyone who has done t h i s kind of work, the process of mu l ti p l e imputation may seem high ly problematic. Doing exploratory a nalys i s on several d a t a sets simultaneously would surely be a very cumber some process . Furthermore, analysis on each data set could suggest a slightly different model, but mUltiple i m putat i o n requires an identical model for all data sets. The solution is simple, but ad hoc. When generating t he m u l ti p le data sets, just p r oduce one more data set than you need to do the mul t ip l e im p u t at i on analysis. Thus, if you want to do multiple i mput ation on three data sets, t h e n generate four data sets. Use the extra data set to do exploratory an a ly si s Once you have settled on a s i ngle model or a small set of models, reestimate the models on the re m aini ng data sets and ap ply the methods we have discussed for c o mbin i n g the results. Keep in mind that although the parameter estimates obtained from the exploratory d at a set will be a p p r oxi m ate ly unbiased, all the standard errors will be biased downward and the test statistics will be b i a sed upward. Consequently, it Inay be desirable to use somewhat more conservative criteria than usual (with complete data) to j udge the adequacy of a given model. ,
.
MI Example 1
e nough background to consider a realistic examp le of multiple imputation. Let us revisit the examp l e used in Ch apter 4, where the data set c on sis t ed of 1 ,302 American colleges with mea surements on seven variables, all but one of which have some missing d at a As before our goal is to estimate a linear regression model th at pre dicts GRADRAT, t he ratio of the number of graduating seniors to the number who enrolled as fr e s h man four years earlier. I n d e pen d ent variables include all the others except ACT, the mean of ACT scores. This variable is inclu d e d in the imputation process to get better pre dictions of CSAT, the mean of the combined SAT scores. The latter variable has m issing data for 40% of the cases, but is highly corre lated with ACT for the 488 cases with data present on b o t h variables We now h ave
,
.
(r
==
.9 1 ) .
___
I
__ _ _ _•
_
.-"- _
J
� _
' rhe first step wa s to examine the distributions of the variable s to
c h eck for norm a l i ty. Histograms and normal p rob ability plots sug
ge sted th at al l t he v ari ables except one h ad distributions that were re aso n a b ly cl ose to a norm al distribution . The excepti on was e n ro l l men t ,
wh ich
was
highly skewed
to the righ t. As in the ML example, I
worked with the natural logarithm of enrollment , wh i ch h ad a distri bution with very little skewn ess. To do the data au gment ation, I used PROC MI in SAS . Th e first step was to estimate the us i ng the
EM
means, standard
devi atio ns, a n d correl ations
algori thm , which were alre ady displayed in Table
4.4.
The EM algorithm took 32 iterations to conve rge . This i s a moder ately large number, which probably refl ects the large percentage of missing data
for
some of the variables . However, i t is not so l arge as
to sugge st seri ous problems data augmentation . He re is a
m ini m
al set
in
applying e i ther the EM
algorithm
or
of SAS s tateme nts to produce the multiple
imputations :
p r o c m i dat a= c o l l e ge o ut = c o l l imp ; v ar gr adrat
c s at l enr o l l p r i vat e s t uf ac rmbrd act ;
run ,'
college ing
is t he n ame of the i
values)
and
collimp
np ut data set
(wh i ch had periods for miss
is the n ame of the output data set (wh i ch con
tains observe d and imputed values) . 'Th e
var
statement gives names of
va ri ab le s to be used in the imputation process. The default in PROC
MI
is to pro duce five completed d ata sets b ased on a s e quential ch ain
of iteration s . Wi th EM estim ates used as starti ng values, there are
200
"burn-in"
iterations befo re the fi rst imputation. This is followe d
by 1 00 iterations between successive imputations. The five data sets are wr i tten into one large SAS d ata set
(collimp )
to facil i tate later
an alys i s . The out p ut data set co ntains a new vari able
wit h values of wi th
1 ,302
1
j m p utation_
through 5 to indi cate the differe n t d ata sets. Thus,
observation s in the original dat a set, the n ew d ata set h as
6 ,5 1 0 observat ions . Rather than re lyi ng on the defaults, I actually use d a somewh at more comp licated
program :
__�
•
,
. "I
:"
43
pr o c mi dat a m inimnm
==
maximum
==
r ound
==
==
0 600
.
100 1410
1111 .
var gradr at
MCMC nb i t e r run ;
seed
my . c o l l eg e out
111 ;
c s at
==
.
.
0
.
==
1260 1 1
c o l I imp s e ed
==
140 1
1 0 0 8700 3 1
l enr o l l pr i v at e s t uf a c rmbrd act ;
500 niter
==
200 ;
1401 sets a "seed" value for the random number generator so that the results can be exactly reproduced in a later run. The max i lnum and m i n i m u m options set maximum and minimum values for each variable. If a randomly imputed value happens to be outside these hounds, the value is rejected and a new value is drawn. For g radrat and stufac, the maximum and minimum were the theoretical bounds of 0 and 100. For lenro l l and private, no bounds were specified. For csat, rrrl b rd , and act , I used the obselVed maximum and minimum for each variable. The round option rounds the imputed values of all variables to integers, except for the STU TAC, which is rounded to one decimal place. The M CMC statement permits more control over the data augmen tation process. Here I have specified 500 burn-in iterations before the first imputation, followed by 200 iterations between successive impu tations. Were these enough iterations to achieve convergence? It is hard to say for sure, but we can try out some of the convergence diagnostics suggested by Schafer ( 1 997) . One simple approach is to examine some of the parameter values produced at each iteration and see if there is any trend across the iterations. For the multivariate normal model with 7 variables, the parameters are the 7 means, the 7 variances, and the 21 covariances. Rather than examine all the parameters, it is useful to focus on those that involve variables with the most missing data because these are the most likely to be problematic. For these data, the variable CSAT has about 40% missing data. Because the ultimate goal is to estimate the regression that predicts G RAD RAT, let us look at the bivariate regression slope for CSAT, which is the covariance of CSAT and G RAD RAT divided by the variance of CSAT Figure 5 . 1 graphs the values of the regression slope for the first 1 00 iterations of data augmentation. After the first iteration, there s e e nl S ==
-
- -._-
---
44 Cs.; Q
t
B
09 8 1
�
c.
() 9 7
C
e9G
C 0
.
\
11
(J' q c... _
i, I
,
I
,
,
I
_
l
•
o S' 4
,
I
o . o c:u
\
0 . 092 I
I) . e n 0 . 090 0 . 0 8 <) Ci . 0 8 8 G
087
o
086
o
085
o
]0
20
60
40
70
80
� t e r·a t i o n
Figure 5 . 1 . Estimates of Regr e s s i o n Slope of GRADRAT on CSAT for the
First 100 Iterations of D ata Augm entation
to be no
p articu l ar trend in
the estimates of the s lop e coefficient,
which is reassuring.
Another recommended d i agno s t i c is the set of autocorrelations for the parameter of interest at various lags in the sequence of itera t i o n s The obj e ctive is to have enough iterations .
so that the autocorrelation goes to iterations, Fi gu r e
5.2 graphs
O.
b e twe e n
imputations
Using th e entire series of 1 ,300
t he autocorrelation between values of t he
bivariate regression slope for diffe rent lags . Thus, the leftmost value of 0.34 re p r e se nts the
correlation
between parameter values th at ar e
one iteration ap art . The s econd value, 0. 1 3, is the correlation between values that are sep ar at ed by two iterations . Although these two initial values are high,
the
autocorrelations quickly se ttl e down to relatively
low values (within 0. 10 of 0) that h ave an a ppa ren t ly r a n d om p at te rn
To g e the r these two di agnostics suggest that we could have
.
m a n aged
with far fewe r th an 200 iterations s e p ar a t i n g the imputations .
Never
theless, it does not hurt to do more, and the diagnostics used here do not guarantee that convergence has been attained.
have transform ed form, but because I
After producing the completed d a ta sets, I could
the logged enrollment variable back to its original
.+;:... Ul
CSAT LENROLL PRIVATE STUFAC RMBRD
Intercept
( 1 . 1 5 7) (0 .0 8 4) (0 04 08 )
1 2 .274 -0.213 1 . 657
( 1 . 1 26 ) (0.082 ) (0. 392 )
1 2.840
- 0. 1 1 6
2 .4 1 7
( 1 . 1 24 )
(0. 0 83 )
(0. 3 9 0)
1 1 .632
- 0 . 145
2.95 1
2. 1 0 3
- 0. 1 42
1 3 .468
- 0.23 1 2.612
(0.391)
12. 1 9 1 (0 . 083 )
(1.121)
(0.3 9 3 )
(0.0 84)
(1.141)
(0.538) .
97 1 1
(0. 5 32)
2. 1 87
(0 .5 46 )
1 .852
(0 .526 )
2. 0 23
(0 . 5 3 4)
1 .55 0
( 0 . 004) 0 . 0 65 (0. 004 )
( 4 .9 24 )
0. 069
( 4.869 )
(0. 004)
.
3 4 727
0 . 07 1
(0.004)
0 . 067
(0. 004 )
0 . 069
-
_.
(4 . 3 0 6 )
.
- 3 1 . 256
( 4 . 25 0)
- 3 3 . 23 0
( 4. 272 )
-33. 219
- 29 . 1 1 7
-" - ' �- ' - - --'''--''---- --.-.-- - .
Regression Coefficients (and Standard Errors) for Five Completed Data Sets
TABLE 5 . 3
-
�
. _ ___ ,___
_'
_
__
.
_'_
-,j.,Iot;l;-H#
, I
,
'
I
, "
T.
U S I>l
'J . 1:' .1 ...
,---
_____ ___
. ________________________ __-----,
,
I�
o .
"l 2
" .30
0 .) 4
�) . ! t\ .., . l�
lb
:1 . J '; o
_
1
:2
n .10 C .Ja 0 .06 0 . 1)4 o . 0 7:
-0 -
_
8
1-'. ')
.
.
04
' U _ ('I --' O "
- 0 .u8 - c1
,
1 ()
- 0 . l2
1 ;;
34 S " 7 8 9 1 1 1
1 1 1 � 1 1 :;';22 22 2 22 2 22 3 3 3 3 3 :1 3 3 3 3 4 44 -1 4 4 4 4 !' 4 5 5 5 5 '> 5 5 5 5 S 6 6
:Jl 2 ,4 => 6 7
S 66 fj 66 6 67 7 77 7 n 7 Tn] 8 8 e B 8 8 88 F. 9:J
9 9 �l 9 '3 9 01 9 1
8 90 1 23 4 56 '1 89 0 12 3 4 5 6 "1 8 9 C 1 ::> 3 4 5 6 7 8 9 0 I :2 .3'1 5 6 7 a 9 0 � ;:> 3 4 '0 6 j 8 90 :' 23 4 :,(, 7 B9 C 12 � 1 5 6 7 8 ') 0 1 2 J 4 to 6 7 8 9 G :1
Lan"
Figure 5 . 2. Autocorrel ations for Regression Slope of GRAD RAT on CSAT
for Lags Varying Between 1 and 1 00
e xp e c ted diminishing returns in t h e effect of
enrollmen t
on gr ad u ati o n
rates, I deci ded to l e ave the vari able in logarithm i c form, j ust as I
di d for the regression models estimated by ML in Chapter 4 .
So
the five compl eted data sets. Th is i s facilitated
in SAS
by
the
e ach the use
next step was simply to estimate the regressi on mo del for
of of
a BY statement, wh ich avoids the necessi ty to sp ec ify five different
re g r essi o n pr o c
mo dels :
r e g dat a
�
m o de l gradrat
c o l l imp out e s t �
by _imput at i on_ ;
�
e s t imat e
c o v o ut ;
c s at l enro l l p r i vat e s t uf a c rmbrd ;
run ;
This
I
set of state ments tells
SAS to
e stima te a separate regr e ss i o n
model for ea c h subgroup defined by the five values of the
jmputation _
vari able . Qutest== esti mate re quests that the re gre ssion estimates b e
e sti mat e and covout requ ests th at the c ova r i an c e matrix of the r e gr e ss i on parame ters be included in this data set. This makes it ea sy to combine the es t i m a te s in the n ext step . Results for the five r e gr e ss io n s a re shown in Table 5 . 3 . Clearly t h e re i s a gre at deal of stability from one re g re s s io n to the n ext, but
written into
a
new data set calle d
,
·1 1 I here is also noticeable variability, which is attributable to the ra n d O l l i
(omponent of the imputation .
The results from these regressions are integrated into a single of estimates using another
SAS
procedure called
i nvoked with the following statements : pr o c m i analyz e dat a v ar int e r c ept
=
sc t
MIANALYZE. I t is
e s t imat e ;
c s at lenr o l l pr i vat e stuf ac rmbrd ;
run ;
rrhis procedure operates directly on the data set esti mate, which con tains the coefficients and associated statistics produced by the regres sions runs. Results are shown in Figure
5.3.
The column labeled "Mean" in Figure 5 . 3 contains the means of
the
coefficients in Table
formula
5.1,
5.3.
The standard errors, calculated using
are appreciably larger than the standard errors in Table
5 . 3 , because the between-regression variability is added to the within
regression variability. However, there is more between-regression vari ability for some coefficients than for others. At the low end, the stan dard error for the l en rol l coefficient in Figure
5.3
is only about
larger than t he mean of the standard errors in Table end, the combined standard error for rm b rd is about
5 .3. At
70%
10%
the high
larger th an
the me an of the individual standard errors. The greater variability in the rm b rd coefficients is apparent in Table range from
1 .66
to
5 .3,
where the estimates
2.95.
The column labeled " t for H O : M ean = O" in Figure
5.3
is j ust the
ratio of each coefficient to its standard error. The immediately pre ceding column gives the degrees of freedom used to calculate the p
value from a t table. This number h as nothing to do with the num ber of observations or the number of variables.
It
is simply a way to
specify a reference distribution that happens to be a good approxi mation to the sampling distribution of the t-ratio statistic . Although it is not essential to know how the degrees of freedom is calculated,
I think it is worth a short explanation . For a given coefficient, let U
be the average of the squared, within-regression standard errors. Let B be the variance of the coefficients between regressions . The
increase in variance due to missing data r ==
is defined as
(1 + M-1 )B -----
U
'
relative
, _. __ ,
I,
_____ _ _ _____ _ __ _ __ __ _ _____ _._
.-,
;,d,lA ; , ,�J,",
, , ,
.
\ i
M u l t i p l e � I m p u t a t i o n P a ramet e r Est ima t e s F r act ion Mis sing
t f o r HO :
St d E r ro r
I n f o rma t io n
Mean
Mean
OF
32 . 3 0 9 795
5 . 63 9 4 1 1
72
- 6 . 5 96 9 9 5
< . 000 1
0 . 2 5 5 7 24
( : �-; a t
0 . 068255
0 . 004692
39
1 4 . 5 4 73 8 8
< . 000 1
0 . 3 5 64 5 1
l_ e rl r o l l
1 . 9 1 6654
0 . 5 95 2 2 9
1 10
3 . 2 20027
0 , 00 1 7
0
p r iv a t e
1 2 . 4 8 1 05 0
1 . 3 67858
40
9 . 1 24524
< . 000 1
0 . 3 44 1 5 1
st ufac
- 0 . 1 69484
0 . 09933 1
42
- 1 . 706258
0 . 09 5 3
0 . 32 9284
2 . 348 1 3 6
0 . 6 7 0 1 05
10
3 . 5 04 1 3 2
0 . 0067
0 . 708476
\1 . 1 1
j
;dJ I
I ll!
('
rcept
i�
rm b rd
-
Me a n =
Pr > I t I
0
.
2062 1 0
Figure 5 . 3 . S elected Output From PRO C MIANALYZE
M is, as before, the number of completed data sets used to produce the estimates . The degrees of freedom is then calculated as
where
df
==
(M
1 ) (1 + r 1 ) -
-
'1 -'- •
v ari a t io n is relative to the within -regression variation, the l arger is the degrees of freedom. Sometimes the calculated degrees of freed o m will be substanti ally g rea t e r than the numb er of obse rv a t i o n s. This is nothing to be co n cerned a b ou t , because any number greater t h a n about 1 5 0 w i l l yield a t table that is es s e n t i a l ly the same as a s t a n d a r d norm al distribu tion. However, some so ftware (including PROC MIANALYZE) can p r o d uc e an a dj u s t e d degrees of freedom th at ca n n o t be greater than t h e s ample size (Barnard & Rubin, 1 999). The l a s t column, " Fracti o n M i ssi ng I nfo rm ati o n , " is an estimate of how m u c h information about each coefficient is lost because of miss ing d a t a It ra ng e s from a low of 21 % for l en rol l to a high of 7 1 % for rmbrd . It 's not s u rp risi n g that the m i s s i ng information is high fo r rmbrd, w h ich had 40 % missing data, but it is surprisingly high for p rivate, wh ich had no m is s i n g data, and stufac, which had less than 1 % m issi ng d at a To u n de rs ta n d thi s, it is i mpo r t a n t to know a co u pl e of things. Fi rs t , the amount of mi ssing i n fo rm at io n for a g ive n Thus, the smaller the
betwe en -regression
.
.
( -ocfficient dep ends not only on the missing data for that pa rticu l a r v a r iab le but also on the percentage of missing data for other variables I hat are correlated with it. Second, the MIANALYZE pr oc e d u r e h a s 1 1 0 way to know how much missing data there are on each vari ab l e . I nstead, the m issing information estimate is based entirely on the rcl ; , tive variation within and between regressions. If t h e re is a lot of variation between regressions, that is an indication of a lot of miss i n g i nformation. S ome times denoted as y , the fraction of missing informa tion is calculated from two statistics that we j ust defined, r and df. S-1 eifi cally ,
(}
,
y == A
r + 2j(df + 3) •
r + 1
Keep in mind that the fraction of missing information reported in the table is only an estimate that may be subject to considerable sampling v ar i ab i l ity .
As noted earlier, one of the troubling things about multiple impu tation is that it does not produce a determinate result . Every time you do it, you get slightly different estimates and associated statis tics. To see this, take a look at Figure 5.4, which is based on five data
M u lt i p l e - I m p u t a t i o n P a r a m e t e r E s t im a t e s
Std E r r o r
t
Fract ion
f o r HO :
Miss ing
Mean
Mean
OF
- 3 2 . 474 1 5 8
4 . 8 1 634 1
1 24
- 6 . 742496
< . 0001
0 . 1 92 4 2 9
c s at
0 . 066590
0 . 0 05 1 87
20
1 2 . 838386
< . 000 1
0 . 48934 1
l e n r oll
2 . 1 732 1 4
0 . 546 1 7 7
2 1 57
3 . 978955
< . 000 1
0 . 043949
p r i vate
1 3 . 1 25 024
1 . 1 7 1 4 88
1 1 91
1 1 . 2 03 7 1 9
< . 000 1
0 . 059531
stufac
- 0 . 1 90 03 1
0 . 0 99 0 2 7
51
- 1 . 9 1 8 9 88
0 . 0607
0 . 307569
2 . 3 5 7444
0 . 5 9 934 1
12
3 . 933 39 6
0 . 0 0 20
0 . 623224
Variable int e r c e p t
rmb rd
Figure
5 .4.
Imputation
Output
From
M e a n=O
MIANALYZE
for
Pr
>
ItI
Replication
I n f o rmat ion
of
M u l t i p le
-- .- . -.
· .
i)()
scls J he
Hu·
i ly new run of data augmentation. Most of resuits are quite si mil ar to those in Figure 5.3, although note that fr;tctions of missing information for lenroll and private are much produced by
an ent re
lu\vcr t han before. When the frac t ion of missing information is high, more than the
rccollllncnded thre e to five completed data sets may be necessary to get stable estimate s . How many might that be? Mu lt iple imputa t ion \vi than infinite number of data sets is fully efficient (like ML), but MI wi th
a
( J <JH7)
finite
number of data set s does not achieve full efficiency. Rubin
showed that the relative efficiency of an estimate
based
on
M
uata sets compared with an estimate based on an infinite number of data se ts is given by (1 + 'Y / M)-l, wh ere 'Y is the fraction of missing
information. This imp lies that with five data sets and information, the efficiency of the estimation proc edure
10
data sets, the efficiency goes up to
95%.
Equivalently, us i ng only
five data sets would give us standard errors that are when an
infinite
50% missing is 91 %. With
5%
larger than
number of data sets is used. Ten data sets would yie ld
are 2.5% larger than an infinite number of data sets. The bottom line is that even w ith 50% m i ssi ng information, five data sets do a pretty good job. Doubling the number of data sets cuts the excess sta n d ard error in half, but the excess is small to begin with.
standard errors that
Before leaving the regression example, let us compare the MI
5.4 with the ML results in Table 4.6. estimat e s are quite similar, as are the standard errors Certainly the same conclusions would be reached results in Figure
The coefficient and t stati s tic s . from the two
analyses.
6& MULTIPLE IMPUTATION: COMPLICATIONS Interactions and Nonlinearities in MI Although the
ng
e s t ima t i
met
hods
we have jus t described are very good for
the main effects of the variables with
m issing
data, they
may not be so good for estimating interaction effects. Suppose, for example, we suspect that the effect of SAT scores (CSAT) on grad
for public and p ri vate colleges . One way to test this hy pothesis ( Method 1) would be to take the previ ously impute d data, create a new var i ab l e that is the product of CSAT
uation rate (G RAD RAT) is different
TABLE
6.1
Regressions With Interaction Terms Method 1 J
f.lnable
I NTERCEPT
( .'SAT
I , ENROLL
STUFAC
P RIVATE
RMBRD PRIVCSAT
Coefficient
p
Three Methods Method j
Method 2
Value
Coefficient
p
Value
Coefficient
J) J 1 1 /1 1 .
- 50.2
.( 100
-39. 1 42
.000
-48.046
.000
0.073
.000
0 .085
.000
0.085
2.383
.000
1 .932
.00 1
1 .950
- 0 . 1 75
. 205
- 0 . 204
.083
-0. 1 52
20.870
.023
35 . 1 28
.00 1
36. 1 1 8
2. 1 34
.002
2 .448
.000
2.641
- 0.008
.388
- 0 .024
.022
- 0.024
.OOI l .( ) I
\
.( )( ) I
.00 . ' .003 . ()2 II
and PRIVATE, and include this product term in the regression equa tion along with the other variables already in the model. The leftmost panel of Table 6. 1 (Method 1) shows the results of doing this. The variable PRIVCSAT is the product of CSAT and PRIVATE. With a p value of .39, the interaction is far from statistically significant, so we conclude that the effect of CSAT does not vary between public and private institutions. The problem with this approach is that although the multivariate normal model is good at imputing values that reproduce the linear relationships among variables, it does not model any higher-order , moments. Consequently, the imputed values display no evidence of interaction unless special techniques are implemented. In this exam ple, where one of the two variables in the interaction is a dichotomy (PRIVATE), the most natural solution (Method 2) is to do separate chains of data augmentation for private colleges and for public col leges. This allows the relationship between CSAT and G RAD RAT to differ across the two groups and allows the imputed values to reflect that fact. Once the separate imputations are completed, the data sets are recombined into a single data set, the product variable is created, and the regression is run with the product variable. Results in the mid dle panel of Table 6 . 1 show that the interaction between PRIVATE and CSAT is significant at the .02 level. More specifically, we find that the positive effect of CSAT on graduation rates is smaller in private colleges than in public colleges. A third approach (Method 3) is to create the product variable for a l l cases with observed values of CSAT and PRIVATE before imputation ,
52 the n imp ute t he product va riable j ust like any other variable with miss ing data, and, finally using th e imp u ted data, estimate the regression ,
model that includes the product variable. This method is less appeal ing than Method 2 becaus e the product variable typ icall y will have a distribution that is far from normal, yet normali ty is assume d in the imp u tatio n process. Nevertheless, as seen in the righ t - hand panel of Tabl e 6.1, the results from Method 3 are very close to those obtained with Method 2 and
y much closer th an those of Method 1.
c ertainl
The results for Method 3 are reassuring because Method 2 is not feasible when both variables in the interaction are measured on quan titative scales. Thus, if we wish to estimate a
model
with the interac
tion of CSAT and RMBRD, we need to create a product variable for the 476 cases that have data on both these variables. For the
ainin g 826 cases we must impute the product tern1 as part of the data augmentation process. This method (or Method 2 when possi b le ) should be used whenever the goal is to esti m ate a model with no nli n ea r relationships that involves vari ab le s with missing data. For rem
,
example, if we want to estimate a model with both RMBRD and RMBRD squared, the squared term should be imputed as part of the dat a augmentation. This
req u i rement
puts some b urd e n
on
th e
imp u ter to anti c ipate the desired functional form before beginning the imputation. It also means that we must be cautious abo u t e s tim ating nonlinear models from data th at have been i m puted by others using st r ictly linear models. Of co urse if the percentage of missing data on ,
a given variable is small, we may be able to get by with imp uti n g
a
variable in its ori ginal form and then co n st ructing a non linear trans formation later. Certainly for the variab le STUFAC (student/faculty ratio), w i t h only two cases missing out of 1,302, it would be quite ac ceptable to p u t STUFAC squared in a regres s ion model after sim· ply
s quaring
the two imputed values rather th an i mp u ting the squared
values.
Compatibility of the Imputation Model and the Analysis Mod.el
The problem of interactions illustrates a more general issue in mul tiple imputation Ideally, the model used for imputation should agree .
with the model used in
analysi s ,
and both should c orre c tl y represent
the data. The basic for m u l a (Equation 5.1) for co mputing standard errors depends on this
c om patibility
and co rrectn es s .
-----
----- -
- --
I
,
,
53 I,
I I
W h at happens when the imput ation and an alys i s models differ? I I L l l d e p en ds on the n ature of the difference and which mode l is H I eet (Schafer, 1997) . Of p a r ti cu l ar interest are cases in which one 1 1 1 ( ( 1 �1 is a special case of t h e other. For example, the i mput a t i o n H leI may allow for interactions, but the a naly s i s model may no t , j H ' he a n a l ys is model m ay allow for interactions, but the imputation I I I < Hlel m ay not. In e i t he r case, if the additional restrictions i mp o s e d b y I l ll" s i m p l er mo del are correct, then the pro c e d u r e s we h ave d i scu ss e d I i I f i nfer ence under mUltiple i mpu t a ti o n will be val id . However, if the , t d d itional restrictions are n o t correct, inferences that usc the standard I l l cthods may not be valid . Methods that are less sensitive to model choice h ave been p r o p o se d I ( ) r es t i mati n g standard errors under mu ltipl e imputation (Wang & I {obins, 1998; Robi n s & Wang, 2000) . S p e c ifically these methods give va lid st an dar d error e stim ate s when the im p u t at i on and analysis mod ( - I s are incompatible and when both models are in co r r ect Neverthe l e ss, i n co rr e ct models at either stage still may give biased parameter e s ti m a t e s and the alternative methods require specialized softw a r e I hat is not yet re adily av a i labl e /
t
)
J I ((
,
.
,
.
«tole of the Dependent Variable in Imputation
one of the variables included in the data augmentation process, th e d epen de nt variable was implicitly use d to impute missing values on the independent variables. Is this l egit i m a te ? Does it not tend to p rodu ce spuriously l arge regression co e ffic ie n ts ? The answer is that not only is this OK, i t is essential for getting unbi ased estimates of the re gr e ss ion coefficients. With deterministic impu tation, using the d epe n d ent variable to im p ute the m is si ng values of the independent vari ab les can, indeed, produce spu ri ou s ly l arge regression coefficients, but the introduction of a r an dom component into the impu t a tio n process counterbalances this tendency and gives us approximately unbiased estimates. In fact, l e aving the dependent variable out of the i m p u t a t i on process tends to p ro du ce regression coefficients th at are spur iously sm all, at least for those variables that have missing data ( L a nde rman L a n d & Pieper, 1997) . In the co l l eg e example, if GRADRAT is n o t used in the imputations, the coefficient s for CSAT and RMBRD , both wi th la rge fractions of m i ssi n g data, are reduced by about 25 % and 20% , respectively. At the same time, the Because GRADRAT was
,
,
,
, , .
,
' .
54
coefficient for larger.
Of
LENROLL, w h ich
only h a d five missing values, is 65 %
course, including G RAD RAT in the data augmentation process
also means t h a t any missing values of GRAD RAT also were
imputed.
Some authors have recommended against imputing nlissing dat a on
the dependent variable (Cohen & Cohen, 1 985). To
follow
this advice,
we would have to delete any cases with missing data on the dependent
imputation.
variable before beginning the
There is a valid rationale
for this recommendation, bu t it applies only in special cases. If there are missing dat a on the depend ent variable but not on any of the
independent variables, m aximum likelihoo d estimation of a regres
sion model
(whether l inea r or nonline ar) does not use any information
from cases with missing data. Because ML i s optimal, t here is nothing to gain from imputing the missing cases under m u l tip l e imputation. In
wou l d not lead to any However, the situation
fact, although such imputation
bias, the stan
d ard errors would be larger.
changes
there
is
wh e n
also missing data on the indepen dent variables. Then cases
with missing values on the dependent variables do h ave some infor mation to contribu te to the estimation of the regression coefficients, a l t ho ugh prob ably not
a
great deal.
The
upshot is that in the typical
case with missing values on both dependent and independent vari ables, the cases not be d eleted.
with
missing values on the dependent variable should
U sing Additional Variables in the Imputation Process
As alre ady noted, the set of variables used in data augmentation certainly should include all variables t h a t will be used in the planned analysis. In
the
college
example,
we also included one additional
variable, ACT (mean ACT scores) , because of its high correlation
with CSAT, a variable that had substanti al missing data. The goal was to im p rove the imputations of CSAT to get more reliable estimates
of its regression coefficient. We might have done even better had we inc luded s till other variables that were correlated wit h CSAT. A somewhat s i mp le r example illustrates the benefits of additional predictor vari ables. Suppose we want to estimate the mean CSAT score across the 1 ,302 college s . As we know, d ata are missing on CSAT for 523 cases . If we c a l c ul at e the mean for the other 779 cases with values on CSAT, we get the results in the first line of Table 6 .2 . The
55 TABLE 6 .2
M eans (and Standard Errors) of CSAT With Different Used in Imputation Standard
Variables %
Missing
t
. I l Il/hles Used in Imputing
Mean
Error
[
1 4 1 1 1e
967 .98
4.43
40 . 1 a
956.87
3 .8 4
26.5
959.48
3 . 60
1 3 .3
958.04
3.58
1 1.3
\( \(
\�
'"
(
'
" I : PCT25 v I: PCT25, GRADRAT
Informa tion
. \d ual percentage of missing data.
line shows the estimated mean (with standard error) using mul t i ple imputation and the ACT variable. The me an has decreased by I ) points, whereas the standard error ha s decreased by 1 3 % . Although /\Cf has a correlation of about 0.90 with CSAT, its usefulness as a p redictor variable is somewhat m arred by the fact that values are ( lbserved on ACT for only 226 of the 5 23 cases missing on CSAT. If we add an additional variable, PCT25 (the percentage of students in t h e top 25 % of their class) , we get an additional reduction in stan dard error. PCT25 has a correlation of about 0.80 with CSAT and is available for an additional 240 ca s e s that have missing data on both (�SAT and ACT. The last line of Table 6.2 adds in GRAD RAT, which has a corre lation of about 0 . 60 with CSAT, but is only available for 17 cases not already covered by PCT25 or ACT. Not surprisingly, the decline in the standard error is quite smalL When I tried to introduce all the other variables in the regre ss i on model in Figure 5.4, the s t a n d a r d error actually got larger. This is likely due to the fact that the other variables have much lower correlations with CSAT, yet additional vari ability is introduced because of the need to estimate their regression coefficients for predicting CSAT. As in other forecasting problems, imputations may get worse when poor pred ic tors are added to the model I l ext
Other As
Parametric Approaches to Multiple Imputation
we have
seen,
multiple imputation under the multivariate n or
mal model is re asonably straightforward un d er a wide variety
of data
�,n d m is s i ng data patterns. As a routine met ho d for han dling In issin � d ata, it is probably the best t h a t i s curre n t ly available . There a r c , h(1wever, several alt e r n at ive app ro a c hes that may be preferable lyp e s
i n SOIn,e circumst ances .
() n e of the m o s t obv ious limitations of the lllultivaria te no rm a l t110 d e l is t h a t it is designed only to impute mis si n g values fo r quanti
tative v a r iab le s. As we have seen, c atego r ic a l variables can be accom lllodat vd by using some ad hoc fixups. However, sometimes you may w a n t t () do b e t te r For situations in which all variables in the i mp u t at i on process are ca teg or i c al a more attractive model is the unre stri cte e;1 mu lt in omi a l m o del (w h ic h has a parameter for every cell in t h e CO lltingency table) or a log-linear model that allows restrictions o n t he multinomial parameters. In Chapter 4, we discussed ML esti mat io n of these models . S chafer ( 1997) showed how these m ode ls a l so c an be used as the basis for data augmentation to p r od u c e mul t ip le irJ1putations a nd he developed a freeware program called CAT to i mp jement the method (http://www . stat.psu.edu/''--'j ls/). Ano } her Schafer pro g r a m (MIX) uses data augmentation to gen erate i fll P u ta ti o n s when the data consist of a mixt ure of categoric al and qll antitative variables. This method presumes t h at t he cate go ri cal va riables h ave a m u lt in omial d istribution, possibly with log-line ar r estrictions on th e parameters. Within each cell of the contingency tab le �reated by the categorical v ari ab l e s the quantitative vari ables are as� ume d to have a multivariate norm al distribution. The means o f the �e variables are a l l ow e d to vary across cells, but the covariance mat rix is assumed to be constant. At t ltis wri ting both CAT and MIX are available only as libraries to the S- PLUS statistical package, although stand-alone versions are p r o m i �� d In both cases, the underlying models potentially have many mo re t1 a r am e t e r s than the mu ltivariate normal modeL As a result, effec t i\(e use of these methods typically r e q u i res more knowledge and inpu t f:1 o m the person performing the i mpu t ati o n together with larger sampl e sizes to achieve stable estimates. I f d C\ta are missing for a single ca tego r ica l variable, multiple im p u t a tion u r�d er a logistic (logit) regre s s io n model is reasonably straightfor w ard (]tu bi n 1987) . Suppose data are missing on marital status, coded into fiV"� c a t egor ie s , and there are several po t e n t i al predictor vari ab l e s h o th c c;:) n t in u o u s and categorical. For th e purposes o f i mp u t a ti o n we csti ma lle a m u ltino mial l ogit model for marital status as a fu n cti on of t h� p r �dictors, using cases with complete data. This pro d u ce s a set of .
,
,
,
.
,
!
,
,
,
57 co efficient estimates j3 a n d an estimate of the covariance matrix V( S). 10 allow for variability i n the parameter estimates, we take a random draw from a normal distribution with a mean of /J and a covariance tnatrix V ( f3 ) . ( S c h a fe r [1997] g av e practical suggestions on how to do this efficiently.) For each case with missing data, the drawn coeffi cient values and the observed covariate value s are substituted in t o the multinomial logit model to generate predicted probabilities of falling into the five marital status categories . Based on these predicted prob abilities, we randomly draw one of the marital s ta tus categories as the fi n al imputed value.1 2 The whole process is repeated multiple times to generate multiple completed data sets . Of course, a binary va ri ab l e would be just a special case of this method. This approach also can be used with a variety of other parametric models, i n c lu d in g Poisson regression and parametric failure-time regressions . .-...
"-
Nonparaloetric and Partially Parametric Methods
Many methods have been proposed for doing multiple imputation under less stringent assumptions than the fully parametric methods we have just considered . In this section, I will cons id er a fe\tv repre sentative approaches, but keep in mind that each of these approaches has many different variations. All these methods are most natu rally applied when there are missing data o n only a s ingl e variable, . although they often can be generalized without difficulty to multiple v ariables when data are missing in a monotone pattern (described i n Chapter 4) . See Rubin (1 987) for details on monotone generaliza tions. These methods can sometimes be used when the missing data do not follow a monotone pattern, but in s uch settings they typ ic ally lack solid theoretical justification. When choosing between parametric and nonparametric methods, there is the usual trade-off between bias and sampling variability. Parametric methods tend to have less sampling variability, but they m ay give biased estimates if the parametric model is not a good approximation to t h e phenomenon of interest. N o np a r am etri c meth ods may be less prone to bias under a variety of situations, but the estimates often have more sampling variability.
Hot Deck Methods The best-known approach to non p arametric imputation is the " hot deck" method, which is frequently used by t he U . S . Census Bureau to
58
p rod uce
imputed values for p ubl i c u se -
data sets. The basic
idea is that
we want to i mpu te miss in g values for a p artic u lar vari able Y, which
may
a
be either q u n t i t at iv e or categorical . We
missing d a ta )
X variables (with no
forn1
wi t h
a
fin d
a
s et
of categorical
that are associated with Y. We
continge n cy table base d on the X variables . If there
m is
s ing Y values
are cases
withi n a p artic ul ar ce ll of the contingency table,
we take one or more of the no n mis s i ng case s in the same cell an d
use
thei r Y values t o im p ut e the m iss i n g Y values .
Obviously there are a lot of complication s that may arise. The crit ical qu e s ti o n is how do you choose whi ch " donor" values to assign
Cl e a rly the to avoid bias.
to the cases wit h miss ing values ?
choice of donor cases
should be random ized somehow
This leads n a t u r ally to
multiple i m p u ta t i o n b ecause any randomize d method can be applied
m o re
than once to produce different imp u t e d values . The trick is to
do the randomization in such a way that all t he natural
variabi li ty
is
preserved . To acco m p lis h this, Rub i n p ro p o se d a method he c o in ed
(Rubin, 1 987; Rubin & Schenker, is done. S u ppo se that in a part i cu lar cell of the
the app r o xim ate Bayesian b o o t strap 1 99 1 ) . Here is how it
co n t i ngency table there are n l cases with complete c as es with missing dat a on Y. Fol low these s t ep s : 1 . From the set of
n1
data on Y an d
no
cases with complete data, take a random sample
(with replacement) of n 1 cases .
2. From this sample, take a random sample (with repl acement) of
n o cases .
3 . Assign the no observed values of Y to the no cases with missing data
on Y .
4 . Repeat steps 1 to 3 for every cell i n the contingency table .
These four steps produce one co m p l e t e d data set wh e n ap p lie d to all cells of th e contingency tabl e . For m ulti pl e imputation, the whole process i s repeated
m u l ti p le
After the desire d analysis is per r e s u l t s are combined u si ng the s ame
times .
for m e d on e ach data set, the
formul as we used for multivari ate normal imputations .
Although it might seem that we could c h o o se n o ·
donor cases from
among
the
n1
skip
cases
s t e p 1 and directly
with compl ete data,
does not produce sufficie n t va r i a bil i ty for estim ating stand ard errors. Ad d i t i o n a l var iabil i ty comes from th e fact that s ampling in this
step 2 i s with replacement.
59 Predictive Mean Matching
A major attraction of hot deck im p ut a t io n is th at the imp u te d val l ies are all actual observed values. Consequently, there are no impo s sible " or o u t o f r a nge v a l ues an d th e s h ap e of the d i s t r ib u t ion tends t o be preserved. A disadvantage is that the p r edi ct or variables a l l Inust be c at e g o r i c a l ( or t reat e d as s uch ) which imposes serious lim i tations on the number of p ossibl e p red i c t o r variables. To remove this l i m it a t i on Little (1 988) p ropos ed a p a rt i al ly p ar a me t ri c method called predictive mean matching. Like t h e multivariate normal para metric m e t ho d t hi s app r o ac h be g i n s by regre ss i n g Y, the v ar i ab le to be i mp u t ed , on a s et of pre d ict o rs for cases with com p le te data . This r e gres s i o n is then used to ge nerat e predicted v al u e s for both the miss ing and the nonmissing c ases Then, for each case with m i ss i n g data, we find a set of cases ,vith com pl e te data that have p r e di ct e d values of Y th a t are "close " to the predicted value for the case w i th missing data. From t h is set of cases, we randomly choose on e case whose Y "
-
-
,
,
,
,
.
v alu e i s donated to the missing
case. For a single Y variable, it is straightforward to define closeness as the absolute difference between predicted values. However, then we mu s t decide how many of the c l o s e p re dic ted values to include in the donor pool for each missing ca se o r equiv al e ntly what should ' be the cu t o ff point in closeness for fo rIn i ng the set of possible donor v a lue s ? If a small donor pool is chosen, there will be more s amp l i ng vari ab i l i ty in the estimates. On the other h an d too large a donor pool c an lead to possible b i as because many donors m ay be unlike the re ci p ien t s To deal with this amb ig u i ty , Schenker and Taylor ( 1 996) developed an a d ap t iv e method" that varies the size of the donor pool for e ach missing c a se based o n th e d ens i ty of co mp l e t e cases w i t h close p r e d ict e d values. T h ey found that their method did somewhat better than methods with fixed size donor pools of either 3 or 10 clo se s t cases. H owever the differences among the t h re e methods were sufficiently small that the adaptive method hardly seems worth the extra com p u t a t io n a l cost. In d o i n g p r e d ict iv e mean mat c h in g , it is also important to adjust for the fact that the regr e s s i o n coefficients are on ly estimates of t h e true coefficients . As in the p ar ame tr ic case, this can be acco m p li s he d by ra n d o m ly drawing a new set o f r eg r e s s i o n parameters from t h e i r p ost e rio r distribution before c a lcu lat in g p re d icted values for each ,
,
,
.
"
"
,
"
•
to do it:
i l l l P l l 1 c d dat a s e t . Here is how
1 . Regress Y on X (a vector of covari ates) for the
n 1 cases with no missing
data on Y , producing regression coefficients b (a k x
residu a] vari ance estimate
52 .
1 vecto r) and
2 . M ake a random d raw fro m the posterior distribut ion o f t h e res idual
varian ce (assuming a noninformative prior) . This is accompli shed by calcul ating ( n 1
-
k) 52/X2,
where X2 represents a random draw from a
chi-s quare distribution with n 1
-
first such random draw .
k
degrees of freedom. Let
s[l ]
be the
3 . Make a random draw from the posterior distribution of the regression coefficients. This is accomplish ed by drawing from a mul tivariate normal distrib ution with m ean b and covariance matrix
srlJ (X'X) -- l , where
X is
an n 1 x k matrix of X values . Let h [ 1 J be the first such random draw .
See Sch afe r ( 1 997) for practi cal suggestions on how t o do this .
For e a ch new set of regression p a rameters, p redicted values are
genera t ed for all
c as e s
.
Then, for
each
case
with m i ssi n g
data on Y,
we form a donor pool b a s ed on the predicted values and randomly choose one of the observed value s of Y
from
the donor pool . Th is
appro ach to predictive mean m atc h ing can be generalized to
with mis s ing data, al t hough the co m plex (Little, 1 988) .
th an one Y v a ri a ble
m ay become r a ther
more
computations
Sampling on Empirical Residuals I n the
data au gmentation
m eth od , residu al values are sampl ed
from a standard normal distribution and then adde d to the r egr es s io n
method
p re d ic t ed
values to get the final impu t e d values. We can modify this be l ess dependent on
parametric assumptions by m aking random draws from the actual set of r e s idu als produce d by the lin ear r e g r e ssion. This can yield imputed values whose distribution is more like t h a t of the ob se rv ed variable ( Rub in 1 987), although it is still pos s ibl e to get i mpu t ed values that a r e outside the permissibl e to
,
range . As
with
other app ro a c h e s to multiple imp u t a t i o n t h e re ,
arc
some
i mpor t an t s ubtleties involved in doi n g this properly. As before, let Y be the variable with missing data to be i m p ut e d for
observed data on ing
a
n1
co n s t an t ) with
ca s e s .
Let X
be a
kx1
no missing d ata o n the
no
c a s es with ,
v ec t or of variables (inclu d n1
c a s es We begin by per .
forming the preceding three steps to obtain the li n e a r r eg r es s i o n of
" �., , ,
I
:
61 Y on X and generate random draws from the posterior distribution ( )f the parameters. Then we add the following steps: 4. Based on the regression estimates in step 1 , calculate standardized resid uals for t h e cases with no missi n g data:
5 . Draw a simple random sample (with replacement) of n 1 residuals calcul ated in step 4.
no
values from the
6. For the no cases with missing data, calculate imputed values of Y a s
where e i rep r esen t s the residuals drawn in step 5 , an d b[ l
and Si l l a re the first random draws from the post e ri o r distribution o f the para m e t ers . I
These six steps produce one completed set of data. To get additional data sets, simply repeat steps 2 through 6 (except for step 4, w hic h should not be repeated). As Rubin (1987) explained, this methodology can be re ad i ly extended to data sets with a monotonic missing pattern on several variables. Each variable is imputed using as predictors all varia hlcs that are observed when it is missing. The empirical residual me tho d also can be modified to allow for heteroscedasticity in the im pu ted values (Schenker & Taylor, 1 9 9 6 ). For each case to be impu ted, t h e pool of residuals is restricted to those observed cases that h ave pre dicted values of Y that are close to the predicted value for the case with missing data. Example
Let us try the partially parametric methods on a subset of the col lege data. TUITION is fully observed for 1 ,272 colleges. (For SlIl1p1 ic ity, we shall exclude the 30 cases with missing data on this variable.) Of these 1 ,272 colleges, only 796 report BOARD , the annual ave r age cost of board at each college. Using TUITION as a pred icto r, our goal is to impute the missing values of B OARD for the other 476 colleges, and estimate the mean of BOARD for all 1 ,272 college s . First, we apply the methods we have used before . For the 796 col leges with complete d ata (listwise deletion), the average BOARD is $2,060 with a standard error of 23 . 4 . Applying the EM algorithm to
f
I
I
I
{ I I' I ' I ( ) N a n d BOARD,
a mean B OARD of 2,032 (but no - , L I I I ( I : l r<.l e rror) . T he EM estimate of t h e correlation between BOARD . l I l d ' l ' lJ I TION wa s 0.555. M u l t i p l e i mp ut atio n under the m ultiv a r i at e I I ( ) l' I n a J m ode l lls i og data augmentation gave a mean B OARD of 2 , 040 \v i l h an estim ated standard error of 2 1 . 2 . I lec ause BOARD i s h i g h ly skewed to the r ig h t there i s reason to suspect t h at t h e multivariate norma l model may not b e appropriate. Qui te a few of t he va l u es i mpu te d by data augmentation were less th an the mi ni mum observe d value of 5 3 1, an d one impute d value was n egative. P erhaps we can do better by sampling on t he e mp ir ical residuals. Fo r the 796 cases w i t h data on both TUITI ON an d BOARD , th e o r d i n a ry least-squares ( O LS) regression of BOARD on we
ge t
,
TUITION was
B OARD
==
1 , 497.4
+
67 .65
*
TUITIONj l , OOO
a ro o t me an s qu a re d error ( r ms e ) estimated at 542.6. Standard ized re s i d u al s fr orn t his regression were calculated for the 796 cases. Th e est i ma te d r e g r e s s io n parameters were used t o make five ran dom draws from the po s t e r i o r distribution of the parameters as i n steps 2 a n d 3 ( assuming a noninfo rmative prior) . The drawn val
with
ues we r e Slope
Intercept
rnlse
1 5 36.40
66.6509
53 1 . 99 0
1 5 03 . 65
7 1 .5 9 1 6
5 5 2 . 708
1501 .61
66.975 6
554. 800
1 486 .84
66.9850
548 . 400
1 5 04 .23
6 1 . 2308
5 3 4 . 895
first compl eted d ata set, 4 7 6 residual values were randomly drawn with replacement from among the 796 c as e s . These stan d ard ized residuals were a s s i g ne d arb i tr ari ly to the 476 c as e s wit h missing d a t a on BOARD . L e t t i n g E be the a s s i gn e d residual fo r a given case, the imputed values for B OARD were generated as
To cre ate the
B OARD This
�
1 , 536 .40
at
ea ch
st ep.
66.6509
*
TUITION/ l , OOO
+
5 3 1 .990
*
E.
e at e d for the four re m a i ni n g d ata sets, with new res i d u a l s a n d new values of the regre ssion parameters
pro ces s w a s
sampl i ng o n the
+
re p
63
the five data sets were produced, t h e mean and s t an da rd e rro r \vcre computed for each d ata set, and the results were combined using formula 5 6 1 for the st and ard error. The final e stimate for the mea n of BOARD was 2,035 with an e s t i m a te d standar d error of 20.4 , whi ch i s ( I uite close to multiple imputation based o n a normal distribution o Now let us try predictive mean matching6 Based on the coeffi ci e nts from the OLS regression of BOARD on l'UITION, I generated five n ew random draws from the posterio r distribution of th e regres s io n parameters: Once
Intercept
Slope
rmse
-
1 465 .89
67.8732
557.5 3 1
1 548 . 98
64. 5 723
5 3 9.95 2
1 428.82
67. 390 1
5 1 2.38 1
1469 .34
67 . 375 0
5 5 0 . 945
1 5 1 7 .92
66. 1 926
5 34 .804
of parameter values, I generated p re d icte d values of BOARD for all cases, both observed and m i s s in g . For e ach ca s e with mi ss i n g data on BOARD, I fo u n d the five observe d cases ,,,ho se predicted values were closest to the predicted value for the cas e with missing data. I randomly chose one of those five cases an d assigne d its observed value of BOARD as the imputed value for the mi ssing c ase This process was repeated for eac h of t h e five sets of param e te r values to produce five complete data sets . (It is just coincid e nce that the number of data sets is the s ame as the number of observe d c ases matched to each missing case.) The mean and stan d a r d error w e re then computed for each data set and the result s wer e comb ined in t h e usual way. The com bi ne d mean of B OARD was 2,028 wit h an es t im ated standard error of 23.0. All fo u r imputation m e t ho d s produced similar estimates o f the me an s and all were noticeably lower th an the mean base d on lis t wise deletion. Schenker and Taylo r (1996) s ugge ste d t h at a l th o ugh p arametric and partially parametric imputation me t h od s tend t o yield quite similar e s ti m at e s of mean structures (including regress ion coe f ficients) , th ey may pro duce more d iver g ent results for the m arg i na l distribution of the impu t e d variables. Their simulations in d ica te d t h at for applications where the marginal distribution is of m aj or inte rest , partially parametric models have a distinct adva n t ag e This wa s es p e c ially tru e when the regressions us e d to g e n e r a t e p redicted val u e s For t he
first set
.
.
were misspecified in various ways.
I
i'
I • I
64 Sequential Generalized Regression Models
() n c of t h e
attractions
of d at a augmentation
is
that, unli ke the non
p a r a nl c tric and semiparametric methods h a n d l e d at a
sets
with
just discussed , it e asily can a substantial nU111ber of variables with rnissing
uata. Unfortunately, this m e thod re quires specifying a multivari ate
u i striblltion for
all the variables, and that is not an e asy thing to d o
when the vari ables are of many d iffe rent types, for example , continu
ous, binary, and count data. Another appro ach has been
proposed
for
handling missing data in large , complex data sets with several different variabl e typ e s . I nstead of fitting a single comprehensive model (e .g., the multivariate normal ) , a sep arate regression model is specified for each vari able that has any missing data. Fo r e ach dependent variable,
The method imputing missing
the regress ion model is chosen to reflect the type of data.
i nvolves cycl i ng through s everal regression model s, values at each step .
Although this app roach i s very appealing, it does not yet have as strong a theoretical justification as the other methods we have consid ered. At this writin g, the only detailed accou nts are the unpublished reports of Bran d
(1 999), Va n
Buuren and Oudshoorn
( 1 999) , an d Solenberger ( 1 999) . In
Raghunathan, Lepkowski , Van Hoewyk, an d
the Raghun ath an et al . version of this method, the available mode ls include normal linear regression, mi al logit, and
Poisson
b i n a ry
logistic regression , multi no
regression. The regression models are esti
m ated i n a particular order, beginning with the dependent variable
with
the least missing data and p roceeding to the dep e ndent vari
ab le with the most missing data. Let us denote these vari ables by Yl
through
Yk
an d let
X
denote the set of variables with no missing dat a .
The first " round" of estimation proce eds as follows . Regress Y1 on
X and generate
imputed values using a method simil ar to
that
described previously for th e m ultinomial logit model in the section " O ther Parametric Approaches to
Multiple Imputation . " Bounds
and restrict ions m ay be pl aced on the imputed values. Then regre ss
Y2 on X and Y1 , including the imputed val ues of Y1 , imputed values for Y2 ' Then regress y:, on X, YJ , and
imputed values on both
Ys ) .
now
subsequent
prcspecificd
rounds repeat this pro
each variable is regressed
using any i m p u ted values from previous fo r a
Y2 (including
Continue until al l the regressions have
been estimated. The second and ce ss, except that
and generate
steps.
on all
o ther variab les
The p rocess continu es
number of rou n ds or until stab l e imputed values
65
occur. A SAS macro for accomplishing these tasks is available at http ://www . isr.umich.edu/src/smp/ive. For their version of the method, Van Buuren and Oudshoorn coined the name MICE (for multiple imputation by chained equa tions) , and they developed S-PLUS functions to implement it (avail able at http ://www . multiple-imputation.com/) . The major differences between their approach and that of Raghunathan et al. is that MICE does not include Poisson regression, but does allow more options (both parametric and partially p arametric) in the methods for random draws of imputed values. Linear Hypothesis Tests and Likelihood Ratio Tests
To this point, our approach to statistical inference with multiple imputation h as been very simple . For a given parameter, the standard error of the estimate is calculated using formula 5 . 1 . This standard error is then plugged into conventional formulas based on the normal approximation to produce a confidence interval or a t statistic for some hypothesis of interest. Sometimes this is not enough. Often we want to test hypotheses about sets of parameters, for example, that two parameters are equal to each other or that several parameters are all equal to zero . These sorts of hypotheses are particularly relevant when we estim ate several coefficients for a set of dummy variables. In addition, there is often a need to compute likelihood ratio statistics by comparing one model with another, simpler model. Accomplishing these tasks is not so straightfolWard when doing multiple imputation. Schafer (1997) described three different approaches, none of which is totally satisfactory. I will briefly describe them here and we will 1 00k at an example in the next section.
Wald
Tests Using Combined Covariance Matrices
When there are no missing data, a common approach to multiple parameter inference is to compute Wald chi-square statistics based on the parameter estimates and their estimated covariance matrix. Here is a review, which, unfortunately, re quires matrix algebra. S uppose we want to estimate a p x 1 parameter vector � . We have estimates p and estimated covariance matrix C& We want to test a linear hypothesis expressed as LI3 c, where L is an r x p matrix of constants and c is an r x 1 vector of constants . For example, if we want to test the A
==
66 I
hypothesis that the first two ele ments of we need L
is
=
co m p u t e d
[1
-
O . The Wald test
l (L� - c ), lLCL/ J - (L� - c ) ,
[6. 1 ]
0
==
OJ
are equal to e ach other, c =
1
0
0
.
as w
..
[!
and
which has an approximate chi-square distribution with r degrees of fre edom under the null hypothesis . 1 3
Now we � eneralize this method to the multiple imputation setting. _ Instead of il, we c an us e (3 , the mean of the estimates across several
completed data sets, that is,
Next we need an estimate
of
the covari ance m atrix that combines t he
within-sample variability and the betwe en-sample variability. Let
Ck
be the e stimated covariance m atrix for the parameters in data set k -
and let C b e the average of those matrices across the
The between-sample variability is defined as
M
data sets.
k
The comb ined estimate of the covariance matrix is then
c
==
C+
(1
+ 1 / M )B ,
which is j ust a multivariate generalization of formula 5 . 1 without th� square root. We get our "-
substituted for (J and
M
C.
t e s t statistic by
formula
6. 1
with
-
p
and C
Unfortun ately, this does not work well in the typ ical c ase where
is
5
or less. In such cases, B is a rather unstab le e stimate of the
between-sample covariance , and the resulting distribution chi- squ are .
Schafer
of W
is not
( 1 997) gave a more stable estimator for the covari
ance matrix, b ut this required the unreasonable assum p t io n that the
fraction of missing information is the same for all the elements of "...
fl . Neve rthe l ess, some s i m u l atio n s show that this a l t e r n at e m e t h o d
\vorks well even when the assumption is violate d . This method has been incor p orated into the
SAS
proce dure MIANALYZE .
I ,
,, 67 I jkelihood Ratio Tests
of interest is estimat ed by m aximu m likelihood an d 1 h e re are no missing dat a , mu l tip aram eter tests are ofte n p erfo rmed I )y co mp ut in g likelihoo d r at i o chi-square s. The p roce dure is quite simple . Let 10 be the lo g- l ik el i h o od for a model that imposes the
If the model
, '.
h ypothesis and let /1 be the log-likelih ood for a model that relaxes 10 ) ' t h e hyp o the s i s The likelihood ratio statistic is just L == As before, ou r go al is to generalize th i s to mUltiple imputation.
2( /1
.
-
, [he first step is to perform the desi red likel ihoo d ratio t e s t in each be the mean of the likelih ood ( )f the M com p let e d data
sets. Let L
-
ratio chi-squares comp u t e d
across those M data
sets . Th at is t he e asy
par t. Now comes the hard part . To get those chi-squ ares, it was neces sary to estimate two models in each data set, one wi th the hyp othesis b e the mean of i mp os e d and one with the hypothesis relaxed. Let
il o
par am et e r estimate s when the hyp ot h e s i s is im p os e d and let p ar ameter est imates when the hypothe sis is of the mean b h e t e Bl relaxe d . In e a ch data set, we then comp ute the log-likelihood for a
M
the
with p aramete r values forced to be 130 and a g a in for a mo del with p ara meters set at P l ' Th is obviou sly r equ i res that the software be ab l e to calculate and report l o g like l ihood s for user- specifie d
model
-
(
-
B a s ed on these two
log-likelih�ods, ratio chi- squa re is co m pu t e d in each data set. Let L b e the se chi -squ are sta tist ics across t� lV! sam p le s .
parameter values.)
The final test statistic is then
a l ikel i hoo d
the mean of
L/(r + ( ��i )([ - L) ) , �
where
r
is
re st ri c tio ns imposed by the hypothesis . Under the nul l hypothesis, this statistic h as ap p roxim ate ly an F d i s t r ibut i on with nume r ato r degrees of freedom e qu a l t o The denom inator degrees
the number
of
of
r.
freedo m
and let
(d.d.f) is somewhat awkward to calculate . Let t
q ==
If
d d .
>
t .
f
.
==
4, t he
(
t l +
d . d . f.
==
l/r) ( l +
M+l M-1
4 + (t 1/q) 2 j2 .
-
4)[1
==
r eM 1 ) -
,......
L-L
-
r +
(1
•
-
2 / t )/ q]2 . If
t :::;
4,
t he
Combining Ch i-S q ua re Statistics Both the Wald test and the likelihood ratio test lack the appealing s i mplic i ty of th e single-par ameter methods use d e arlie r. In particular,
I I
I
, 68 t hey re q u i re t h a t t h e analysis software h ave specialized options and
out p u t , sOIn c t h ing we h ave generally tried to avoid. I now discuss
a
t h i rd Ill Ct hod th at is easy to compute from s tand ard outp ut, but m ay
not
be as a ccura te as the other two methods ( Li, Me ng, Raghunathan,
L� l{ub i n ,
1 99 1 ) . Al 1 t h at is nee d e d i s
the
conve ntion al chi-square
s t a t i s tic ( either Wal d or likelihood ratio) calcul ated in e ach of the
M
co rnpl c te d data sets, an d the associated degrees of free dom. Le t
d�
-
be a chi-square s t atistic with r degrees of freedom calculated
i n data set k . Let
sets and let
s3
d2
be the mean of these statist ics over t he
be the sample variance of the square roots of
squ are statistics over the M data sets, that is,
2 Sd
==
1
M
the
data chi
- 2
"
Lt (dk d) . M-1 k
The proposed t e st statistic is D
=
Under the null hypothesis, bution with
r
;]2 /1' - ( 1 - 1/M)s3 . 1 + ( 1 + l /M)s� this s t ati s t i c
has approxim ately an F dis t r i
as the nume rator degrees of freedom. The denominator
degree s of fre edom is approximated by
M-l r3/lv1 a
I h av e written
M
1 + ----(M + l /M)s�
•
SAS m acro (COMBCHI) to perform th ese com
pu tations a n d compute a p value . It is avai lable on my Web site
(http://www . ssc .upen n .cdu/''-'all ison) . To use it, aJ l you need to do
is e nter several chi -square values and the degrees of freedom .
macro returns
a
p va l ue
The
.
MI Example 2
Let us consider another detailed empirical example that i l l ustrates some of the techniques discussed in thi s chapter. The data set con sists of 2, 992 respondents to the 1 994 Gene ral Social Survey (D avis Smith, 1 9 97) . O ur dependent variable is SPANKING,
a
&
response to
69 t
he qu e s t i o n ,
" Do you strongly agree, agree, dis agre e or strongly dis agree that it is s om e ti me s necessary to d i s c ip lin e a chil d with a good , h ard spanking?" As the ques tio n itself i nd i ca t e s, there were four pos s i b l e ordered responses, coded as i nt e ger s 1 through 4. By d es i g n this question w as part of a module that wa s a d m in is t e red on ly to a ran dom two-thirds of t h e s ample. Thus, there were 1 ,0 15 c a se s that were Tn i s s i n g co m p l e t e ly at random. In a dd i t i on another 27 re s p on d e nt s were mi ssing with responses coded " don't kn ow or "no answ e r Our goal is to estimate an ord e r e d log is tic (cumulative logit) mode l (McCullagh, 1 980) in which SPANKING is pre dicted by the following ,
,
,
"
.
"
variables: AGE
Respond ent's age in years, ranging from 1 8 to 89. Missing 6
cases. EOUC INCOME
Number of years of schooling. Missing 7 cases.
Household income, coded as the midpoint of 21 interval cat egories, in thousands of doll ars. Missing 356 cases .
FEMALE BLACK MARITAL REGION NOCHILD
1
1
==
fem ale; 0
==
black; 0
=::
==
mal e
.
white, other.
Five categories of marital status. Miss ing 1 case. Nine categories of region.
1
==
no children; otherwise O . Missing 9 cases.
One a d d it i on al variable, NO DOUBT, r e q u i r e s furt h e r expl a na ti o n Respondents we re asked about their beliefs in Go d There were six response categories r ang in g fro m "1 don't believe in God" to "I know Go d re ally exists and I have no doubts about it. " The l atter statement was the modal response with 62% of the re spo n d e n ts Howeve r like the sp anki ng q u e sti o n this question was part of a m od u l e that was only as ke d of a random subset of 1,386 respondents. So there were 1,606 cases mis s in g by de sign Another 60 cases were treated as miss ing because they s aid " don't know" or " no answer. " As used here, the var iable was coded 1 if the respondent had " no doub ts ; otherwise it wa s coded O. Most of the missing da t a are on three variables, SPANKI N G .
.
,
.
,
.
"
,
NODOlJBT, and INCOME. There were five m aj o r missing data p att e rn s in the sample, account in g for 96% of r e spon d ent s : 77 1 cases
927 cases
No missing d ata on any variables Missing NODOUBT only
..
'I
�I
-
-
1 ,
70 42 1 cases
80 cases
Missing SPANKING only Missing INCOME only
509 cases
Missing SPANKING an d NODOUBT
160 cases
M issing NODOUBT and INCOME
As usual, the simplest approach to data analysis is listwise dele tion, which uses o nly 26% of the original sample. To specify the model, I cr ea t ed dummy variables for three marital status categories: NEVMAR (for never married), DIVSEP (for divorced or separated), and WIDOW (for widowed), with married as the reference category. Three dummy variables were created for region (with West as the omitted category) .14 Results (produced by PROC LOGISTIC i n SAS) are shown in the first column of Table 6 . 3 . Blacks, older respondents, and respondents with "no doubts" about God are more l ikely to favor spanking. Women and more educated respondents are more l i ke ly to oppose it. There are also maj or regional differences: respondents from t he South were more favorable toward spanking and those from the Northeast were more opposed. On t h e other hand, there is no evidence for any effect of income, m arit a l status, or having children. Because 84% of the observations with missing d ata are miss ing by design (and hence completely at random), listwise deletion should produce approximately unbiased estimates. However, the loss of nearly three-quarters of the sample is a big price to pay one that is avoidable with multiple i m put at io n To imple m e n t MI, I fi rs t used the dat a augmentation method under the multivariate normal model described in Ch apte r 5 . Before beginning the process, the sin gle case with missing data o n marital status was deleted to avoid having to impute a multicategory variable. A reasonable a rgument could be made for also deleting the 1,042 c a s e s that were missing t he SPANKING v ariab l e , because cases that are m i s si n g on the depen dent variable contain l i tt le information about regression coefficients . However, there is no harm in in cl u ding them and potentially some benefit, so I kept them in. All 1 3 variables in t h e model were included in the in1putation process without any normalizing transformations. 1�he imputed v a l ue s fo r all the dummy variables were rounded t o o o r 1 . The imputed values for SPANKING were rounded to the i n tegers 1 t h r o u g h 4. Age and inconle had some i m p u t e d values that we re o u t of the valid range, and these were re c od e d to the u p p e r o r lower bounds. The cumulative logit model was then estimate d for .
-- ---- ----
1
-- --------
---- ---- - --
,
'I I TABLE 6.3
Coefficie nt Estimate s (and Stand ard Errors ) for Cumu lative Logit Model s Predic ting SPANKIN G -
Variable
List wise
Normal Data
Se qu en tia l
Deletion
Augmentation
Regression
-0.355 (0. 141 ) * -0.4 8 1 (0.08 9 ) * * * 0 .565 (0 .218 ) * * 0 .756 (0. 1 1 7 )* * * BLACK - 0 . 00 3 6 (0.003 3) -0.00 52 (0.0020) * INCOM E -0.055 (0.027) * -0.061 (0.016 ) * * * EDUC 0 , 4 54 (0. 147) * * 0.465 (0. 120) * * NODOUBT NOCHILD - 0.205 (0.199) -0.109 (0. 1 12) 0 . 0043 (0.003 2) 0.010 (0.00 5 ) * AGE - 0.7 1 2 (0.219) * * -0 .444 (0. 1 25 ) * * * EAST -0. 1 6 1 (0. 136) MIDWEST -0.12 2 (0.20 3) 0.404 (0. 1 9 1 ) * * 0.323 (0. 156) * SOUTH -0.0 46 (0.238) -0.075 (0. 148) NEVMAR -0. 1 9 1 (0. 194) -0.203 (0. 150) DIVSEP 0. 1 48 (0.298 ) - 0.244 (0. 150) W I D OW p < 0 .05;
(Number A1is.\'/I I.1 : on SPA NJ\/N( ,' )
-
FEMALE
*
Seq. RCRrt 'ssi( ) 1 1
**
p < 0.01 ;
***
- 0.449 (0.098) * * * 0.693 (0.1 19) * * * - 0 . 0042 (0 .0 027) -0.0 73 (0.019 ) * * 0 .455 (0. 1 5 6) * - 0 . 14 1 (0.164) 0.003 1 (0.0 032) - 0.5 1 9 (0. 15 6) * * - 0.228 (0 .14 9) 0.262 (0. 1 2 9) * -0.036 (0. 17 3) -0. 14 1 (0. 12 8) - 0. 1 16 ( 0. 1 7 7 )
- 0 , 489 (OJ)'). ! ) I, ' 1 ' 1 0.68 5 ( 0. I . h ) '1 -0.00 47 (0.00, ) .) ) , 1 -0.068 (0.0 1 <) ) 0.438 (O. I ) I ) � -0.091 (O. I L�_ q 0.0040 (0.00 \ I ) -0,488 (0. 1 ' ( 1 ) '1 - 0. 159 (0. L�H ) 0. 357 (0. L) I ) -0.071 (0 . I _� I ) -0. 184 (O. I . '(! ) - 0.215 (O. I I I ) I,
,I
I
I
1
-I'
I
j
� -1-
P < 0 .00 1 .
each of the five data sets and the estima tes were combined tlsi l1'� 1 1 l C standard formulas. Results are s h own in the second column of Tabl e 6. 3 . The has i c p; 1 f t ern is the same, althoug h there is now a significa nt effect of I N ( '( ) M I and a loss of signifi cance for AGE. What is most striking is t h a l n i t " standa rd errors for all the coefficients are substantially lowe r I I , ; I I I those for listwise deletio n, typical ly by abou t 40 % . Even the s t a n d ; ! ! d error fo r NODO UBT is 1 8 % small er which is surprising g i v e n t 1 1 ;11 over h al f the cases were missing on this variable . The small er s t a t u L I i d errors yield p values that are much lower for many of the va r i a h l e : , The cumul ative logit model i mposes a constraint on the d a t a k I l O \V I I as the propOrliona l odds assumption . In brief, this phrase lll c a n s t h . 1 I the coefficients are assum ed to be the same for any di c hotOll l l /,; l t l O l l of the depend ent variabl e. P R O C LOGIS TIC reports a ch i " sq l L I I ' statistic ( score test) for the null hypothe sis that the proport iOlla I o d d , . assump tion is correct . However, because we are worki n g w i t h 1 1\' 1 ' data sets, we get five chi-squ ares , each with 26 degrees o f fn:..-dol l l 32.0, 3 1 .3, 3 8 .0, 36.4, an d 35.2. Using the COMBCHI m acro d e st'l ! ! I ear lier, these five values are combined t o pro duce a p v a hl l' ( ) j _ " I . '
,
(
H -j
t
72
suggest i ng that the constraint imposed by the model fits the d ata we ll .
For e a ch of the five data sets, I als o calculated Wald chi-square s for
the
hypothesis that all the region coefficients were zero . With
null
3 d . L , t he values were 72. 9 , 8 1 . 3 , 53 .4, 67.7, and
val ue
was
67.0.
.00002 .
The comb ined p
In the third column of Table 6 . 3 , we see results from applying
the multiple imputation method of Raghunathan et al . ( 1 9 99) , which
relies on s equential generalized regressions. A regression model was
estimated for e ach vari ab l e with missing data, which was taken as the depend ent variable, and all other variables were predictors. These regression m odels then were used to generate five sets of random imputat ions . For ED UC and INCOME, the models wer e ordinary lin
ear regression models, a l t ho u g h upper and lower bounds were built
into the imp utation process. Log i sti c regression models were speci
fied fo r NODOUBT and NOCHILD . A multinomial logit model was used for
SPANKING.
There were 20 rounds in the imputation pro
cess, which mean s that for e ach of the five comp leted data sets, the variables with missing d at a were sequentially i mpu t ed
20
times before
using the final result.
As before, the cumulative logit model was estimated on each of the five completed data sets and the results were combined using the standard formulas. The coefficient estimates in the third column of Table 6.3 are quite similar to those for multivariate normal data augmentation. The standard errors are generally a little higher than those for data augmentation, although not nearly as high as those for listwise deletion. Somewhat surprising is the fact that the chi-square statistics for the proportional odds assumption are nearly twice as large for sequential regression compared with those for normal data augmentation. Specifically, with 26 degrees of freedom, the values were 85.4, 59.0, 54.9, 59.9, and 66.7, each with a p value well below .001. However, when the values are combined using the COMBCHI macro, the resulting p value is .45. Why the great disparity between the individual p values and the combined value? The answer is that the large variance among the chi-squares is an indication that each one of them may be a substantial overestimate. The formula for combining them takes this into account.

What accounts for the disparity between the chi-squares for normal data augmentation and those for sequential regression? I suspect that it stems from the fact that the multinomial logit model for imputing SPANKING did not impose any ordering on that variable. As a result, the imputed values were less likely to correspond to the proportional odds assumption. When the sequential imputations were redone with a linear model for SPANKING (with rounding of imputed values to integers), the chi-squares for the proportional odds assumption were more in line with those obtained under normal data augmentation.

Alternatively, I redid the sequential imputations after first deleting all cases with missing data on SPANKING. SPANKING was still specified as categorical, which means that it was treated as a categorical predictor when imputing the values of other variables. Again, the chi-squares for the proportional odds assumption were similar to those that resulted from normal data augmentation. The last column of Table 6.3 shows the combined results with sequential regression imputation after deleting missing cases on SPANKING. Interestingly, both the coefficients and their standard errors are generally closer to those for data augmentation than to those for sequential imputation with all missing data imputed. Furthermore, there is no apparent loss of information when we delete the 1,042 cases with data missing on SPANKING.

MI for Longitudinal and Other Clustered Data
So far, we have assumed that every observation is independent of every other observation, a reasonable presumption if the data are a simple random sample from some large population. However, many data sets are likely to have some dependence among the observations. Suppose, for example, that we have a panel of individuals for whom the same variables are measured annually for five years. Many computer programs for analyzing panel data require that the data be organized so that the measurements in each year are treated as separate observations. To link observations together, there also must be a variable that contains an identification number that is common to all observations from the same individual. Thus, if we had 100 individuals observed annually for five years, we would have 500 working observations. Clearly, these observations would not be independent. If the multiple imputation methods already discussed were applied directly to these 500 observations, none of the longitudinal information would be utilized. As a result, the completed data sets could yield substantial underestimates of the longitudinal correlations, especially if there were large amounts of missing data.
Similar problems arise if the observations fall into naturally occurring clusters. Suppose we have a sample of 500 married couples, and the same battery of questions is administered to both husband and wife. If we impute missing data for either spouse, it is important to do it in a way that uses the correlation between the spouses' responses. The same is true for students in the same classroom or respondents in the same neighborhood.
One approach to these kinds of data is to do multiple imputation under a model that builds in the dependence among the observations. Schafer (1997) proposed a multivariate, linear mixed-effects model for clustered data and also developed a computer program (PAN) to do the imputation using the method of data augmentation (available on the Web at http://www.stat.psu.edu/~jls/). Although a Windows version of this program is promised, the current version runs only as a library to the S-PLUS package.

There is also a much simpler approach that works well for panel data when the number of waves is relatively small. The basic idea is to format the data so that there is only one record for each individual, with distinct variables for the measurements on the same variable at different points in time. Multiple imputation is then performed using any of the methods we have considered. This allows for variables at any point in time to be used as predictors for variables at any other point in time. Once the data have been imputed, the data set can be reformatted so that there are multiple records for each individual, one record for each point in time.
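The reshaping round trip is easy to express in pandas-style code. The data frame, variable names, and values below are hypothetical, and any MI routine could stand in for the commented imputation step:

    import pandas as pd

    # Hypothetical long-format panel: one row per person-year
    long = pd.DataFrame({
        "id":     [1, 1, 1, 2, 2, 2],
        "year":   [1, 2, 3, 1, 2, 3],
        "income": [30.0, None, 34.0, 51.0, 50.0, None],
        "health": [3.0, 4.0, None, 2.0, None, 2.0],
    })

    # Long -> wide: one record per person, distinct columns per wave
    wide = long.pivot(index="id", columns="year",
                      values=["income", "health"])
    wide.columns = [f"{var}_{yr}" for var, yr in wide.columns]

    # ... impute `wide` here with any MI method; each wave's measurement
    # can now serve as a predictor for the others ...

    # Wide -> long again for analysis
    back = wide.reset_index().melt(id_vars="id")
    back[["var", "year"]] = back["variable"].str.rsplit("_", n=1, expand=True)
    long_again = back.pivot(index=["id", "year"], columns="var",
                            values="value").reset_index()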
MI Example 3
Here is an example of multiple imputation of longitudinal data using the simpler method just discussed. The sample consisted of 220 white women, at least 60 years old, who were treated surgically for a hip fracture in the greater Philadelphia area (Mossey, Knott, & Craik, 1990). After their release from the hospital, they were interviewed three times: at 2 months, 6 months, and 12 months. The following five variables, measured at each of the three waves, were considered.

CESD    A measure of depression, on a scale from 0 to 60.
SRH     Self-rated health, measured on a four-point scale (1 = poor, 4 = excellent).
WALK    Coded 1 if the patient could walk without aid at home; otherwise coded 0.
ADL     Number of self-care "activities of daily living" that could be completed without assistance (ranges from 0 to 3).
PAIN    Degree of pain experienced by the patient [ranges from 0 (none) to 6 (constant)].
Our goal was to estimate a "fixed-effects" linear regression model (Greene, 2000) with CESD as the dependent variable and the other four as independent variables. The model has the form

yᵢₜ = αᵢ + βxᵢₜ + eᵢₜ,

where yᵢₜ is the value of CESD for person i at time t and the eᵢₜ satisfy the usual assumptions of the linear model. What is noteworthy about this model is that there is a different intercept αᵢ for each person in the sample, thereby controlling for all stable characteristics of the patients. This person-specific intercept also induces a correlation among the multiple responses for each individual.

To estimate the model, a working data set of 660 observations, one for each person at each point in time, was created. There are two equivalent computational methods for getting the OLS regression estimates: (1) include a dummy variable for each person (less 1) or (2) run the regression on deviation scores. The second method involves subtracting the person-specific mean (over the three time points) from each variable in the model before running the multiple regression; a sketch of this approach appears below.

Unfortunately, there was a substantial amount of attrition from the study, along with additional nonresponse at each of the time points. If we delete all person-times with any missing data, the working data set is reduced from 660 observations to 453 observations. If we delete all persons with missing data on any variable at any time, the data set is reduced to 101 persons (or 303 person-times). Table 6.4 displays fixed-effects regression results using four methods¹⁵ for handling the missing data. The first two columns give coefficients and standard errors for two versions of listwise deletion: (1) deletion of persons with any missing data and (2) deletion of person-times with any missing data. There is clear evidence that the level of depression is affected by self-rated health, but only marginal evidence for an effect of walking ability. It is also evident that the level of depression in waves 1 and 2 was much higher than in wave 3 (when most of the patients had fully recovered).
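The deviation-score ("within") estimator mentioned above is easy to sketch in Python. The book's analysis used SAS PROC GLM with the ABSORB statement (see Note 15); the data frame and column names here are assumptions, and the data are assumed complete at this point (after imputation or deletion):

    import numpy as np
    import pandas as pd

    def fixed_effects_ols(df, y, xvars, person="id"):
        # Subtract each person's mean from every variable, then run OLS.
        # This is numerically identical to including a dummy variable for
        # every person; no intercept is needed after demeaning.
        cols = [y] + xvars
        demeaned = df[cols] - df.groupby(person)[cols].transform("mean")
        X = demeaned[xvars].to_numpy()
        Y = demeaned[y].to_numpy()
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return dict(zip(xvars, beta))

    # Usage on a hypothetical person-time data set `panel`:
    # fixed_effects_ols(panel, "CESD",
    #                   ["SRH", "WALK", "ADL", "PAIN", "WAVE1", "WAVE2"])

Note that naive OLS standard errors from the demeaned regression need a degrees-of-freedom correction for the absorbed person effects; the sketch returns coefficients only.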
TABLE 6.4
Coefficient Estimates (and Standard Errors) for Fixed-Effects Models Predicting CESD

                   Listwise Deletion  Listwise Deletion  Data Augmentation  Data Augmentation
Variable           by Person          by Person-Time     by Person-Time     by Person
SRH                 2.341 (0.586)**    1.641 (0.556)**    2.522 (0.617)**    1.538 (0.501)**
WALK               -1.552 (0.771)*    -1.381 (0.761)     -1.842 (0.960)     -0.550 (0.825)
ADL                -0.676 (0.528)     -0.335 (0.539)     -0.385 (0.562)     -0.410 (0.435)
PAIN                0.031 (0.179)      0.215 (0.168)      0.305 (0.180)      0.170 (0.164)
WAVE 1              8.004 (0.650)**    7.045 (0.579)**    8.787 (0.613)**    9.112 (0.615)**
WAVE 2              6.900 (0.729)**    7.930 (0.520)**    5.808 (0.642)**    8.131 (0.549)**
N (person-times)    303                453                660                660

*p < .05; **p < .01.
There is little or no evidence for effects of ADL and PAIN.

The last two columns give results based on the full sample, with missing data imputed by data augmentation under the multivariate normal model.¹⁶ For results in the third column, the imputation was carried out on the 660 person-times treated as independent observations. Thus, missing data were imputed using only information at the same point in time. To do the imputation for the last column, the data were reorganized into 220 persons with distinct variable names for each time point. In this way, each variable with missing data was imputed based on information at all three points in time. In principle, this should produce much better imputations, especially because a missing value could be predicted by measurements of the same variable at different points in time. In fact, the estimated standard errors for the last column are all a bit lower than those for the penultimate column. They also tend to be a bit lower than those for either of the two listwise deletion methods. On the other hand, the standard errors for data augmentation based on person-times tend to be somewhat larger than those for the two listwise deletion methods. In any case, there is no overwhelming advantage to multiple imputation in this application. Qualitatively, the conclusions would be pretty much the same regardless of the imputation method.
7. NONIGNORABLE MISSING DATA

Previous chapters have focused on methods for situations in which the missing data mechanism is ignorable. Ignorability implies that we do not have to model the process by which data happen to be missing. The key requirement for ignorability is that the data are missing at random: the probability of missing data on a particular variable does not depend on the values of that variable (net of other variables in the analysis).

The basic strategy for dealing with ignorable missing data is easily summarized: Adjust for all observable differences between missing and nonmissing cases, and assume that all remaining differences are unsystematic. This is, of course, a familiar strategy. Standard regression models are designed to do just that: adjust for observed differences and assume that unobserved differences are unsystematic.

Unfortunately, there are often strong reasons to suspect that data are not missing at random. Common sense tells us, for example, that people who have been arrested are less likely to report their arrest status than people who have not been arrested. People with high incomes may be less likely to report their incomes. In clinical drug trials, people who are getting worse are more likely to drop out than people who are getting better.

What should be done in these situations? There are models and methods for handling nonignorable missing data, and it is natural to want to apply them. However, it is no accident that there is little software available for estimating nonignorable models (with one important exception: Heckman's selectivity bias model). The basic problem is that, given a model for the data, there is only one ignorable missing data mechanism, but there are infinitely many different nonignorable missing data mechanisms. So it is hard to write computer programs that will handle even a fraction of the possibilities. Furthermore, the answers may vary widely depending on the model chosen. So it is critically important to choose the right model, and that choice requires very accurate and detailed knowledge of the phenomenon under investigation. Worse still, there is no empirical way to discriminate one nonignorable model from another (or from the ignorable model).

I will not go so far as to say, "Don't go there," but I will say this: "If you choose to go there, do so with extreme caution." In addition, if you do not have much statistical expertise, make sure you find a collaborator who does. Keeping these caveats in mind, this chapter is designed to give you a brief introduction and overview of some approaches for dealing with nonignorable missing data. The first thing you need to know is that the two ignorable missing data methods I have been pushing for (maximum likelihood and multiple imputation) can be readily adapted to deal with nonignorable missing data. If the chosen model is correct (a big if), these two methods have the same optimal properties that they have in the ignorable setting. A second point to remember is that any method for nonignorable missing data should be accompanied by a sensitivity analysis. Because results can vary widely depending on the assumed model, it is important to try out a range of plausible models and see if they give similar answers.
Two Classes of Models
Regardless of whether you choose maximum likelihood or multiple imputation, there are two quite different approaches to modeling nonignorable missing data: selection models and pattern-mixture models. This is most easily explained for a single variable with missing data. Let Y be the variable of interest and let R be a dummy variable with a value of 1 if Y is observed and 0 if Y is missing. Let f(Y, R) be the joint probability density function (p.d.f.) for the two variables. Choosing a model means choosing some explicit specification for f(Y, R). The joint p.d.f. can be factored in two different ways (Little and Rubin, 1987). In selection models we use

f(Y, R) = Pr(R | Y) f(Y),

where f(Y) is the marginal density of Y and Pr(R | Y) is the conditional probability of R given some value of Y. In words, we first model Y as if no data were missing. Then, given a value of Y, we model whether or not the data are missing. For example, we could assume that f(Y) is a normal distribution with mean μ and variance σ², and that Pr(R | Y) is given by

Pr(R = 1 | Y) = p₁ if Y > 0,   p₂ if Y ≤ 0,

where p₁ and p₂ are two constants. This model is identified and can be estimated by ML.
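A quick simulation shows why such a mechanism is not ignorable. Here the marginal distribution of Y is normal and the probability of observation jumps at Y = 0; the two probabilities (0.9 and 0.4) and the parameter values are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    mu, sigma, n = 1.0, 2.0, 100_000
    y = rng.normal(mu, sigma, n)              # Y as if no data were missing
    p_obs = np.where(y > 0, 0.9, 0.4)         # Pr(R = 1 | Y): a step function of Y
    r = rng.random(n) < p_obs                 # observation indicator

    print("true mean of Y:     %.3f" % mu)
    print("mean of observed Y: %.3f" % y[r].mean())   # biased upward

Because low values of Y are less likely to be observed, the mean of the observed cases overstates the true mean, and no adjustment based on observed covariates alone can remove the bias.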
The alternative factorization of the joint p.d.f. corresponds to pattern-mixture models,

f(Y, R) = f(Y | R) Pr(R),

where f(Y | R) is the density for Y conditional on whether Y is missing or not. For example, we could presume that Pr(R) is just some constant δ and that f(Y | R) is a normal distribution with variance σ² and mean μ₁ if R = 1 and mean μ₀ if R = 0. Unfortunately, this model is not identified and, hence, cannot be estimated without further restrictions on the parameters.

Pattern-mixture models may seem like an unnatural way to think about the missing data mechanism. Typically, we suppose that the values of the data (in this case Y) are predetermined. Then, depending on the data collection procedure, the values of Y may have some impact on whether or not we actually obtain the desired information. This way of thinking corresponds to selection models. Pattern-mixture models, on the other hand, seem to reverse the direction of causality, allowing missingness to affect the distribution of the variable of interest. Of course, conditional probability is agnostic with respect to the direction of causality, and it turns out that pattern-mixture models are sometimes easier to work with than the more theoretically appealing selection models, especially for multiple imputation. I now consider some examples of both selection models and pattern-mixture models.
Heckman's Model for Sample Selection Bias

Heckman's (1976) model for sample selection bias is the classic example of a selection model for missing data. The model is designed for situations in which the dependent variable in a linear regression model is missing for some cases, but not for others. A common motivating example is a regression that predicts women's wages, where wage data are necessarily missing for women who are not in the labor force. It is natural to suppose that women are less likely to enter the labor force if their wages would be low. Hence, the data are not missing at random.

Heckman formulated his model in terms of latent variables, but I will work with a more direct specification. For a sample of n cases (i = 1, …, n), let yᵢ be a normally distributed variable with a variance σ² and a mean given by

E(yᵢ) = βxᵢ,    [7.1]

where xᵢ is a column vector of independent variables (including a value of 1 for the intercept) and β is a row vector of coefficients. The goal is to estimate β. If all yᵢ were observed, we could get ML estimates of β by ordinary least squares regression. However, some yᵢ are missing. The probability of missing data on yᵢ is assumed to follow a probit model,

Pr(Rᵢ = 0 | yᵢ, xᵢ) = Φ(α₀ + α₁yᵢ + α₂xᵢ),    [7.2]

where Φ is the cumulative distribution function of a standard normal variable. For a case with yᵢ observed, the likelihood is the density of yᵢ times the probability of being observed,

f(yᵢ | xᵢ)[1 − Φ(α₀ + α₁yᵢ + α₂xᵢ)],    [7.3]

while for a case with yᵢ missing, it is

Pr(Rᵢ = 0 | xᵢ) = ∫ Pr(Rᵢ = 0 | y, xᵢ) f(y | xᵢ) dy = Φ( (α₀ + (α₁β + α₂)xᵢ) / √(1 + α₁²σ²) ),    [7.4]

where the integral is taken from −∞ to +∞. Equation 7.4 follows from the general principle that the likelihood for an observation with missing data can be found by integrating the likelihood over all possible values of the missing data. The likelihood for the entire sample can be readily maximized using standard numerical methods.

Unfortunately, estimates produced by this method are extremely sensitive to the assumption that Y has a normal distribution. If Y actually has a skewed distribution, ML estimates obtained under Heckman's model may be severely biased, perhaps even more than estimates obtained under an ignorable missing data model (Little & Rubin, 1987).

Heckman also proposed a two-step estimator that is less sensitive to departures from normality, substantially easier to compute, and therefore more popular than ML. However, the two-step method has its own limitations.
In brief, the two steps are as follows (a sketch of both steps appears after the list):

1. Estimate a probit regression of R, the missing data indicator, on the X variables.

2. For cases that have data present on Y, estimate a least-squares linear regression of Y on X plus a variable that is a transformation of the predicted values from the probit regression.¹⁷
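Here is a compact sketch of the two steps, using statsmodels for the probit and OLS pieces and the inverse Mills function described in Note 17. This is an illustration, not a full Heckman routine: it omits the corrected second-step standard errors, and its array arguments (a y vector with NaNs for missing cases, and design matrices X and W that already contain constant columns) are assumptions:

    import numpy as np
    from scipy.stats import norm
    import statsmodels.api as sm

    def heckman_two_step(y, X, W):
        # Step 1: probit of the response indicator on the selection
        # variables W (W should contain at least one variable excluded
        # from X to avoid weak identification).
        observed = ~np.isnan(y)
        probit = sm.Probit(observed.astype(float), W).fit(disp=0)
        # Inverse Mills ratio evaluated at the probit linear predictor
        z = W @ probit.params
        mills = norm.pdf(z) / norm.cdf(z)
        # Step 2: OLS of y on X plus the Mills term, complete cases only.
        # (Naive OLS standard errors are not valid for this step.)
        X_aug = np.column_stack([X[observed], mills[observed]])
        return sm.OLS(y[observed], X_aug).fit().params

The last element of the returned vector is the coefficient on the Mills term; the preceding elements estimate β.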
Unlike the ML method, the two-step procedure is not feasible if there are no X variables. Furthermore, the parameters are only weakly identified if the X variables are the same in the probit and linear regressions. To get reasonably stable estimates, it is essential that there be X variables in the probit regression that are excluded from the linear regression. Of course, it is rare that such exclusion restrictions can be persuasively justified. Even when all these conditions are met, the two-step estimator may perform poorly in many realistic situations (Stolzenberg & Relles, 1990, 1997).

Given the apparent sensitivity of these sample selection methods to violations of assumptions, how should we proceed to do a sensitivity analysis? For the ML estimator, the key assumption is the normality of the dependent variable Y. So a natural strategy would be to fit models that assume different distributions. Skewed distributions like the Weibull or gamma probably would be most useful, because it is the symmetry of the normal distribution that is most crucial for ML. Estimation should be possible for alternative distributions, although the integral in Equation 7.4 may not have a convenient form and may require numerical integration. For the two-step estimator, the key assumption is the exclusion of certain X variables from the linear regression that predicts Y. A sensitivity analysis might explore the consequences of choosing different sets of X variables for the two equations.
ML Estimation With Pattern-Mixture Models
Pattern-mixture models are notoriously underidentified. Suppose we have two variables X and Y, with four observed patterns of missingness:

1. Both X and Y observed.
2. X observed, Y missing.
3. Y observed, X missing.
4. Both X and Y missing.

Let R = 1, 2, 3, or 4, depending on which of these patterns is observed. A pattern-mixture model for these data has the general form

f(X, Y, R) = f(Y, X | R) Pr(R).

To make the model more specific, we might suppose that Pr(R) is given by the set of values p₁, p₂, p₃, and p₄. Then we might assume that f(Y, X | R) is a bivariate normal distribution with the usual parameters: μₓ, μᵧ, σₓ, σᵧ, σₓᵧ. However, we allow each of these parameters to be different for each value of R. The problem is that when X is observed but Y is not, there is no information to estimate the mean and standard deviation of Y or the covariance of X and Y. Similarly, when Y is observed but X is not, there is no information to estimate the mean and standard deviation of X or the covariance of X and Y. If both variables are missing, we have no information at all.

To make any headway, we must impose some restrictions on the four sets of parameters. Let θ(i) be the set of parameters for pattern i. A simple but very restrictive condition is to assume that θ(1) = θ(2) = θ(3) = θ(4), which is equivalent to MCAR. In that case, ML estimation of the pattern-mixture model is identical to that discussed in Chapter 3 for the normal model with ignorable missing data. Little (1993, 1994) proposed other classes of restrictions that do not correspond to ignorable missing data, but yield identified models. Here is one example. Let θy|x(i) represent the conditional distribution of Y given X for pattern i. What Little calls complete-case missing-variable restrictions are given by

θy|x(2) = θy|x(1),   θx|y(3) = θx|y(1),   θ(4) = θ(1).
For the two patterns with one variable missing, the conditional distribution of the missing variable given the observed variable is equated to the corresponding distribution for the complete-case pattern. For the pattern with both variables missing, all parameters are assumed equal to those in the complete-case pattern. This model is identified, and the ML estimates can be found by noniterative means. Once all these estimates are obtained, they can be easily combined to get estimates of the marginal distribution of X and Y.
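For the bivariate case, the noniterative computation can be sketched directly. The function below is an illustration under the stated restrictions, not Little's own code: it borrows the complete-case regressions to fill in each pattern's missing means and then averages over all cases, which weights each pattern by its sample proportion:

    import numpy as np

    def ccmv_means(x, y):
        # Marginal means of X and Y under complete-case
        # missing-variable restrictions, for arrays with NaNs.
        obs_x, obs_y = ~np.isnan(x), ~np.isnan(y)
        cc = obs_x & obs_y                        # pattern 1: complete cases
        # Complete-case regressions supply the borrowed conditional means
        by, ay = np.polyfit(x[cc], y[cc], 1)      # E(Y|X) = ay + by*X
        bx, ax = np.polyfit(y[cc], x[cc], 1)      # E(X|Y) = ax + bx*Y
        ex, ey = np.empty(len(x)), np.empty(len(y))
        ex[obs_x] = x[obs_x]
        ey[obs_y] = y[obs_y]
        ey[obs_x & ~obs_y] = ay + by * x[obs_x & ~obs_y]   # pattern 2
        ex[~obs_x & obs_y] = ax + bx * y[~obs_x & obs_y]   # pattern 3
        both = ~obs_x & ~obs_y                             # pattern 4
        ex[both], ey[both] = x[cc].mean(), y[cc].mean()
        return ex.mean(), ey.mean()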
Multiple Imputation With Pattern-Mixture Models

ML estimation of pattern-mixture models is still rather esoteric at this point in time. Much more practical and useful is the combination of pattern-mixture models with multiple imputation (Rubin, 1987). The simplest strategy is to generate imputations under an ignorable model first and then modify the imputed values using, say, a linear transformation. A sensitivity analysis is then easily obtained by repeating the process with different constants in the linear transformation.

Here is a simple example. Suppose that we again have two variables X and Y, but only two missing data patterns: (1) complete case and (2) missing Y. We assume that within each pattern, X and Y have a bivariate normal distribution. We also believe that cases with missing Y tend to be those with higher values of Y, so we assume that all parameters for the two patterns are the same except that

μᵧ(2) = cμᵧ(1),

where c is some constant greater than 1. Multiple imputation then amounts to generating imputed values for Y under an ignorable missing data mechanism and then multiplying all the imputed values by c. Of course, to make this work, we have to choose a value for c, and that choice may be rather arbitrary.¹⁸ A sensitivity analysis consists of reimputing the data and reestimating the model for several different values of c.
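In code, the whole sensitivity analysis is just a loop over candidate values of c. The sketch below assumes a user-supplied function impute_ignorable() that returns one fresh set of random draws for the missing Ys under the ignorable model; everything else is bookkeeping:

    import numpy as np

    def pattern_mixture_sensitivity(y, miss, impute_ignorable,
                                    c_values=(1.0, 1.1, 1.2, 1.5)):
        # Return one completed Y vector per scaling constant c. Only the
        # imputed values are multiplied by c; each completed vector would
        # then feed the substantive model, and results compared across c.
        completed = {}
        for c in c_values:
            y_c = np.asarray(y, dtype=float).copy()
            y_c[miss] = c * impute_ignorable()   # fresh draws for each c
            completed[c] = y_c
        return completed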
Now let us turn this into a real example. For the college data, there were 98 colleges with missing data on the dependent variable GRADRAT. It is plausible to suppose that those colleges that did not report graduation rates had lower graduation rates than those that did report their rates. This speculation is supported by the fact that the mean of the imputed graduation rates for the multiply imputed data described in Chapter 5 is about 10 percentage points lower than the mean graduation rate for the colleges without missing data. However, this difference is entirely due to differences on the predictor variables and does not constitute evidence that the data are not missing at random.
Suppose, however, that the difference in graduation rates between missing and nonmissing cases is even greater. Specifically, modify the imputed graduation rates so that they are a specified percentage of what would otherwise be imputed under the ignorability assumption. Table 7.1 displays results for imputations that are 100%, 90%, 80%, 70%, and 60% of the original values. For each regression, entirely new imputations were generated. Thus, some of the variation across columns is due to the randomness of the imputation process. In general, the coefficients are quite stable, suggesting that departures from ignorability would not have much impact on the conclusions. The STUFAC coefficient varies the most, but it is far from statistically significant in all cases.

TABLE 7.1
Regression of Graduation Rates on Several Variables, Under Different Pattern-Mixture Models

Variable      100%      90%       80%       70%       60%
CSAT           0.067     0.069     0.071     0.072     0.071
LENROLL        2.039     2.062     2.077     2.398     2.641
PRIVATE       12.716    12.542    11.908    12.675    12.522
STUFAC        -0.217    -0.142    -0.116    -0.126    -0.213
RMBRD          2.383     2.264     2.738     2.513     2.464
8. SUMMARY AND CONCLUSION

Among conventional methods for handling missing data, listwise deletion is the least problematic. Although listwise deletion may discard a substantial fraction of the data, there is no reason to expect bias unless the data are not missing completely at random. In addition, the standard errors also should be decent estimates of the true standard errors. Furthermore, if you are estimating a linear regression model, listwise deletion is quite robust to situations where there are missing data on an independent variable and the probability of missingness depends on the value of that variable. If you are estimating a logistic regression model, listwise deletion can tolerate either nonrandom missingness on the dependent variable or nonrandom missingness on the independent variables (but not both).

By contrast, all other conventional methods for handling missing data introduce bias into the standard error estimates, and many conventional methods (like dummy variable adjustment) produce biased parameter estimates, even when the data are missing completely at random. So listwise deletion is generally a safer approach.

If the amount of data that must be discarded under listwise deletion is intolerable, then two alternatives to consider are maximum likelihood and multiple imputation. In their standard incarnations, these two methods assume that the data are missing at random, an appreciably weaker assumption than missing completely at random. Under fairly general conditions, these methods produce estimates that are approximately unbiased and efficient. They also produce good estimates of standard errors and test statistics. The downside is that they are more difficult to implement than most conventional methods, and multiple imputation gives you slightly different results every time you do it.

If the goal is to estimate a linear model that falls within the class of models estimated by LISREL and similar packages, then maximum likelihood is probably the preferred method. Currently there are at least four statistical packages that can accomplish this, the best known of which is Amos.

If you want to estimate any kind of nonlinear model, then multiple imputation is the way to go. There are many different ways to do multiple imputation. The most widely used method assumes that the variables in the intended model have a multivariate normal distribution. Imputation is accomplished by a Bayesian technique that involves iterated regression imputation with random draws for both the data values and the parameters. Several software packages are currently available to accomplish this. Other multiple imputation methods that make less restrictive distributional assumptions are currently under development, but they have not yet reached a level of theoretical or computational refinement that would justify widespread use.

It is also possible to do maximum likelihood or multiple imputation under assumptions that the data are not missing at random, but getting good results is tricky. These methods are very sensitive to assumptions made about the missingness mechanism or about the distributions of the variables with missing data. In addition, there is no way to test these assumptions. Hence, the most important requirement is good a priori knowledge of the mechanism for generating the missing data. Any effort to estimate nonignorable models should be accompanied by a sensitivity analysis.
NOTES
1. The proof is straightforward. We want to estimate f(Y|X), the conditional distribution of Y given X, a vector of predictor variables. Let A = 1 if all variables are observed; otherwise, A = 0. Listwise deletion is equivalent to estimating f(Y|X, A = 1). The aim is to show that this function is the same as f(Y|X). From the definition of conditional probability, we have

f(Y|X, A = 1) = f(Y, X, A = 1) / f(X, A = 1) = [Pr(A = 1|Y, X) f(Y|X) f(X)] / [Pr(A = 1|X) f(X)].

Assume that Pr(A = 1|Y, X) = Pr(A = 1|X), that is, that the probability of data present on all variables does not depend on Y, but may depend on any variables in X. It immediately follows that f(Y|X, A = 1) = f(Y|X). Note that this result applies to any regression procedure, not just linear regression.
2. Even if the probability of missing data depends on both X and Y, there are some situations when listwise deletion is unproblematic. Let p(Y, X) be the probability of missing data on one or more of the variables in the regression model, as a function of the dichotomous dependent variable Y and a vector of independent variables X. If that probability can be factored as p(Y, X) = f(Y)g(X), then logistic regression slope coefficients using listwise deletion are consistent estimates of the true coefficients (Glynn, 1985).

3. Glasser (1964) derived formulas that are reasonably easy to implement, but valid only when the independent variables and missing data pattern are "fixed" from sample to sample, an unlikely condition for realistic applications. The formulas of Van Praag, Dijkstra, and Van Velzen (1985) are more generally applicable, but require information beyond that given in the covariance matrix: higher-order moments and the number of available cases for all sets of four variables.
4. Although the dummy variable adjustment method is clearly unacceptable when data are truly missing, it may still be appropriate in cases where the unobserved value simply does not exist. For example, married respondents may be asked to rate the quality of their marriage, but that question has no meaning for unmarried respondents. Suppose we assume that there is one linear equation for married couples and another equation for unmarried couples. The married equation is identical to the unmarried equation except that it has (a) a term that corresponds to the effect of marital quality on the dependent variable and (b) a different intercept. It is easy to show that the dummy variable adjustment method produces optimal estimates in this situation.
5. Schafer and Schenker (2000) proposed a method for getting consistent estimates of the standard errors when conditional mean imputation is used. They claim that under appropriate conditions their method can yield more precise estimates with less computational effort than multiple imputation.
6. Maximum likelihood under the multivariate normal assumption produces consistent estimates of the means and the covariance matrix for any multivariate distribution with finite fourth moments (Little & Smith, 1987).

7. This variable is the sum of two variables in the original data set, "room costs" and "board costs."

8. For the two-step method, it is also possible to get standard error estimates using "sandwich" formulas described by Browne and Arminger (1995).
9. The standard errors were estimated under the assumption of bivariate normality, which is appropriate for this example because the data were drawn from a bivariate normal distribution. The formula is

S.E.(r) = (1 − r²)/√n.

Although the sample correlation coefficient is not normally distributed, the large sample size in this case should ensure a close approximation to normality. Thus, these standard errors could be appropriately used to construct confidence intervals.
10. For data augmentation, the standard noninformative prior (Schafer, 1997), known as the Jeffreys prior, is written as |Σ|^−(p+1)/2, where Σ is the covariance matrix and p is the number of variables.
11. One way to get such an overdispersed distribution is to use a bootstrap method. For example, take five different random samples with replacement from the original data set, and compute the EM estimate in each of these samples. The EM estimates then could be used as starting values in each of five parallel chains.
12. An easy way to do this is to divide the (0, 1) interval into five subintervals with lengths proportional to the probabilities for each category of marital status. Draw a random number from a uniform distribution on the unit interval. Assign the marital status category that corresponds to the subinterval in which the random number falls.

13. Actually, the degrees of freedom is equal to the rank of L, which is usually, but not always, r.
14. The nine regions were classified as follows: East (New England, Middle Atlantic), 566 cases; Central (East North Central, West North Central), 715 cases; South (South Atlantic, East South Central, West South Central), 1,095 cases; West (Mountain, Pacific), 616 cases.

15. The regression analysis was performed with the GLM procedure in SAS, using the ABSORB statement to handle the fixed effects.
16. Ten data sets were produced, with 30 iterations per data set. After imputation, the imputed values were recoded wherever necessary to preserve the permissible values of the original variables.
17. Specifically, the additional variable is λ(axᵢ), where a is the row vector of estimated coefficients from the probit model. The function λ(z), the inverse Mills function, is defined as φ(z)/Φ(z), where φ(z) is the density function and Φ(z) is the cumulative distribution function, both for a standard normal variable.
18. In the models considered by Rubin (1987), the conditional prior distribution of the parameters is specified for the missing data pattern given the parameters in the complete-data pattern. In the example given here, I have merely assumed that the conditional mean for the missing cases is some multiple of the mean for the complete cases. Additionally, the conditional variance for the missing cases can be allowed to be larger than that for the complete cases to compensate for greater uncertainty.
REFERENCES

AGRESTI, A., and FINLAY, B. (1997). Statistical methods for the social sciences. Englewood Cliffs, NJ: Prentice-Hall.

ALLISON, P. D. (1987). Estimation of linear models with incomplete data. In C. Clogg (Ed.), Sociological methodology 1987 (pp. 71-103). Washington, DC: American Sociological Association.

ALLISON, P. D. (2000). Multiple imputation for missing data. Sociological Methods & Research, 28, 301-309.

BARNARD, J., and RUBIN, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.

BEALE, E. M. L., and LITTLE, R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society, Series B, 37, 129-145.

BRAND, J. P. L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Ph.D. thesis, Erasmus University Rotterdam (ISBN 90-74479-08-1).

BROWNE, M. W., and ARMINGER, G. (1995). Specification and estimation of mean and covariance structure models. In G. Arminger, C. C. Clogg, and M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 185-249). New York: Plenum.
COHEN, J., and COHEN, P. (1985). Applied multiple regression and correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

DAVIS, J. A., and SMITH, T. W. (1997). General social surveys, 1972-1996. Chicago, IL: National Opinion Research Center (producer); Ann Arbor, MI: Interuniversity Consortium for Political and Social Research (distributor).

DEMPSTER, A. P., LAIRD, N. M., and RUBIN, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

FUCHS, C. (1982). Maximum likelihood estimation and model selection in contingency tables with missing data. Journal of the American Statistical Association, 77, 270-278.

GLASSER, M. (1964). Linear regression analysis with missing observations among the independent variables. Journal of the American Statistical Association, 59, 834-844.

GLYNN, R. (1985). Regression estimates when nonresponse depends on the outcome variable. Unpublished doctoral dissertation, School of Public Health, Harvard University.

GOURIEROUX, C., and MONFORT, A. (1981). On the problem of missing data in linear models. Review of Economic Studies, 48, 579-586.

GREENE, W. H. (2000). Econometric analysis (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.

HAITOVSKY, Y. (1968). Missing data in regression analysis. Journal of the Royal Statistical Society, Series B, 30, 67-82.

HECKMAN, J. J. (1976). The common structure of statistical models of truncated, sample selection and limited dependent variables, and a simple estimator of such models. Annals of Economic and Social Measurement, 5, 475-492.

IVERSEN, G. (1985). Bayesian statistical inference. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-43. Thousand Oaks, CA: Sage.

JONES, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91, 222-230.

KIM, J.-O., and CURRY, J. (1977). The treatment of missing data in multivariate analysis. Sociological Methods & Research, 6, 215-240.

KING, G., HONAKER, J., JOSEPH, A., SCHEVE, K., and SINGH, N. (1999). AMELIA: A program for missing data. Unpublished program manual. Available at http://gking.harvard.edu/stats.shtml.

KING, G., HONAKER, J., JOSEPH, A., and SCHEVE, K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95, 49-69. Available at http://gking.harvard.edu/stats.shtml.
LANDERMAN, L. R., LAND, K. C., and PIEPER, C. F. (1997). An empirical evaluation of the predictive mean matching method for imputing missing values. Sociological Methods & Research, 26, 3-33.

LI, K. H., MENG, X. L., RAGHUNATHAN, T. E., and RUBIN, D. B. (1991). Significance levels from repeated p-values with multiply imputed data. Statistica Sinica, 1, 65-92.

LITTLE, R. J. A. (1988). Missing data in large surveys (with discussion). Journal of Business and Economic Statistics, 6, 287-301.

LITTLE, R. J. A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87, 1227-1237.

LITTLE, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88, 125-134.

LITTLE, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika, 81, 471-483.

LITTLE, R. J. A., and RUBIN, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

LITTLE, R. J. A., and SMITH, P. J. (1987). Editing and imputation for quantitative survey data. Journal of the American Statistical Association, 82, 58-68.

MARINI, M. M., OLSEN, A. R., and RUBIN, D. (1979). Maximum likelihood estimation in panel studies with missing data. In K. F. Schuessler (Ed.), Sociological methodology 1980 (pp. 314-357). San Francisco: Jossey-Bass.

McCULLAGH, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42, 109-142.

McLACHLAN, G. J., and KRISHNAN, T. (1997). The EM algorithm and extensions. New York: Wiley.

MOSSEY, J. M., KNOTT, K., and CRAIK, R. (1990). Effects of persistent depressive symptoms on hip fracture recovery. Journal of Gerontology: Medical Sciences, 45, M163-168.

MUTHEN, B., KAPLAN, K., and HOLLIS, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431-462.

RAGHUNATHAN, T. E., LEPKOWSKI, J. M., VAN HOEWYK, J., and SOLENBERGER, P. (1999). A multivariate technique for multiply imputing missing values using a sequence of regression models. Unpublished manuscript. Contact teraghu@umich.edu.
ROBINS, J. M., and WANG, N. (2000). Inference for imputation estimators. Biometrika, 87, 113-124.

RUBIN, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.

RUBIN, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

RUBIN, D. B., and SCHENKER, N. (1991). Multiple imputation in health-care databases: An overview and some applications. Statistics in Medicine, 10, 585-598.

SCHAFER, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.

SCHAFER, J. L., and SCHENKER, N. (2000). Inference with imputed conditional means. Journal of the American Statistical Association, 95, 144-154.

SCHENKER, N., and TAYLOR, J. M. G. (1996). Partially parametric techniques for multiple imputation. Computational Statistics and Data Analysis, 22, 425-446.

STOLZENBERG, R. M., and RELLES, D. A. (1990). Theory testing in a world of constrained research design: The significance of Heckman's censored sampling bias correction for nonexperimental research. Sociological Methods & Research, 18, 395-415.

STOLZENBERG, R. M., and RELLES, D. A. (1997). Tools for intuition about sample selection bias and its correction. American Sociological Review, 62, 494-507.

VACH, W. (1994). Logistic regression with missing values in the covariates. New York: Springer-Verlag.

VACH, W., and BLETTNER, M. (1991). Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. American Journal of Epidemiology, 134, 895-907.

VAN BUUREN, S., and OUDSHOORN, K. (1999). Flexible multiple imputation by MICE. Report TNO-PG 99.054, TNO Prevention and Health, Leiden. Available at http://www.multiple-imputation.com/.

VAN PRAAG, B. M. S., DIJKSTRA, T. K., and VAN VELZEN, J. (1985). Least-squares theory based on general distributional assumptions with an application to the incomplete observations problem. Psychometrika, 50, 25-36.

WANG, N., and ROBINS, J. M. (1998). Large sample inference in parametric multiple imputation. Biometrika, 85, 935-948.

WINSHIP, C., and RADBILL, L. (1994). Sampling weights and regression analysis. Sociological Methods & Research, 23, 230-257.