ADVANCES IN ECONOMETRICS
Series Editors: Tom Fomby and R. Carter Hill
Recent Volumes:
Volume 15: Nonstationary Panels, Panel Cointegration, and Dynamic Panels, Edited by Badi Baltagi
Volume 16: Econometric Models in Marketing, Edited by P. H. Franses and A. L. Montgomery
Volume 17: Maximum Likelihood of Misspecified Models: Twenty Years Later, Edited by Thomas B. Fomby and R. Carter Hill
Volume 18: Spatial and Spatiotemporal Econometrics, Edited by J. P. LeSage and R. Kelley Pace
Volume 19: Applications of Artificial Intelligence in Finance and Economics, Edited by J. M. Binner, G. Kendall and S. H. Chen
Volume 20A: Econometric Analysis of Financial and Economic Time Series, Edited by Dek Terrell and Thomas B. Fomby
Volume 20B: Econometric Analysis of Financial and Economic Time Series, Edited by Thomas B. Fomby and Dek Terrell
ADVANCES IN ECONOMETRICS
VOLUME 21
MODELLING AND EVALUATING TREATMENT EFFECTS IN ECONOMETRICS EDITED BY
DANIEL L. MILLIMET Southern Methodist University, Dallas, TX, USA
JEFFREY A. SMITH University of Michigan, Ann Arbor, MI, USA
EDWARD J. VYTLACIL Yale University, New Haven, CT, USA
Amsterdam – Boston – Heidelberg – London – New York – Oxford – Paris – San Diego – San Francisco – Singapore – Sydney – Tokyo
JAI Press is an imprint of Elsevier
Linacre House, Jordan Hill, Oxford OX2 8DP, UK
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
First edition 2008
Copyright © 2008 Elsevier Ltd. All rights reserved
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email:
[email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://www.elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material Notice No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-7623-1380-8 ISSN: 0731-9053 (Series) For information on all JAI Press publications visit our website at books.elsevier.com Printed and bound in the United Kingdom 08 09 10 11 12 10 9 8 7 6 5 4 3 2 1
LIST OF CONTRIBUTORS
Jaap H. Abbring
Department of Economics, VU University and Tinbergen Institute, Amsterdam, The Netherlands
Marco Caliendo
Institute for the Study of Labor (IZA), Bonn, Germany
Xavier de Luna
Department of Statistics, Umeå University, Umeå, Sweden
Rajeev Dehejia
Department of Economics, Tufts University, Medford, MA, USA; Royal Holloway, University of London, London; National Bureau of Economic Research, Cambridge, MA, USA
Arthur S. Goldberger
University of Wisconsin, Madison, WI, USA
Daniel J. Henderson
Department of Economics, State University of New York, Binghamton, NY, USA
Reinhard Hujer
Johann Wolfgang Goethe University, Frankfurt/Main, Germany
Per Johansson
Department of Economics, Uppsala University, Uppsala, Sweden
Michael Lechner
University of St. Gallen, Swiss Institute for International Economics and Applied Economic Research (SIAW), St. Gallen, Switzerland
Mingliang Li
Department of Economics, State University of New York, Buffalo, NY, USA
Terra McKinnish
Department of Economics, University of Colorado, Boulder, CO, USA
Fabrizia Mealli
Department of Statistics, University of Florence, Italy
Daniel L. Millimet
Department of Economics, Southern Methodist University, Dallas, TX, USA
Jakob Roland Munch
Department of Economics, University of Copenhagen, Denmark
Christopher F. Parmeter
Department of Agricultural and Applied Economics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
Donald B. Rubin
Department of Statistics, Harvard University, Cambridge, MA, USA
Marianne Simonsen
Department of Economics, University of Aarhus, Denmark
Lars Skipper
AKF, Copenhagen, Denmark
Stephan L. Thomsen
Zentrum für Europäische Wirtschaftsforschung, Mannheim, Germany
Justin L. Tobias
Department of Economics, Iowa State University, Ames, IA, USA
Le Wang
Minnesota Population Center, University of Minnesota, Minneapolis, MN, USA
Jeffrey M. Wooldridge
Department of Economics, Michigan State University, East Lansing, MI, USA
Junni L. Zhang
Guanghua School of Management, Peking University, P. R. China
INTRODUCTION The estimation of the effects of treatments – endogenous variables representing everything from child participation in a pre-kindergarten program to adult participation in a job-training program to national participation in a free trade agreement – has occupied much of the theoretical and applied econometric research literatures in recent years. This volume brings together a diverse collection of papers on this important topic by leaders in the field from around the world. This collection draws attention to several key facets of the recent evolution in this literature. First, the recent literature rejects the (typically quite implausible) notion of a single treatment effect that applies to all subjects potentially affected by a treatment in favor of explicit consideration of heterogeneous treatment effects. Making this leap raises many new issues, both conceptual and econometric. On the conceptual side, instead of a single parameter of interest there are now many parameters, including averages over subgroups defined by exogenous covariates as well as averages defined by behavioral responses to instrumental variables. Describing and motivating the average treatment effect estimated by a particular empirical exercise has now become an important part of any empirical study. Second, while concern over the underlying assumptions required for identification of the parameter of interest is not new, the recent literature places greater emphasis on the use of economic theory and context-specific institutional knowledge to guide the choice of an econometric model. Third, reductions in the cost of computation combined with larger data sets have helped to spur developments in the econometric theory of semiparametric and nonparametric estimation. Application of these methods has allowed researchers to relax the assumptions implicit in the linear, parametric models that have heretofore dominated empirical work. 
Cheap computing power has also allowed the exploration of computationally intensive Bayesian methods within the treatment effects literature and more broadly within applied economics and statistics. Fourth, and finally, recent years have witnessed a somewhat belated recognition that related literatures in several disciplines (including economics, statistics, political science, sociology, epidemiology, and biostatistics), which address similar issues of causal inference using
observational data, might actually benefit from interacting. Though many of the potential gains from trade remain unrealized, the treatment effects literature has become more interdisciplinary, allowing both theorists and applied researchers to draw on a larger body of knowledge and to reach a wider audience with their work. To begin, we have taken advantage of the opportunity presented by this volume to publish one of the classic unpublished papers in the treatment effects literature. Authored by Arthur S. Goldberger, and originally released in 1972 as a University of Wisconsin working paper, ‘‘Selection Bias in Evaluating Treatment Effects: Some Formal Illustrations’’ remains widely cited in the literature as a working paper. The paper has enjoyed renewed attention of late due to recent interest among economists in regression discontinuity designs. The paper makes the point within the context of a ‘‘sharp’’ discontinuity design that balance in the covariates between the treated and untreated units does not represent a requirement for unbiased estimates. We left the paper as it was and asked Professor Goldberger to provide a short postscript. In the postscript, he gives his motivation for writing the paper along with some thoughts about how, 35 years on, he sees it fitting into the literature. Several of the papers in the volume analyze and extend econometric methods for the evaluation of treatment effects that have garnered relatively little attention in the literature to date. In one such paper, Jaap H. Abbring presents ‘‘The Event-History Approach to Program Evaluation.’’ Specifically, the author details a mixed semi-Markov event-history model, discusses its application to program evaluation, and analyzes its empirical content. The results offer important insights into what one can learn from longitudinal data about, for example, the effects of training programs for the unemployed on their unemployment durations and subsequent job stability. 
Abbring offers the researcher guidance on model specification, as well as methods for the empirical analysis of such effects. The paper by Mingliang Li and Justin L. Tobias, ‘‘Bayesian Analysis of Treatment Effects in an Ordered Potential Outcomes Model,’’ considers a binary treatment effect when the outcome of interest is an ordered, discrete variable within a Bayesian framework. Their model is applicable to situations such as the effect of an educational intervention on completed years of schooling. Efficient calculation of features of the posterior distribution represents a major computational challenge on this and other Bayesian models. The authors address this challenge by applying a particular simulation method to a reparameterized version of the model. Another interesting issue in Bayesian analysis of treatment effects concerns
the handling of the unidentified cross-regime correlation coefficient (i.e., the correlation between potential outcomes in the treated and nontreated states). Li and Tobias utilize a method that does not restrict the cross-regime correlation, but instead uses identified model parameters to bind the cross-regime correlation. Junni L. Zhang, Donald B. Rubin, and Fabrizia Mealli also adopt a Bayesian approach in their paper entitled ‘‘Evaluating the Effects of Job-Training Programs on Wages through Principal Stratification.’’ Specifically, the authors address the long-standing topic of estimating effects on wages for programs that may affect both employment and wages. Even data generated from an experiment that randomly assigns the treatment only provide an unbiased estimate of the program impact on earnings, not wages. The problem arises because nonworkers have an observed value for earnings, namely zero, but no observed value for the wage. In the presence of nonrandom selection into employment that differs between the treatment and control groups, as one would expect it to in most contexts, a comparison of the mean wages of the employed treated units and the employed control units does not estimate a meaningful parameter. The authors illuminate the problem within the framework of principal stratification. Within this framework, there are four groups, or strata, that correspond to the possible combinations of employed if treated and employed if not treated. The authors argue that it makes sense to estimate wage effects only for those employed in both the treated and untreated states. Their analysis begins with nonparametric bounds on the estimated wage effects for the always employed stratum both with and without various additional assumptions such as monotonicity. They then lay out a parametric model of stratification and outcomes conditional on the stratification within a Bayesian framework. 
They also provide a thorough discussion of the computational issues associated with their parametric model, which they illustrate with an application to simulated data. The majority of both applied and theoretical research on program evaluation focuses on the average effect of the program for a particular population or subpopulation. Policy conclusions stemming from the effect of a program on average outcomes implicitly assume a particular social welfare function. Given this limitation, examinations of the impact of a program on the distribution of the outcome of interest are becoming increasingly popular. In the paper, ‘‘When is ATE Enough? Risk Aversion and Inequality Aversion in Evaluating Training Programs,’’ Rajeev Dehejia formalizes a decision framework for program evaluation using posterior predictive distributions for the outcome of interest. Relying on stochastic
dominance tests of predictive distributions, programs are rankable using a more explicit welfare foundation. Furthermore, under more restrictive parametric and functional form assumptions, the paper develops intuitive mean-variance tests for program evaluation that are consistent with the underlying decision problem. These concepts are applied to the GAIN and JTPA data sets. Michael Lechner’s paper entitled ‘‘Matching Estimation of Dynamic Treatment Models: Some Practical Issues,’’ along with his related work, addresses an important, but heretofore largely neglected, dimension in the increasingly sophisticated literature on the evaluation of active labor market programs. This dimension concerns the additional evaluation questions, and related econometric issues, that arise when taking explicit consideration of the fact that many programs provide a conditional sequence of treatments. For example, all participants might receive some form of job search assistance prior to receiving any other, more intensive services, as under the current U.S. Workforce Investment Act programs. Participants who find a job do not receive any additional services, while those who do not find a job go on to some other service such as classroom training or subsidized on-the-job training at private firms. In such a program, in which later service assignment (if any) depends on intermediate outcomes following initial service assignment, the number of policy-relevant treatment effect parameters increases substantially, and the assumptions required for identification within the context of a selection-on-observables framework require careful delineation. Lechner’s contribution in this volume outlines the basic substantive and econometric issues and then discusses some practical issues relevant to empirical applications. Recent years have witnessed major conceptual developments in the literature on instrumental variables estimation of treatment effects.
These developments reflect increased attention to the heterogeneous nature of treatment effects, which may reflect both unmeasured heterogeneity in treatment or heterogeneous responses to identical treatments. As is now well known, standard economic models suggest that many widely used instruments in economics, such as instruments based on the costs of receiving treatment, are likely to be correlated with the treatment effect conditional on treatment status, even if they are valid instruments in a common treatment effect world due to being uncorrelated with the unobserved component of outcomes in the untreated state. For example, in the context of a training program, we would expect that those who travel farther to the training center have, on average, higher impacts that compensate them for the higher costs they incur. At the same
time, recent years have also witnessed important developments in the econometrics of semiparametric and nonparametric instrumental variable estimation. The next group of papers in the volume builds on these themes of instrumental variable estimation in the presence of heterogeneous treatment effects and its extension to semiparametric and nonparametric methods. However, before turning to instrumental variables methods, applied researchers using nonexperimental data first must confront the question of whether a treatment is endogenously determined. The usual statistical tests for endogeneity are limited to the extent that they require a valid instrumental variable and they offer no information with respect to the sign of the endogeneity bias. The paper by Xavier de Luna and Per Johansson, ‘‘Graphical Diagnostics of Endogeneity,’’ shows that the endogeneity of a variable may be successfully detected by graphically examining the cumulative sum of the recursive residuals from appropriately sorted cross-sectional data. Moreover, the sign of the bias implied by the endogeneity may be deducible as well through such graphical techniques. Finally, whereas a valid instrumental variable is needed to implement the graphical test in general, a graphical test for misspecification due to endogeneity (e.g., self-selection) is derived without the requirement of a valid instrumental variable when a continuous or ordered (e.g., years of schooling) regressor is suspected to be endogenous. Once confronted with endogeneity of a treatment, the researcher has several options for how to proceed. Jeffrey M. Wooldridge’s paper, ‘‘Instrumental Variables Estimation of the Average Treatment Effect in the Correlated Random Coefficient Model,’’ focuses on estimation of a linear, random coefficient model when the random coefficients are correlated with the regressors.
Such models allow for flexible treatment effect heterogeneity and for agents to select treatment based in part on their own idiosyncratic effect. Previous methods have used instrumental variable methods or a control function approach to estimate the models, and often impose conditions appropriate for the case of continuous regressors. Wooldridge considers the estimation of such models by focusing on the case when the treatment variable has some discrete aspect (e.g., the treatment variable is determined by a Tobit model). Instead of following either an instrumental variable approach or a control function approach, Wooldridge proposes a new, hybrid approach. In this approach, one augments the outcome equation by a ‘‘correction function,’’ and then applies an instrumental variables estimator to the augmented outcome equation. The approach requires different assumptions than instrumental
variables or control function approaches, and works in some cases in which alternative estimators would not be consistent. The paper entitled ‘‘Fertility and the Health of Children: A Nonparametric Investigation’’ by Daniel J. Henderson, Daniel L. Millimet, Christopher F. Parmeter, and Le Wang also builds on recent work in the area of instrumental variables estimation of treatment effects by laying out a nonparametric instrumental variables estimator that allows heterogeneous treatment effects. The authors compare recently developed nonparametric instrumental variables treatment effect estimators to the traditional Wald estimator as well as two-stage least squares in the context of an analysis of the causal effect of the number of children in a family on child health using data from Indonesia. They find that in their context the flexibility offered by the nonparametric approaches makes a difference to their conclusions both substantively and statistically. Alternative, commonly used approaches to account for unobservables that may be correlated with both the treatment and outcome when the researcher has access to longitudinal data include fixed effects estimators (such as the within estimator and the dummy variable estimator) and the first differences estimator. The paper by Terra McKinnish, ‘‘Panel Data Models and Transitory Fluctuations in the Explanatory Variable,’’ demonstrates that both fixed effects and first differences estimators may understate the effect of interest because of the variation used to identify the model. Specifically, because within-unit time-series variation often reflects transitory fluctuations that have little effect on behavioral outcomes, the data in effect suffer from measurement error. 
Using two empirical examples – assessing the relationship between AFDC and fertility and the relationship between local economic conditions and AFDC expenditures – McKinnish demonstrates the attenuation bias that may arise using such methods, and shows how comparing estimates from fixed effects, first differences, and long differences estimators can serve as a specification test. The papers in the final group included in this volume apply the methods developed in the papers already described, or elsewhere in the recent literature, to interesting and timely questions. The paper by Marianne Simonsen and Lars Skipper entitled ‘‘An Empirical Assessment of Effects of Parenthood on Wages,’’ addresses a long-standing issue in labor economics: what is the wage penalty associated with becoming a parent, and how does the effect of parenthood influence the gender wage gap? The authors separately characterize selection into parenthood for men and women and estimate the effects of motherhood and fatherhood on wages using propensity score matching. The authors exploit an extensive, high-quality
register-based data set augmented with family background information, and are extremely detailed in their consideration of the institutional context and the proper implementation of a kernel-based matching estimator over the region of common support. The authors also apply a recently developed method to gauge the sensitivity of their estimates to any selection on unobservables. Simonsen and Skipper find that mothers receive 7.4% lower average wages compared to nonmothers, whereas fathers gain 6.0% in terms of average wages from fatherhood. Jakob Roland Munch and Lars Skipper’s paper, ‘‘Program Participation, Labor Force Dynamics, and Accepted Wage Rates,’’ builds on a growing literature assessing the economic impacts of active labor market programs. Munch and Skipper assess the effects of job-training programs on the duration of employment and unemployment spells of participants as well as on their wage histories. Using knowledge of the institutional context, the authors exploit randomness in the timing of the initiation of the training spells. Munch and Skipper find that participation in most of these training programs produces an initial locking-in effect, and even a lower transition rate from unemployment to employment upon completion for some participants. Most programs, therefore, increase the expected duration of unemployment spells. However, the authors find that the training undertaken while unemployed successfully increases the expected duration of subsequent spells of employment for many individuals. These longer spells of employment have a price, though, as they come with a lower accepted hourly wage rate. Lastly, in the ‘‘The Employment Effects of Job Creation Schemes in Germany: A Microeconometric Evaluation,’’ Marco Caliendo, Reinhard Hujer and Stephan L. 
Thomsen apply matching methods to the rich administrative data available in Germany to estimate the impact of job creation schemes, an important component of overall active labor market policy in that country. The paper is notable for the thoughtful motivation of the conditional independence assumption and for the care taken with the empirical work. Unlike many matching papers, the authors do not simply invoke selection on observables without justifying it, nor do they justify it simply by counting the number of available conditioning variables in their data. Instead, they make the case that their data contain the variables required to make the conditional independence assumption that underlies matching plausible, given what we know about the economics of selection into active labor market programs and given their institutional context. In applying matching, the authors conduct numerous sensitivity analyses and carefully examine the common support issue. They also examine the
sensitivity of their results to any remaining selection on unobservables using methods from the recent literature. In short, they provide a model of how to use matching methods thoughtfully and carefully in applied econometric research. Their paper also has important policy implications, as they find that job creation schemes do little to enhance the employment chances of those who participate in them. In conclusion, we would like to express our sincere appreciation to the series editors – Tom Fomby at Southern Methodist University and R. Carter Hill at Louisiana State University – for giving us the opportunity to put together this volume. In addition, we thank all of the authors for their contributions to the volume. Without their expertise, hard work, and patience with us, this volume would not be what it is. Not only did we enjoy the experience of working with the authors, but we have also learned much through our interactions. Daniel L. Millimet Jeffrey A. Smith Edward J. Vytlacil Editors
SELECTION BIAS IN EVALUATING TREATMENT EFFECTS: SOME FORMAL ILLUSTRATIONS
Arthur S. Goldberger
ABSTRACT
Regression analyses of compensatory educational programs have been criticized on the grounds that the pupils were not randomly selected. Specifically, it has been argued that a spurious deleterious effect of the treatment will be observed when the selection procedure systematically puts lower-ability students into the treatment group and higher-ability students into the control group. We evaluate this argument via a simple test score model: pretest score and posttest score are fallible measures of underlying true ability and the true treatment effect is zero. Posttest is regressed on pretest and a treatment dummy. The spurious effect arises when selection of subjects for treatment is explicit on the basis of true ability, but not when it is explicit on the basis of pretest score.
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 1–31
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00001-1
1. INTRODUCTION
When subjects are not assigned randomly to treatment and control groups, the possibility arises that a spurious treatment effect may be observed. This possibility has been emphasized in a recent critique of the evaluation of compensatory educational programs. Campbell and Erlebacher (1970) assert:
The compensatory program is made available to the most needy, and the ‘control’ group then sought from among the untreated children in the same community. Often this untreated population is on the average more able than the ‘experimental’ group. In such a situation, the usual procedures of selection, adjustment, and analysis produce systematic biases in the direction of making the compensatory program look deleterious.
But this critique may be misleading. The mere fact that the control group is more able than the treatment group does not suffice to produce bias in the evaluation of the treatment effect. We propose to demonstrate this point in terms of a highly idealized setting, that is, in terms of a formal model.
2. THE BASIC MODEL
We suppose that true ability $x^*$ is normally distributed with expectation zero and variance $Q$:
$$x^* \sim N(0, Q)$$
Further, we suppose that pretest score $x$ and posttest score $y$ are erroneous measures of true ability, more precisely that
$$x = x^* + u, \qquad y = x^* + v$$
where $u$ and $v$ are normally distributed with expectation zero and common variance. The common variance of the measurement errors can be written as $(1-P)/P$ times the variance of $x^*$ for some $0 < P < 1$; thus,
$$u \sim N\!\left(0, \frac{(1-P)Q}{P}\right), \qquad v \sim N\!\left(0, \frac{(1-P)Q}{P}\right)$$
(The motive for parameterizing in terms of $Q$ and $P$ will soon become clear.) Further, we suppose that $x^*$, $u$, and $v$ are independent. Consequently, the test scores have expectations
$$E(x) = E(x^* + u) = E(x^*) + E(u) = 0 + 0 = 0 = E(y)$$
variances
$$V(x) = V(x^* + u) = V(x^*) + V(u) + 2C(x^*, u) = Q + \frac{(1-P)Q}{P} = \frac{Q}{P} = V(y)$$
and covariance
$$C(x, y) = C(x^* + u, x^* + v) = V(x^*) = Q$$
and are joint-normally distributed. In a joint-normal distribution any regression function is linear and the slope(s) are readily calculated from the variances and covariances. Specifically, the regression of $y$ on $x$ is
$$E(y \mid x) = \frac{C(x, y)}{V(x)}\, x = \frac{Q}{Q/P}\, x = Px \tag{1}$$
Note that
$$P = \frac{V(x^*)}{V(x)} = \frac{V(x^*)}{V(x^*) + V(u)} = \frac{V(x^*)}{V(y)}$$
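This is the variance ratio that plays a key role in the subsequent analysis. As a matter of fact, $x^*$, $u$, $v$, $x$, $y$ are joint-normally distributed with zero expectations and variance–covariance matrix (lower triangle shown)
$$
\begin{array}{c|ccccc}
 & x^* & u & v & x & y \\
\hline
x^* & Q & & & & \\
u & 0 & (1-P)Q/P & & & \\
v & 0 & 0 & (1-P)Q/P & & \\
x & Q & (1-P)Q/P & 0 & Q/P & \\
y & Q & 0 & (1-P)Q/P & Q & Q/P
\end{array}
$$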
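The moments listed above can be checked numerically. What follows is a minimal simulation sketch and not part of the original paper: the function names (`simulate_scores`, `cov`), the sample size, the seed, and the illustrative values Q = 1 and P = 0.6 are all assumptions made for demonstration. It draws true ability and the two measurement errors, forms pretest and posttest scores, and compares sample moments with the variance Q/P of the pretest, the covariance Q between pretest and posttest, and the regression slope P.

```python
import random


def simulate_scores(n, Q, P, seed=0):
    """Draw n triples (true ability, pretest, posttest) from the basic model.

    True ability has variance Q; each measurement error has variance
    (1 - P) * Q / P, so P is the share of test-score variance due to ability.
    """
    rng = random.Random(seed)
    sd_ability = Q ** 0.5
    sd_error = ((1.0 - P) * Q / P) ** 0.5
    ability, pretest, posttest = [], [], []
    for _ in range(n):
        x_star = rng.gauss(0.0, sd_ability)
        ability.append(x_star)
        pretest.append(x_star + rng.gauss(0.0, sd_error))
        posttest.append(x_star + rng.gauss(0.0, sd_error))
    return ability, pretest, posttest


def cov(a, b):
    """Sample covariance (divisor n); cov(a, a) is the sample variance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


Q, P = 1.0, 0.6  # illustrative values, not taken from the paper
ability, pretest, posttest = simulate_scores(100_000, Q, P, seed=1)

var_pretest = cov(pretest, pretest)            # should be close to Q / P
cov_pre_post = cov(pretest, posttest)          # should be close to Q
slope_on_pretest = cov_pre_post / var_pretest  # should be close to P
slope_on_ability = cov(ability, posttest) / cov(ability, ability)  # close to 1

print(var_pretest, cov_pre_post, slope_on_pretest, slope_on_ability)
```

With a large sample the printed values should sit near Q/P, Q, P, and 1 respectively; the sketch uses only the standard library so that it can be run without extra dependencies.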
Thus, for example, the regression of posttest on true ability is
$$E(y \mid x^*) = \frac{C(x^*, y)}{V(x^*)}\, x^* = \frac{Q}{Q}\, x^* = 1 \cdot x^*$$
Comparing this with Eq. (1) we see the familiar result that measurement error attenuates slopes – here $P < 1$.
We now suppose that the population is split into two groups – one that receives the treatment and the other that does not. Let $z$ be a binary variable that indicates whether or not an individual receives the treatment:
$$z = \begin{cases} 1 & \text{if received treatment (i.e., selected for experimental group)} \\ 0 & \text{if did not receive treatment (i.e., selected for control group)} \end{cases}$$
Further, we suppose that the true effect of the treatment is nil. In terms of the model, this amounts to saying that
$$C(z, v) = 0 \tag{2}$$
or, in other words, that a multiple regression of $y$ on $x^*$ and $z$ would yield a zero coefficient on $z$. (This particular choice of a baseline is for convenience only and involves no essential loss of generality.) In practice, the multiple linear regression of $y$ on $x$ and $z$
$$E(y \mid x, z) = a_0 + a_1 x + a_2 z \tag{3}$$
will be run to assess the effect of the treatment. Since $x$ is an erroneous measure of $x^*$, this procedure may be biased – $a_2$ may be nonzero. Clearly, what is relevant is the selection procedure – the basis on which individuals were assigned to the treatment and control groups. If the assignment had been random with respect to true ability, so that
$$C(z, x^*) = 0$$
and random with respect to the error component of pretest, so that
$$C(z, u) = 0$$
then no bias would result. For in this situation,
$$C(z, y) = C(z, x^* + v) = C(z, x^*) + C(z, v) = 0$$
$$C(z, x) = C(z, x^* + u) = C(z, x^*) + C(z, u) = 0$$
Thus, the normal equations determining the regression slopes, namely
$$\begin{aligned} V(x)\,a_1 + C(x, z)\,a_2 &= C(x, y) \\ C(x, z)\,a_1 + V(z)\,a_2 &= C(z, y) \end{aligned} \tag{4}$$
would specialize to
$$\begin{aligned} \frac{Q}{P}\,a_1 + 0\,a_2 &= Q \\ 0\,a_1 + V(z)\,a_2 &= 0 \end{aligned}$$
the solution to which is $a_1 = P$, $a_2 = 0$.
As Campbell and Erlebacher indicate, such randomization is unlikely to occur in nonlaboratory situations. Our main objective in this paper is to evaluate the bias – the discrepancy of $a_2$ from zero – in two idealized cases. The two cases are:
Case (i). Selection on basis of true ability. All individuals whose true ability is below the average are assigned to the experimental group; all those whose true ability is above the average are assigned to the control group.
In terms of our model
$$z = \begin{cases} 1 & \text{if } x^* < 0 \\ 0 & \text{if } x^* > 0 \end{cases}$$
Case (ii). Selection on basis of pretest score. All individuals whose pretest score is below the average are assigned to the experimental group; all those whose pretest score is above the average are assigned to the control group. In terms of our model,
$$z = \begin{cases} 1 & \text{if } x < 0 \\ 0 & \text{if } x > 0 \end{cases}$$
Case (i) is a variation on the model used by Campbell and Erlebacher in their critique of Head Start evaluations. Our variation lies in splitting a single normal population rather than using two distinct normal distributions. Case (ii) seems very similar, but has strikingly different implications, as shown by Barnow (1972). Neither case corresponds literally to reality. For example, when true ability is unobserved, it cannot really provide the basis for selection. But these two polar cases should suffice to clarify the issues.
For future reference, note that either selection procedure splits the population into two equal-sized groups, so that in both cases the marginal distribution of $z$ is given by
$$p_0 = \mathrm{Prob}\{z = 0\} = \frac{1}{2}, \qquad p_1 = \mathrm{Prob}\{z = 1\} = \frac{1}{2}$$
whence
$$\begin{aligned} E(z) &= p_0 \cdot 0 + p_1 \cdot 1 = \frac{1}{2} \\ V(z) &= p_0\left(0 - \frac{1}{2}\right)^2 + p_1\left(1 - \frac{1}{2}\right)^2 = \frac{1}{4} \end{aligned} \tag{5}$$
We shall see that the two cases differ with respect to the covariance of z with the other variables.
3. TECHNICAL DIGRESSION

In general, the covariance of any variable $z$ with another variable $w$ can be computed by taking the covariance of $z$ with the conditional expectations of $w$ given $z$, that is,

$$C(z, w) = C(z, E(w \mid z)) = E\big((z - E(z))\,E(w \mid z)\big)$$

If $z$ is a binary variable taking on the values 0 and 1 with probabilities $p_0$ and $p_1$, respectively, we have

$$C(z, w) = p_0 (0 - E(z))\,E(w \mid 0) + p_1 (1 - E(z))\,E(w \mid 1) = -p_0 p_1 \big[E(w \mid 0) - E(w \mid 1)\big]$$

where $E(w \mid 0) \equiv E(w \mid z = 0)$ and $E(w \mid 1) \equiv E(w \mid z = 1)$. In the present setting $p_0 = p_1 = 1/2$, so that

$$C(z, w) = -\frac{1}{4}\big[E(w \mid 0) - E(w \mid 1)\big] \tag{6}$$

In words, the covariance of the treatment dummy with any variable is one-fourth of the difference between the mean of the variable in the experimental group and the mean of the variable in the control group.

To compute these group means and related measures, we draw on the following theorem: Let $w \sim N(\mu, \sigma^2)$, that is, let the density function of $w$ be

$$f(w) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{(w - \mu)^2}{\sigma^2}\right\}$$

Then, given that $a < w < b$, the conditional density of $w$ is

$$p(w \mid a < w < b) = \begin{cases} 0 & \text{for } w \le a \\ \dfrac{f(w)}{F(b) - F(a)} & \text{for } a < w < b \\ 0 & \text{for } b \le w \end{cases} \tag{7}$$

the conditional expectation of $w$ is

$$E(w \mid a < w < b) = \mu + \sigma^2\,\frac{f(a) - f(b)}{F(b) - F(a)} \tag{8}$$

and the conditional variance of $w$ is

$$V(w \mid a < w < b) = \sigma^2 \left\{ 1 + \frac{(a - \mu) f(a) - (b - \mu) f(b)}{F(b) - F(a)} - \sigma^2 \left[\frac{f(a) - f(b)}{F(b) - F(a)}\right]^2 \right\} \tag{9}$$
Here $F(t)$ denotes the cumulative normal distribution function, that is, $F(t) = \int_{-\infty}^{t} f(w)\,dw$. The theorem is proven in Appendix A; a numerical tabulation that illustrates the formulas is provided in Appendix B.

For the purposes of this paper, we apply the theorem directly to the upper half of a normal distribution. Setting $a = \mu$ and $b = \infty$, we find the conditional density function given that $w$ is above $\mu$:

$$p(w \mid \mu < w) = p(w \mid \mu < w < \infty) = \begin{cases} 0 & \text{for } w \le \mu \\ 2f(w) & \text{for } \mu < w \end{cases}$$

using $F(\mu) = 1/2$, $F(\infty) = 1$; the conditional expectation given that $w$ is above $\mu$:

$$E(w \mid \mu < w) = \mu + \sigma\sqrt{\frac{2}{\pi}} \tag{10}$$

using also $f(\mu) = (2\pi\sigma^2)^{-1/2}$ and $f(\infty) = 0$; and the conditional variance given that $w$ is above $\mu$:

$$V(w \mid \mu < w) = \frac{\sigma^2(\pi - 2)}{\pi} \tag{11}$$
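By symmetry, for the lower half of the normal distribution, we have:

$$p(w \mid w < \mu) = \begin{cases} 2f(w) & \text{for } w \le \mu \\ 0 & \text{for } \mu < w \end{cases}$$

$$E(w \mid w < \mu) = \mu - \sigma\sqrt{\frac{2}{\pi}}$$

$$V(w \mid w < \mu) = \frac{\sigma^2(\pi - 2)}{\pi}$$

Introducing the binary variable

$$z = \begin{cases} 0 & \text{if } \mu < w \\ 1 & \text{if } w < \mu \end{cases}$$

we write the conditional expectations and variances compactly as

$$E(w \mid z) = \mu + (1 - 2z)\,\sigma\sqrt{\frac{2}{\pi}} \tag{12}$$

$$V(w \mid z) = \frac{\sigma^2(\pi - 2)}{\pi} \tag{13}$$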
In conjunction with Eq. (6), the conditional expectation formula implies

$$C(z, w) = -\frac{1}{4} \cdot 2\sigma\sqrt{\frac{2}{\pi}} = -\frac{\sigma}{\sqrt{2\pi}} \tag{14}$$

which means that the correlation between $z$ and $w$ is

$$\rho_{zw} = \frac{C(z, w)}{\sqrt{V(z)\,V(w)}} = \frac{-\sigma/\sqrt{2\pi}}{\sqrt{(1/4)\sigma^2}} = -\sqrt{\frac{2}{\pi}} \tag{15}$$

As is to be expected, the value of this correlation is entirely independent of the values of $\mu$ and $\sigma^2$.

To summarize the results of our digression: We have split a normal population into two groups, those above and those below the mean, found the within-group means and variances, and also found the between-group variance (expressed in terms of the covariance of the normal variable with a binary variable depicting the split). In what follows, $x^*$ and $x$ will alternately take on the role of $w$.
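The results of the digression can be checked by simulation. The following is a minimal sketch, not part of the original analysis: it draws a normal sample, splits it at the population mean, and compares sample moments with Eqs. (10), (11), (14), and (15). The function name `split_at_mean`, the parameter values, the sample size, and the seed are all illustrative assumptions.

```python
import math
import random

def split_at_mean(mu=1.0, sigma=2.0, n=200_000, seed=0):
    """Draw w ~ N(mu, sigma^2), set z = 1 below the mean, return sample moments."""
    rng = random.Random(seed)
    w = [rng.gauss(mu, sigma) for _ in range(n)]
    z = [1.0 if wi < mu else 0.0 for wi in w]

    def mean(v):
        return sum(v) / len(v)

    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

    upper = [wi for wi, zi in zip(w, z) if zi == 0.0]  # the group with z = 0
    return {
        "mean_upper": mean(upper),
        "var_upper": cov(upper, upper),
        "cov_zw": cov(z, w),
        "corr_zw": cov(z, w) / math.sqrt(cov(z, z) * cov(w, w)),
    }

mu, sigma = 1.0, 2.0  # arbitrary illustrative values
sim = split_at_mean(mu, sigma)
theory = {
    "mean_upper": mu + sigma * math.sqrt(2 / math.pi),  # Eq. (10)
    "var_upper": sigma**2 * (math.pi - 2) / math.pi,    # Eq. (11)
    "cov_zw": -sigma / math.sqrt(2 * math.pi),          # Eq. (14)
    "corr_zw": -math.sqrt(2 / math.pi),                 # Eq. (15)
}
for key in theory:
    print(f"{key:>10}: simulated {sim[key]: .4f}, formula {theory[key]: .4f}")
```

Changing `mu` and `sigma` should move the first three quantities but leave the correlation near $-\sqrt{2/\pi}$, in line with the remark that it does not depend on the parameters.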
4. SELECTION ON BASIS OF TRUE ABILITY

In case (i) the individuals are assigned to the control group or to the experimental group according as their true ability is above or below the mean true ability in the population. Thus, $x^*$ plays the role that $w$ did in Section 3. Recalling that $V(x^*) = Q$, we set $\sigma$ in Eq. (14) equal to $\sqrt{Q}$ and find

$$C(z, x^*) = -\sqrt{\frac{Q}{2\pi}}$$

Furthermore, since $z$ is determined exactly (i.e., nonstochastically) by $x^*$, it must be independent of $u$ and $v$, which, as will be recalled, are independent of $x^*$. Thus,

$$C(z, y) = C(z, x^* + v) = C(z, x^*) + C(z, v) = C(z, x^*) = -\sqrt{\frac{Q}{2\pi}}$$

$$C(z, x) = C(z, x^* + u) = C(z, x^*) + C(z, u) = C(z, x^*) = -\sqrt{\frac{Q}{2\pi}}$$

This gives us the moments we need to compute the slopes in the multiple linear regression of $y$ on $x$ and $z$. Specifically, the normal equations in Eq. (4) specialize to

$$\begin{aligned} \frac{Q}{P}\,\alpha_1 - \sqrt{\frac{Q}{2\pi}}\,\alpha_2 &= Q \\ -\sqrt{\frac{Q}{2\pi}}\,\alpha_1 + \frac{1}{4}\,\alpha_2 &= -\sqrt{\frac{Q}{2\pi}} \end{aligned}$$

The solution to these normal equations is

$$\alpha_1 = \frac{P(\pi - 2)}{\pi - 2P}, \qquad \alpha_2 = -\frac{(1 - P)\sqrt{8\pi Q}}{\pi - 2P}$$
The intercept can then be obtained as $\alpha_0 = E(y) - \alpha_1 E(x) - \alpha_2 E(z) = -\alpha_2/2$ using $E(y) = E(x) = E(x^*) = 0$ and $E(z) = 1/2$. Note that with $0 < P < 1$, we have $\alpha_1 < P$, $\alpha_2 < 0$, and $0 < \alpha_0$, for $\pi = 3.14\ldots > 2 > 2P$.

The solution value for $\alpha_2$ shows that the selection procedure makes the coefficient of $z$ a biased measure of the true effect of the treatment (which is zero). As the variance ratio $P$ falls from 1 to 0, the value of $\alpha_2$ falls monotonically from 0 to $-\sqrt{8Q/\pi} \doteq -1.6\sqrt{Q}$; that is, the magnitude of the bias ($|\alpha_2|$) rises. The regression spuriously attributes to the treatment deleterious effects on posttest when in fact it had no effect whatsoever. The source of this bias lies in the selection procedure that assigned low-ability individuals to the experimental group and high-ability individuals to the control group. Those differences in true ability were manifested in differences in posttest scores. The treatment variable $z$ gets a negative coefficient because it is proxying (inversely) for true ability. This is the essence of the Campbell–Erlebacher argument, and we see that it holds up in our case (i).

To round out the discussion of case (i), we consider what happens when linear regressions of posttest on pretest are run separately in each of the two groups. To obtain those regressions we require the within-group moments (the conditional expectations, variances, and covariance) of posttest and pretest. First applying Eqs. (12)–(13) with $\mu = E(x^*) = 0$ and $\sigma^2 = V(x^*) = Q$, we obtain

$$E(x^* \mid z) = (1 - 2z)\sqrt{\frac{2Q}{\pi}} = E(y \mid z) \tag{16}$$

$$V(x^* \mid z) = \frac{(\pi - 2)Q}{\pi} \tag{17}$$
Next, recalling that $u$ and $v$ are independent of $x^*$ and thus of $z$ (which is an exact function of $x^*$), we deduce

$$E(x \mid z) = E((x^* + u) \mid z) = E(x^* \mid z) + E(u \mid z) = E(x^* \mid z) + E(u) = E(x^* \mid z) = (1 - 2z)\sqrt{\frac{2Q}{\pi}} = E(y \mid z) \tag{18}$$

$$V(x \mid z) = V((x^* + u) \mid z) = V(x^* \mid z) + V(u \mid z) = V(x^* \mid z) + V(u) = \frac{(\pi - 2)Q}{\pi} + \frac{(1 - P)Q}{P} = \frac{(\pi - 2P)Q}{\pi P} = V(y \mid z) \tag{19}$$

$$C(x, y \mid z) = C((x^* + u,\; x^* + v) \mid z) = V(x^* \mid z) = \frac{(\pi - 2)Q}{\pi} \tag{20}$$
Thus, within either group, the slope of the linear regression of posttest on pretest will be

$$\frac{C(x, y \mid z)}{V(x \mid z)} = \frac{P(\pi - 2)}{\pi - 2P}$$

which is exactly $\alpha_1$, the coefficient of $x$ in the overall multiple regression. The intercepts of course will differ; in the control group it is

$$E(y \mid 0) - \alpha_1 E(x \mid 0) = (1 - \alpha_1)\sqrt{\frac{2Q}{\pi}}$$

and in the experimental group it is

$$E(y \mid 1) - \alpha_1 E(x \mid 1) = -(1 - \alpha_1)\sqrt{\frac{2Q}{\pi}}$$

The difference between these two intercepts coincides with the coefficient of $z$ in the overall multiple regression:

$$-(1 - \alpha_1)\sqrt{\frac{2Q}{\pi}} - (1 - \alpha_1)\sqrt{\frac{2Q}{\pi}} = -2\,\frac{\pi(1 - P)}{\pi - 2P}\sqrt{\frac{2Q}{\pi}} = -\frac{(1 - P)\sqrt{8\pi Q}}{\pi - 2P} = \alpha_2$$

Thus, the spurious effect of the treatment turns up again as a difference in the level of the two within-group linear regressions, as Fig. 1 indicates. In this figure, and in those that follow, we take $P = .8$.

Fig. 1. Within-Group Linear Regressions of Posttest on Pretest (Selection Based on True Ability).

We have relied on linear regression for this analysis, despite the fact that the true regression of $y$ on $x$ is nonlinear in the present situation, that is, $E(y \mid x, z)$ is nonlinear in $x$. This complication is examined in Appendix C and may be studied more closely in future work. But for the present, there is no reason to believe that it would change our qualitative conclusions. In any event, the results given above are still valid for the best linear approximation to the true conditional expectation function, which is presumably what is fitted in applied studies.
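As a numerical illustration of the case (i) algebra, the sketch below inserts the population moments into the normal equations of Eq. (4), solves them by Cramer's rule, and compares the result with the closed-form slopes. The helper name `solve_normal_equations` and the scale $Q = 1$ are assumptions made for illustration; $P = .8$ follows the figures.

```python
import math

def solve_normal_equations(v_x, c_xz, v_z, c_xy, c_zy):
    """Solve the two normal equations of Eq. (4) for the slopes by Cramer's rule."""
    det = v_x * v_z - c_xz**2
    a1 = (c_xy * v_z - c_xz * c_zy) / det
    a2 = (v_x * c_zy - c_xz * c_xy) / det
    return a1, a2

P, Q = 0.8, 1.0  # P as in the figures; Q = 1 is an arbitrary scale

# Case (i): C(z, x) = C(z, y) = -sqrt(Q / (2 pi)), V(z) = 1/4, V(x) = Q/P, C(x, y) = Q
c = -math.sqrt(Q / (2 * math.pi))
a1, a2 = solve_normal_equations(v_x=Q / P, c_xz=c, v_z=0.25, c_xy=Q, c_zy=c)

# Closed-form solution given in the text
a1_closed = P * (math.pi - 2) / (math.pi - 2 * P)
a2_closed = -(1 - P) * math.sqrt(8 * math.pi * Q) / (math.pi - 2 * P)
a0 = -a2 / 2  # intercept, using E(y) = E(x) = 0 and E(z) = 1/2

print(f"coefficient of x: {a1:.4f} (closed form {a1_closed:.4f})")
print(f"coefficient of z: {a2:.4f} (closed form {a2_closed:.4f})")
print(f"intercept:        {a0:.4f}")
```

With these values the coefficient of $x$ comes out near 0.59 and the spurious coefficient of $z$ near $-0.65$, even though the true treatment effect is zero.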
5. SELECTION ON BASIS OF PRETEST SCORE

In case (ii) the individuals are assigned to the control group or to the experimental group according as their pretest score (not their true ability) is above or below the mean pretest score in the population. Thus, $x$ (not $x^*$) plays the role that $w$ did in Section 3. Recalling that $E(x) = 0$ and $V(x) = Q/P$, we set $\mu = 0$ and $\sigma = \sqrt{Q/P}$ in Eqs. (12) and (14) to find

$$E(x \mid z) = (1 - 2z)\sqrt{\frac{2Q}{\pi P}}$$

$$C(z, x) = -\sqrt{\frac{Q}{2\pi P}}$$

In the present case, $z$ is determined exactly by $x$, hence it will depend on both $x^*$ and $u$, but will remain independent of $v$. To obtain $C(z, y)$, we proceed as follows. For the population at large, we know that

$$E(u \mid x) = (1 - P)\,x$$

since $C(u, x) = V(u) = (1 - P)Q/P$ and $V(x) = Q/P$ together imply that in the regression of these joint-normal variables, the slope is $C(u, x)/V(x) = (1 - P)$, while $E(u) = 0 = E(x)$ implies that the intercept is zero. Since $z$ is an exact function of $x$, it follows that

$$E(u \mid z) = E(E(u \mid x) \mid z) = E(((1 - P)x) \mid z) = (1 - P)\,E(x \mid z)$$

Consequently,

$$E(x^* \mid z) = E((x - u) \mid z) = E(x \mid z) - E(u \mid z) = P\,E(x \mid z) = (1 - 2z)\sqrt{\frac{2PQ}{\pi}} \tag{21}$$

and, since $v$ is independent of $z$,

$$E(y \mid z) = E((x^* + v) \mid z) = E(x^* \mid z) + E(v \mid z) = E(x^* \mid z) + E(v) = E(x^* \mid z) = (1 - 2z)\sqrt{\frac{2PQ}{\pi}} \tag{22}$$

In conjunction with Eq. (6) this means that

$$C(z, y) = -\frac{1}{4} \cdot 2\sqrt{\frac{2PQ}{\pi}} = -\sqrt{\frac{PQ}{2\pi}} \tag{23}$$
Incidentally, Eq. (21) shows that the selection on the basis of pretest scores has made the two groups different in mean true ability, but a comparison of Eq. (21) with Eq. (16) shows that this difference is less (by the factor $\sqrt{P}$) than it was when the selection was strictly on the basis of true ability. In case (ii) the control group does not come entirely from the high-ability half of the population; it also includes low-ability individuals who happened to score unusually high on the pretest.

We now have the moments we need to compute the slopes in the multiple linear regression of $y$ on $x$ and $z$. The normal equations in Eq. (4) specialize to

$$\begin{aligned} \frac{Q}{P}\,\alpha_1 - \sqrt{\frac{Q}{2\pi P}}\,\alpha_2 &= Q \\ -\sqrt{\frac{Q}{2\pi P}}\,\alpha_1 + \frac{1}{4}\,\alpha_2 &= -\sqrt{\frac{PQ}{2\pi}} \end{aligned}$$

The solution to these normal equations is simply

$$\alpha_1 = P, \qquad \alpha_2 = 0$$

and the intercept is $\alpha_0 = E(y) - \alpha_1 E(x) - \alpha_2 E(z) = 0$.

The solution values show that the selection procedure does not make the coefficient of $z$ a biased measure of the true effect of the treatment (which is zero). In striking contrast to case (i), the regression correctly attributes no effect to the treatment. Despite the fact that the selection procedure tended to assign low-ability individuals to the experimental group and high-ability individuals to the control group, no spurious effect arises. Despite the fact that $z$ by itself is an (inverse) proxy for true ability, it fails to pick up, in the multiple regression, any credit for the effect of true ability on posttest.

The explanation for this serendipitous result is not hard to locate. Recall that $z$ is completely determined by pretest score $x$. It cannot contain any information about $x^*$ that is not contained in $x$. Consequently, when we control on $x$ as in the multiple regression, $z$ has no explanatory power with respect to $y$. More formally, the partial correlation of $y$ and $z$ controlling on $x$ vanishes although the simple correlation of $y$ and $z$ is nonzero.

To round out the discussion of case (ii), we consider what happens when separate regressions of $y$ on $x$ are run for the experimental and control groups.
We already have the within-group means; it remains to find the within-group variances and covariance. For pretest, we can apply Eq. (13) directly, setting $\sigma^2 = V(x) = Q/P$ to find

$$V(x \mid z) = \frac{(\pi - 2)Q}{\pi P} \tag{24}$$
For posttest, the route is more roundabout. We start with the decomposition of the marginal variance of $x^*$ into its between- and within-group components:

$$V(x^*) = V_z\,E(x^* \mid z) + E_z\,V(x^* \mid z)$$

Using Eq. (21) we compute the between-group component:

$$V_z\,E(x^* \mid z) = p_0 \big(E(x^* \mid 0) - E(x^*)\big)^2 + p_1 \big(E(x^* \mid 1) - E(x^*)\big)^2 = \frac{2PQ}{\pi}$$

Since $V(x^*) = Q$, it follows that the expected conditional variance is

$$E_z\,V(x^* \mid z) = Q - \frac{2PQ}{\pi} = \frac{(\pi - 2P)Q}{\pi}$$

By symmetry, this means that

$$V(x^* \mid z) = \frac{(\pi - 2P)Q}{\pi}$$

Then, since $v$ is independent of $z$ and $x^*$, we conclude that

$$V(y \mid z) = V((x^* + v) \mid z) = V(x^* \mid z) + V(v) = V(x^* \mid z) + \frac{(1 - P)Q}{P} = Q\left[\frac{\pi - 2P}{\pi} + \frac{1 - P}{P}\right] = \frac{(\pi - 2P^2)Q}{\pi P}$$
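Further, the marginal covariance of $x^*$ and $x$ decomposes into

$$C(x^*, x) = C_z\big(E(x^* \mid z),\, E(x \mid z)\big) + E_z\,C(x^*, x \mid z)$$

Using Eq. (21) and the expression for $E(x \mid z)$ found at the start of this section, we compute the between-group component:

$$\begin{aligned} C_z\big(E(x^* \mid z),\, E(x \mid z)\big) &= p_0 \big(E(x^* \mid 0) - E(x^*)\big)\big(E(x \mid 0) - E(x)\big) + p_1 \big(E(x^* \mid 1) - E(x^*)\big)\big(E(x \mid 1) - E(x)\big) \\ &= 2 \cdot \frac{1}{2}\sqrt{\frac{2PQ}{\pi}}\sqrt{\frac{2Q}{\pi P}} = \frac{2Q}{\pi} \end{aligned}$$

Since $C(x^*, x) = Q$, it follows by symmetry that

$$C(x^*, x \mid z) = Q - \frac{2Q}{\pi} = Q\left(1 - \frac{2}{\pi}\right) = \frac{(\pi - 2)Q}{\pi}$$

Since $v$ is independent of $x$ and $z$, we finally have

$$C(x, y \mid z) = C((x,\; x^* + v) \mid z) = C(x, x^* \mid z) = \frac{(\pi - 2)Q}{\pi} \tag{25}$$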
Taking together Eqs. (24) and (25), we see that within either group, the slope of the linear regression of posttest on pretest is

$$\frac{C(x, y \mid z)}{V(x \mid z)} = P$$

which coincides with the value for the coefficient of $x$ in the overall multiple regression. The intercepts will also be the same, namely zero:

$$E(y \mid z) - P\,E(x \mid z) = (1 - 2z)\sqrt{\frac{2PQ}{\pi}} - P(1 - 2z)\sqrt{\frac{2Q}{\pi P}} = 0$$

As Fig. 2 indicates, the two within-group regressions coincide with the overall regression, confirming the absence of a treatment effect.

Fig. 2. Within-Group Linear Regressions of Posttest on Pretest (Selection Based on Pretest Score).

The discussion in Lord and Novick (1968, pp. 141–147) provides a very simple derivation of the fact that no spurious treatment effect can arise in case (ii). Recall from Eq. (1) that for the population at large, the regression of posttest on pretest is linear. The same linear function holds over the entire range of $x$, and will be observed no matter what subrange of $x$ we choose to observe, as long as we do not tamper with the conditional distribution of $y$ given $x$. Nothing in the usual regression model requires that the distribution of the explanatory variable be representative of its distribution over the entire population. The only requirement is that the conditional distribution of the dependent variable given the explanatory variable be preserved. For the within-group regressions, in case (ii) we have simply selected a range of $x$; we have not tampered with the distribution of $y$ given $x$. Therefore, within each group we must get the same regression as we get overall. This argument, incidentally, demonstrates that the true regression of $y$ on $x$ and $z$ is linear (in contrast to case (i)). Note that this is true even within groups, where the distributions of $y$ and $x$ are clearly nonnormal.

Lord and Novick (1968, pp. 143–144) call attention to the fact that correlation coefficients, unlike regression coefficients, are sensitive to selection on the independent variable. In the present case, the overall correlation between $x$ and $y$ is

$$\rho_{xy} = \frac{C(x, y)}{\sqrt{V(x)\,V(y)}} = \frac{Q}{\sqrt{(Q/P)(Q/P)}} = P$$

while their within-group (i.e., partial given $z$) correlation is

$$\rho_{xy \cdot z} = \frac{C(x, y \mid z)}{\sqrt{V(x \mid z)\,V(y \mid z)}} = P\sqrt{\frac{\pi - 2}{\pi - 2P^2}}$$

Since $0 < P < 1$, we see that $\rho_{xy \cdot z} < \rho_{xy}$, as might be expected.
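The contrast between the two selection rules can also be seen in simulated data. The sketch below is illustrative only: it generates pretest and posttest scores with no true treatment effect, assigns treatment either on true ability (case (i)) or on pretest score (case (ii)), and fits the regression of posttest on pretest and the treatment dummy through the sample analogue of Eq. (4). The function names, sample size, seed, and the choice $Q = 1$ are assumptions; $P = .8$ follows the figures.

```python
import math
import random

def fit_two_regressor_ols(xs, ys, zs):
    """Slopes of y on x and z from the sample analogue of the normal equations (4)."""
    n = len(xs)
    mx, my, mz = sum(xs) / n, sum(ys) / n, sum(zs) / n

    def cov(a, ma, b, mb):
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

    v_x, v_z = cov(xs, mx, xs, mx), cov(zs, mz, zs, mz)
    c_xz, c_xy, c_zy = cov(xs, mx, zs, mz), cov(xs, mx, ys, my), cov(zs, mz, ys, my)
    det = v_x * v_z - c_xz**2
    return (c_xy * v_z - c_xz * c_zy) / det, (v_x * c_zy - c_xz * c_xy) / det

def simulate(select_on, P=0.8, Q=1.0, n=100_000, seed=1):
    """select_on='ability' gives case (i); select_on='pretest' gives case (ii)."""
    rng = random.Random(seed)
    sd_ability = math.sqrt(Q)
    sd_error = math.sqrt((1 - P) * Q / P)       # V(u) = V(v) = (1 - P) Q / P
    xs, ys, zs = [], [], []
    for _ in range(n):
        ability = rng.gauss(0.0, sd_ability)    # true ability
        x = ability + rng.gauss(0.0, sd_error)  # pretest
        y = ability + rng.gauss(0.0, sd_error)  # posttest; treatment adds nothing
        basis = ability if select_on == "ability" else x
        xs.append(x)
        ys.append(y)
        zs.append(1.0 if basis < 0 else 0.0)    # below average -> experimental group
    return fit_two_regressor_ols(xs, ys, zs)

a1_i, a2_i = simulate("ability")
a1_ii, a2_ii = simulate("pretest")
print(f"case (i):  coefficient of x = {a1_i:.3f}, coefficient of z = {a2_i:.3f}")
print(f"case (ii): coefficient of x = {a1_ii:.3f}, coefficient of z = {a2_ii:.3f}")
```

Under these assumptions the coefficient of $z$ should be clearly negative in case (i) and close to zero in case (ii), with the coefficient of $x$ close to $P$ only in case (ii).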
6. EFFICIENCY

The basic results for cases (i) and (ii) may be brought together in Table 1, along with those for the random selection procedure discussed at the end of Section 3 and identified here as case (o). We are reminded that case (ii), selection on basis of pretest, produces the same unbiased regression results as case (o), purely random selection. But we should not conclude that the two procedures are equally desirable. Recall that our analysis has been couched in terms of population parameters, so that sampling variability is ignored. For finite samples, the case (ii) regression will remain unbiased but, as we now show, is subject to more sampling variability than the case (o) regression. That is, random selection provides a more efficient experimental design.

Inspecting the table we see that in case (o) the explanatory variables are uncorrelated, that is, $C(z, x) = 0$, while in case (ii) they are correlated, $C(z, x) \ne 0$. This suggests that the standard errors of the regression coefficient estimates are larger in the latter case. Indeed, the usual formula for regression on two explanatory variables shows that the variance of regression coefficient estimators is multiplied by a factor $1/(1 - \rho^2)$ in moving from an uncorrelated to a correlated design, where $\rho^2$ is the squared correlation of the explanatory variables; cf. Kmenta (1971, p. 388). In case (ii), the relevant $\rho^2$ is

$$\rho^2_{xz} = \frac{C^2(x, z)}{V(x)\,V(z)} = \frac{Q/(2\pi P)}{(Q/P)(1/4)} = \frac{2}{\pi}$$
Table 1. Summary.

                                  Case (o)    Case (i)                                Case (ii)
Variances and covariances
  $V(y)$                          $Q/P$       $Q/P$                                   $Q/P$
  $C(y, x)$                       $Q$         $Q$                                     $Q$
  $V(x)$                          $Q/P$       $Q/P$                                   $Q/P$
  $C(y, z)$                       $0$         $-\sqrt{Q/(2\pi)}$                      $-\sqrt{PQ/(2\pi)}$
  $C(x, z)$                       $0$         $-\sqrt{Q/(2\pi)}$                      $-\sqrt{Q/(2\pi P)}$
  $V(z)$                          $1/4$       $1/4$                                   $1/4$
Linear regression coefficients
  $\alpha_1$                      $P$         $P(\pi - 2)/(\pi - 2P)$                 $P$
  $\alpha_2$                      $0$         $-(1 - P)\sqrt{8\pi Q}/(\pi - 2P)$      $0$
implying that the sampling variances of the regression coefficients in case (ii) will be

$$\frac{1}{1 - (2/\pi)} = \frac{\pi}{\pi - 2} \doteq 2.75$$

times as large as they are in case (o). (It is interesting that this numerical conclusion is entirely independent of the values of the parameters of the model, $P$ and $Q$.) Thus, the efficiency of the random selection procedure is confirmed: a random sample of size 100 is as good as a selected-on-pretest sample of size 275.

In the preceding calculation we relied implicitly on the assumption that the disturbance variance did not change with the change in experimental design. This assumption is justified since in both cases $V(y) = Q/P$ and $E(y \mid x, z) = E(y \mid x) = Px$, which implies that

$$V(y \mid x, z) = V(y \mid x) = V(y) - V_x\,E(y \mid x) = \frac{Q}{P} - P^2\,V(x) = \frac{Q}{P} - P^2\,\frac{Q}{P} = (1 - P^2)\frac{Q}{P}$$

in both cases.
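The efficiency comparison can be illustrated by a small Monte Carlo experiment. The sketch below is an assumption-laden illustration rather than part of the original analysis: it repeatedly draws finite samples under random assignment (case (o), implemented here as independent fair coin flips) and under selection on pretest (case (ii)), estimates the coefficient of $z$ in each sample, and compares the sampling variances of the two sets of estimates. The function names, the number of replications, the sample size, and the seed are arbitrary choices.

```python
import math
import random
import statistics

def treatment_coefficient(xs, ys, zs):
    """Coefficient of z in the least-squares regression of y on x and z."""
    n = len(xs)
    mx, my, mz = sum(xs) / n, sum(ys) / n, sum(zs) / n

    def cov(a, ma, b, mb):
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

    v_x, v_z = cov(xs, mx, xs, mx), cov(zs, mz, zs, mz)
    c_xz, c_xy, c_zy = cov(xs, mx, zs, mz), cov(xs, mx, ys, my), cov(zs, mz, ys, my)
    return (v_x * c_zy - c_xz * c_xy) / (v_x * v_z - c_xz**2)

def draw_sample(rng, design, n, P, Q):
    """One sample of size n; design is 'random' (case (o)) or 'pretest' (case (ii))."""
    sd_ability = math.sqrt(Q)
    sd_error = math.sqrt((1 - P) * Q / P)
    xs, ys, zs = [], [], []
    for _ in range(n):
        ability = rng.gauss(0.0, sd_ability)
        x = ability + rng.gauss(0.0, sd_error)
        y = ability + rng.gauss(0.0, sd_error)
        if design == "random":
            z = 1.0 if rng.random() < 0.5 else 0.0
        else:
            z = 1.0 if x < 0 else 0.0
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

def variance_ratio(reps=800, n=150, P=0.8, Q=1.0, seed=2):
    """Sampling variance of the z coefficient: pretest design relative to random design."""
    rng = random.Random(seed)
    estimates = {"random": [], "pretest": []}
    for design in estimates:
        for _ in range(reps):
            estimates[design].append(treatment_coefficient(*draw_sample(rng, design, n, P, Q)))
    return statistics.variance(estimates["pretest"]) / statistics.variance(estimates["random"])

ratio = variance_ratio()
print(f"simulated variance ratio: {ratio:.2f}; population factor: {math.pi / (math.pi - 2):.2f}")
```

With a finite number of replications the simulated ratio is itself noisy, so it should be read as roughly consistent with the factor $\pi/(\pi - 2)$ rather than as an exact reproduction of it.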
ACKNOWLEDGMENTS

The research reported here was supported in part by funds granted to the Institute for Research on Poverty at the University of Wisconsin by the Office of Economic Opportunity pursuant to the provisions of the Economic Opportunity Act of 1964. The conclusions are the sole responsibility of the author. The research draws heavily on the work of Burt Barnow and Glen Cain. The author thanks Irv Garfinkel and Edward Gramlich for helpful comments.
APPENDIX A. CONDITIONAL DENSITY AND MOMENTS

To evaluate the density, expectations, and variance conditional on a normally distributed variable lying within a specified interval, we begin with the case of the standard normal distribution.

Let $s \sim N(0, 1)$, that is, let the density function of $s$ be

$$f^*(s) = (2\pi)^{-1/2} \exp\left\{-\frac{1}{2}s^2\right\}$$

The probability that $s$ lies in the interval between $a$ and $b$ is $F^*(b) - F^*(a)$, where

$$F^*(s) = \int_{-\infty}^{s} f^*(r)\,dr = (2\pi)^{-1/2} \int_{-\infty}^{s} \exp\left\{-\frac{1}{2}r^2\right\} dr$$

denotes the cumulative standard normal distribution. Therefore, the conditional distribution of $s$ given that $a < s < b$ is given by the density function

$$p^*(s \mid \cdot) \equiv p^*(s \mid a < s < b) = \begin{cases} 0 & \text{for } s \le a \\ \dfrac{f^*(s)}{F^*(b) - F^*(a)} & \text{for } a < s < b \\ 0 & \text{for } b \le s \end{cases} \tag{A.1}$$

The moment-generating function for this distribution is

$$m(t) \equiv E(e^{ts}) = \int_{-\infty}^{\infty} e^{ts}\,p^*(s \mid \cdot)\,ds = \frac{(2\pi)^{-1/2}}{F^*(b) - F^*(a)} \int_{a}^{b} \exp\{ts\} \exp\left\{-\frac{1}{2}s^2\right\} ds \tag{A.2}$$
Completing the square in the exponent via

$$ts - \frac{1}{2}s^2 = -\frac{1}{2}(s - t)^2 + \frac{1}{2}t^2$$

we rewrite Eq. (A.2) as

$$\big(F^*(b) - F^*(a)\big)\,m(t) = \exp\left\{\frac{1}{2}t^2\right\} \left[ (2\pi)^{-1/2} \int_{a}^{b} \exp\left\{-\frac{1}{2}(s - t)^2\right\} ds \right]$$

The term in square brackets will be recognized as the probability that a $N(t, 1)$ variable lies in the interval between $a$ and $b$, which is equal to the probability that a $N(0, 1)$ variable lies in the interval between $a - t$ and $b - t$, namely $F^*(b - t) - F^*(a - t)$. Thus,

$$\big(F^*(b) - F^*(a)\big)\,m(t) = \exp\left\{\frac{1}{2}t^2\right\} \big(F^*(b - t) - F^*(a - t)\big)$$
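Differentiating with respect to $t$ gives

$$\big(F^*(b) - F^*(a)\big)\,m'(t) = t \exp\left\{\frac{1}{2}t^2\right\} \big(F^*(b - t) - F^*(a - t)\big) + \exp\left\{\frac{1}{2}t^2\right\} \big(-f^*(b - t) + f^*(a - t)\big)$$

using $F^{*\prime}(s) = f^*(s)$. Setting $t = 0$ to generate the first moment we find

$$\big(F^*(b) - F^*(a)\big)\,m'(0) = \big(-f^*(b) + f^*(a)\big)$$

that is,

$$E(s \mid a < s < b) = m'(0) = \frac{f^*(a) - f^*(b)}{F^*(b) - F^*(a)}$$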
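Differentiating a second time with respect to $t$ gives

$$\begin{aligned} \big(F^*(b) - F^*(a)\big)\,m''(t) ={}& t \exp\left\{\tfrac{1}{2}t^2\right\} \big(-f^*(b - t) + f^*(a - t)\big) \\ &+ t^2 \exp\left\{\tfrac{1}{2}t^2\right\} \big(F^*(b - t) - F^*(a - t)\big) \\ &+ \exp\left\{\tfrac{1}{2}t^2\right\} \big(F^*(b - t) - F^*(a - t)\big) \\ &+ \exp\left\{\tfrac{1}{2}t^2\right\} \big(-(b - t) f^*(b - t) + (a - t) f^*(a - t)\big) \\ &+ t \exp\left\{\tfrac{1}{2}t^2\right\} \big(-f^*(b - t) + f^*(a - t)\big) \end{aligned}$$

using $f^{*\prime}(s) = -s f^*(s)$. Setting $t$ equal to zero to generate the second moment, we find

$$\big(F^*(b) - F^*(a)\big)\,m''(0) = \big(F^*(b) - F^*(a)\big) + \big(a f^*(a) - b f^*(b)\big)$$

that is,

$$E(s^2 \mid a < s < b) = m''(0) = 1 + \frac{a f^*(a) - b f^*(b)}{F^*(b) - F^*(a)}$$

The variance could now be computed as

$$V(s \mid a < s < b) = E(s^2 \mid a < s < b) - E^2(s \mid a < s < b) = m''(0) - (m'(0))^2 \tag{A.3}$$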
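Proceeding to the general case, let $w \sim N(\mu, \sigma^2)$, that is, the density function of $w$ is

$$f(w) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2}\left(\frac{w - \mu}{\sigma}\right)^2\right\}$$

and the cumulative distribution of $w$ is

$$F(w) = \int_{-\infty}^{w} f(r)\,dr = (2\pi\sigma^2)^{-1/2} \int_{-\infty}^{w} \exp\left\{-\frac{1}{2}\left(\frac{r - \mu}{\sigma}\right)^2\right\} dr$$

Note that

$$f(w) = \frac{1}{\sigma}\,f^*\!\left(\frac{w - \mu}{\sigma}\right), \qquad F(w) = F^*\!\left(\frac{w - \mu}{\sigma}\right) \tag{A.4}$$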
where $f^*(\cdot)$ and $F^*(\cdot)$ are the standardized functions. We introduce the standardized variable $s = (w - \mu)/\sigma$. The event "$a < w < b$" is identical with the event "$a^* < s < b^*$," where

$$a^* = \frac{a - \mu}{\sigma}, \qquad b^* = \frac{b - \mu}{\sigma}$$

Therefore, the probability that $a < w < b$ is identical with the probability that $a^* < s < b^*$, namely $F^*(b^*) - F^*(a^*) = F(b) - F(a)$. From Eqs. (A.1) and (A.4) it follows that the conditional probability distribution of $w$ given that $a < w < b$ is given by the density function

$$p(w \mid a < w < b) = \begin{cases} 0 & \text{for } w \le a \\ \dfrac{f(w)}{F(b) - F(a)} & \text{for } a < w < b \\ 0 & \text{for } b \le w \end{cases}$$

This is Eq. (7) in the text. Further, with $w = \mu + \sigma s$ everywhere, it must be true that for any event

$$E(w \mid \cdot) = \mu + \sigma E(s \mid \cdot)$$

Specifically

$$E(w \mid a < w < b) = \mu + \sigma E(s \mid a^* < s < b^*) = \mu + \sigma\,\frac{f^*(a^*) - f^*(b^*)}{F^*(b^*) - F^*(a^*)} = \mu + \sigma\,\frac{\sigma f(a) - \sigma f(b)}{F(b) - F(a)} = \mu + \sigma^2\,\frac{f(a) - f(b)}{F(b) - F(a)} \tag{A.5}$$
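using the expression for $E(s \mid a < s < b)$ derived above together with Eq. (A.4). This is Eq. (8) in the text. Similarly, it must be true that for any event,

$$E(w^2 \mid \cdot) = \sigma^2 E(s^2 \mid \cdot) + 2\mu\sigma E(s \mid \cdot) + \mu^2$$

so

$$V(w \mid \cdot) = E(w^2 \mid \cdot) - E^2(w \mid \cdot) = \sigma^2 \big[E(s^2 \mid \cdot) - E^2(s \mid \cdot)\big]$$

Specifically

$$\begin{aligned} V(w \mid a < w < b) &= \sigma^2 \big[E(s^2 \mid a^* < s < b^*) - E^2(s \mid a^* < s < b^*)\big] \\ &= \sigma^2 \left\{ 1 + \frac{a^* f^*(a^*) - b^* f^*(b^*)}{F^*(b^*) - F^*(a^*)} - \left[\frac{f^*(a^*) - f^*(b^*)}{F^*(b^*) - F^*(a^*)}\right]^2 \right\} \\ &= \sigma^2 \left\{ 1 + \frac{(a - \mu) f(a) - (b - \mu) f(b)}{F(b) - F(a)} - \sigma^2 \left[\frac{f(a) - f(b)}{F(b) - F(a)}\right]^2 \right\} \end{aligned} \tag{A.6}$$

using the conditional moments of $s$ derived above together with Eqs. (A.3) and (A.4). This is Eq. (9) in the text.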
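The closed forms in Eqs. (8) and (9) can be checked against direct numerical integration of the truncated density. The following is a minimal sketch using only the Python standard library; the function names, the midpoint-rule integrator, and the particular values of the interval endpoints and of the mean and standard deviation are illustrative assumptions.

```python
import math

def normal_pdf(w, mu, sigma):
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(w, mu, sigma):
    return 0.5 * (1 + math.erf((w - mu) / (sigma * math.sqrt(2))))

def truncated_moments_formula(a, b, mu, sigma):
    """Conditional mean and variance of w given a < w < b, from Eqs. (8) and (9)."""
    fa, fb = normal_pdf(a, mu, sigma), normal_pdf(b, mu, sigma)
    mass = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
    ratio = (fa - fb) / mass
    mean = mu + sigma**2 * ratio
    var = sigma**2 * (1 + ((a - mu) * fa - (b - mu) * fb) / mass - sigma**2 * ratio**2)
    return mean, var

def truncated_moments_numeric(a, b, mu, sigma, steps=20_000):
    """The same moments by midpoint-rule integration over (a, b)."""
    h = (b - a) / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        w = a + (i + 0.5) * h
        d = normal_pdf(w, mu, sigma) * h
        m0 += d
        m1 += w * d
        m2 += w * w * d
    mean = m1 / m0
    return mean, m2 / m0 - mean**2

a, b, mu, sigma = -1.0, 2.5, 0.5, 1.5  # arbitrary illustrative values
mean_f, var_f = truncated_moments_formula(a, b, mu, sigma)
mean_n, var_n = truncated_moments_numeric(a, b, mu, sigma)
print(f"mean:     formula {mean_f:.6f}, numeric {mean_n:.6f}")
print(f"variance: formula {var_f:.6f}, numeric {var_n:.6f}")
```

The numerical integrator requires finite endpoints; for a one-sided interval such as the upper half used in the text, a large finite value of `b` can stand in for infinity.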
APPENDIX B. ILLUSTRATION OF CONDITIONAL MOMENTS

The following tabulation may serve to illustrate the consequences of selecting a subpopulation from a normal distribution. Constructed for a standard normal distribution $s \sim N(0, 1)$, the table indicates for various values of $a$, the probability that a random drawing exceeds $a$, namely $1 - F^*(a)$; the conditional expectation given that it exceeds $a$, namely

$$E(s \mid a < s) = \frac{f^*(a)}{1 - F^*(a)}$$

and the conditional variance given that it exceeds $a$, namely

$$V(s \mid a < s) = 1 + \frac{a f^*(a)}{1 - F^*(a)} - \left[\frac{f^*(a)}{1 - F^*(a)}\right]^2 = 1 + a\,E(s \mid a < s) - E^2(s \mid a < s) = 1 - E(s \mid a < s)\big(E(s \mid a < s) - a\big)$$

The formulas here are obtained by taking $b = \infty$ in the conditional expectation and variance formulas for $s$ derived in Appendix A.

Cutoff Point a    Probability of Selection    Conditional Expectation    Conditional Variance
$-\infty$         1.000                       0                          1.00
$-2$               .977                        .06                        .88
$-1$               .841                        .29                        .62
$-0.5$             .692                        .51                        .49
$0$                .500                        .80                        .36
$0.5$              .308                       1.14                        .27
$1$                .159                       1.52                        .21
$2$                .023                       2.38                        .10
Our table may be compared with that in Lord and Novick (1968, p. 141), which uses different cutoff points, reports conditional standard deviations rather than variances, and does not report conditional expectations.
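A tabulation of this kind can be regenerated from the one-sided formulas of this appendix. The sketch below uses only the Python standard library; the function names are hypothetical, and the computed values may differ from the printed table in the final digit.

```python
import math

def std_pdf(s):
    return math.exp(-0.5 * s * s) / math.sqrt(2 * math.pi)

def std_cdf(s):
    return 0.5 * (1 + math.erf(s / math.sqrt(2)))

def selection_row(a):
    """Probability of selection, conditional mean, and conditional variance for s > a."""
    prob = 1 - std_cdf(a)
    mean = std_pdf(a) / prob
    var = 1 - mean * (mean - a)
    return prob, mean, var

print(f"{'cutoff':>8} {'prob':>8} {'mean':>8} {'variance':>8}")
for a in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    prob, mean, var = selection_row(a)
    print(f"{a:>8.1f} {prob:>8.3f} {mean:>8.2f} {var:>8.2f}")
```

The row for an infinitely low cutoff is omitted because it is simply the unconditional standard normal, with probability 1, mean 0, and variance 1.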
APPENDIX C. EXACT REGRESSIONS WHEN SELECTION IS BASED ON TRUE ABILITY

To develop the exact (nonlinear) regression functions of $y$ on $x$ when selection is based on true ability, we proceed as follows. For typographical convenience in this appendix we denote $x^*$ by $w$ and $V(u)$ by

$$R = \frac{(1 - P)Q}{P}$$

For the population at large we have $w \sim N(0, Q)$ and $x \mid w \sim N(w, R)$. The respective densities are

$$f(w) = (2\pi Q)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{w^2}{Q}\right\}$$

$$p(x \mid w) = (2\pi R)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{(x - w)^2}{R}\right\}$$

For the lower half of the true ability distribution (i.e., the treatment group, for whom $z = 1$), the density of $w$ is

$$p(w \mid 1) = \begin{cases} 2f(w) & \text{for } w < 0 \\ 0 & \text{for } 0 \le w \end{cases}$$
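while the density of $x \mid w$ is still $p(x \mid w)$. Thus, the joint density of $x$ and $w$ is

$$p(x, w \mid 1) = p(x \mid w)\,p(w \mid 1) = \begin{cases} 2p(x \mid w) f(w) & \text{for } w < 0 \\ 0 & \text{for } 0 \le w \end{cases} \tag{C.1}$$

Now

$$\begin{aligned} p(x \mid w) f(w) &= (2\pi R)^{-1/2} (2\pi Q)^{-1/2} \exp\left\{-\frac{1}{2}\left[\frac{(x - w)^2}{R} + \frac{w^2}{Q}\right]\right\} \\ &= \left(\frac{2\pi Q}{P}\right)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{P x^2}{Q}\right\} \big(2\pi(1 - P)Q\big)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{(w - Px)^2}{(1 - P)Q}\right\} \end{aligned} \tag{C.2}$$

where we have completed the square in the exponent via

$$\frac{(x - w)^2}{R} + \frac{w^2}{Q} = \frac{Q + R}{QR}\left(w - \frac{Qx}{Q + R}\right)^2 + \frac{1}{R}\left(1 - \frac{Q}{Q + R}\right)x^2 = \frac{(w - Px)^2}{(1 - P)Q} + \frac{P x^2}{Q}$$

using the definition of $R$. Thus, $p(x, w \mid 1)$, the joint density of $x$ and $w$ in the treatment group, is zero for $0 \le w$, and is twice the expression in Eq. (C.2) for $w < 0$. The density of $x$ in the treatment group can now be obtained as

$$\begin{aligned} p(x \mid 1) &= \int_{-\infty}^{\infty} p(x, w \mid 1)\,dw = \int_{-\infty}^{0} 2p(x \mid w) f(w)\,dw \\ &= 2\left(\frac{2\pi Q}{P}\right)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{P x^2}{Q}\right\} \big(2\pi(1 - P)Q\big)^{-1/2} \int_{-\infty}^{0} \exp\left\{-\frac{1}{2}\,\frac{(w - Px)^2}{(1 - P)Q}\right\} dw \end{aligned} \tag{C.3}$$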
This distribution of pretest scores in the treatment group, sketched in Fig. 3, is clearly nonnormal. (High pretest scores are a rare phenomenon in the treatment group, which by construction has no high-ability individuals.)

Fig. 3. Distribution of Pretest, Treatment Group (Selection Based on True Ability).

From Eqs. (C.1)–(C.3), the conditional density of $w$ given $x$ in the treatment group now follows:

$$p(w \mid x, 1) = \frac{p(x, w \mid 1)}{p(x \mid 1)} = \begin{cases} \dfrac{f_x(w)}{F_x(0)} & \text{for } w < 0 \\ 0 & \text{for } 0 \le w \end{cases}$$

where

$$f_x(w) = \big(2\pi(1 - P)Q\big)^{-1/2} \exp\left\{-\frac{1}{2}\,\frac{(w - Px)^2}{(1 - P)Q}\right\}, \qquad F_x(w) = \int_{-\infty}^{w} f_x(r)\,dr$$

We recognize $f_x(w)$ as the density of a $N(Px, (1 - P)Q)$ variable, which means that $f_x(w)/F_x(0)$ is the conditional density of such a variable given that the variable is less than zero. Consequently,

$$E(w \mid x, 1) = \int_{-\infty}^{\infty} w\,p(w \mid x, 1)\,dw = \int_{-\infty}^{0} \frac{w f_x(w)}{F_x(0)}\,dw = \frac{1}{F_x(0)} \int_{-\infty}^{0} w f_x(w)\,dw \tag{C.4}$$

is the expected value of a $N(Px, (1 - P)Q)$ variable given that the variable is less than zero. Applying Eq. (A.5) we find

$$E(w \mid x, 1) = Px + (1 - P)Q\,\frac{f_x(-\infty) - f_x(0)}{F_x(0) - F_x(-\infty)} = Px - (1 - P)Q\,\frac{f_x(0)}{F_x(0)}$$
As in Appendix A, let $f^*(\cdot)$ and $F^*(\cdot)$ denote the standard normal density and cumulative distribution functions, respectively. Then using Eq. (A.4), write

$$f_x(w) = \frac{1}{\sqrt{(1 - P)Q}}\,f^*\!\left(\frac{w - Px}{\sqrt{(1 - P)Q}}\right), \qquad F_x(w) = F^*\!\left(\frac{w - Px}{\sqrt{(1 - P)Q}}\right)$$

Introduce the transformation

$$s = \frac{Px}{\sqrt{(1 - P)Q}} \tag{C.5}$$

and write

$$\frac{f_x(0)}{F_x(0)} = \frac{1}{\sqrt{(1 - P)Q}}\,\frac{f^*(-s)}{F^*(-s)} = \frac{1}{\sqrt{(1 - P)Q}}\,\frac{f^*(s)}{1 - F^*(s)} \tag{C.6}$$

using $f^*(-s) = f^*(s)$ and $F^*(-s) = 1 - F^*(s)$. Finally, inserting Eq. (C.6) into the expression for $E(w \mid x, 1)$ obtained from Eq. (A.5), we have

$$E(w \mid x, 1) = Px - \sqrt{(1 - P)Q}\,\frac{f^*(s)}{1 - F^*(s)} \tag{C.7}$$

as the regression function of true ability on pretest in the treatment group. By symmetry, the regression function of true ability on pretest in the control group (i.e., for $z = 0$) must be

$$E(w \mid x, 0) = Px + \sqrt{(1 - P)Q}\,\frac{f^*(s)}{F^*(s)} \tag{C.8}$$

That these last two equations are also the regression of posttest on pretest for the two groups follows from the fact that $y = x^* + v$ with $v$ independent of $x$ and $x^*$ (and hence of $z$).

The shape of these curves is not hard to characterize. Consider the treatment group. As $x \to -\infty$, $s \to -\infty$, so $f^*(s)/(1 - F^*(s)) \to f^*(-\infty)/(1 - F^*(-\infty)) = 0/(1 - 0) = 0$, which means that $E(y \mid x, 1) \to Px$. However, using L'Hospital's rule,

$$\lim \frac{f^*(s)}{1 - F^*(s)} = \lim \frac{f^{*\prime}(s)}{-F^{*\prime}(s)} = \lim \frac{-s f^*(s)}{-f^*(s)} = \lim\,(s)$$

So as $x \to \infty$, $s \to \infty$ and $f^*(s)/(1 - F^*(s)) \to \infty$ at the same rate as $s$; the term subtracted in Eq. (C.7) then approaches $\sqrt{(1 - P)Q}\,s = Px$, which means that $E(y \mid x, 1)$ is asymptotic to the horizontal axis. For the control group, we have the mirror image, as shown in Fig. 4, which also has the linear regressions. The fact that $E(y \mid x, z)$ is always negative for $z = 1$ and always positive for $z = 0$ is an automatic consequence of the fact that the selection procedure kept true ability always negative for the treatment group and always positive for the control group. We see that the exact regressions are nonlinear in $x$ and that the spurious treatment effect shows up in a nonadditive manner.

What can be said about the spurious treatment effect in the exact (nonlinear) regressions as compared with the spurious treatment effect in the approximate (linear) regressions? In Section 4, we saw that the control group line was parallel to the treatment group line and lay above it by the constant amount

$$-\alpha_2 = \frac{(1 - P)\sqrt{8\pi Q}}{\pi - 2P} = h$$

say.
Now we see that the control group curve lies above the treatment group curve by the variable amount

$$\sqrt{(1 - P)Q}\left[\frac{1}{F^*(s)} + \frac{1}{1 - F^*(s)}\right] f^*(s) = \frac{\sqrt{(1 - P)Q}\,f^*(s)}{F^*(s)\big(1 - F^*(s)\big)} = h(x)$$

say.

Fig. 4. Within-Group Regressions of Posttest on Pretest (Selection Based on True Ability).

In particular, at $x = 0$, the distance between the curves is

$$h(0) = \sqrt{(1 - P)Q}\,\frac{(2\pi)^{-1/2}}{(1/2)(1/2)} = h\,g(P)$$

say. For $0 < P < .67$, we find that $.96 < g(P) < 1$, so that $h(0)$ is slightly less than, but virtually indistinguishable from, $h$. For $.67 < P < .80$ we find $1 < g(P) < 1.10$, so that $h(0)$ is slightly greater than $h$. Then for $.80 \le P < 1$, $g(P)$ continues to rise and $h(0)$ becomes substantially greater than $h$. The picture is somewhat mixed, but on balance it seems that the linear regressions may show less of a spurious treatment effect than the curvilinear regressions.
Admittedly, it is not clear that x=0 is a sensible point at which to compare the treatment and control curves. As noted at the end of Section 5, the complication of curvilinearity does not arise when selection is on the basis of pretest scores rather than true ability.
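The comparison between the constant gap of the linear regressions and the variable gap of the exact regressions can be tabulated directly. The sketch below evaluates both expressions and their ratio at a pretest score of zero; the function names are hypothetical and only the Python standard library is used. The ratio is computed numerically from the two gap expressions rather than from a separate closed form.

```python
import math

def std_pdf(s):
    return math.exp(-0.5 * s * s) / math.sqrt(2 * math.pi)

def std_cdf(s):
    return 0.5 * (1 + math.erf(s / math.sqrt(2)))

def linear_gap(P, Q):
    """Constant gap h between the two within-group lines of Section 4."""
    return (1 - P) * math.sqrt(8 * math.pi * Q) / (math.pi - 2 * P)

def exact_gap(x, P, Q):
    """Gap h(x) between the exact control and treatment regressions.

    Intended for moderate x only: far in the tails the product
    F(s) * (1 - F(s)) underflows to zero in floating point.
    """
    scale = math.sqrt((1 - P) * Q)
    s = P * x / scale          # the transformation of Eq. (C.5)
    F = std_cdf(s)
    return scale * std_pdf(s) / (F * (1 - F))

def gap_ratio_at_zero(P, Q=1.0):
    """g(P) = h(0) / h; the scale Q cancels in the ratio."""
    return exact_gap(0.0, P, Q) / linear_gap(P, Q)

for P in (0.2, 0.4, 0.67, 0.8, 0.9):
    print(f"P = {P:.2f}: g(P) = {gap_ratio_at_zero(P):.3f}")
```

Evaluating `exact_gap` at other moderate values of the pretest score gives a sense of how the spurious effect varies along the curves, which bears on the question of whether a score of zero is the most sensible point of comparison.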
POSTSCRIPT This chapter originally appeared in April 1972 as Discussion Paper 123–72 of the Institute of Research on Poverty at the University of Wisconsin. In the intervening third of a century, the paper circulated widely and was occasionally cited as an early contribution to the econometrics of selection bias. Never having submitted it for publication, I cannot attribute its heretofore unpublished state to the shortsightedness of referees who failed to recognize its innovative quality. The stimulus to the paper was very specific, namely the argument by Donald Campbell and Albert Erlebacher that contemporary evaluations of the Head Start program systematically understated its effectiveness because treatment groups were typically drawn from lower-ability segments of the population. My colleague Glen Cain readily convinced me that their argument – namely that pre-existing differences between treatment and control groups inevitably produce bias – was incorrect. Suppose, he said that SES was a relevant determinant of outcome, that the probability of assignment to treatment varied with SES, and that SES was used as a control variable in the analysis of the effect of treatment on outcome. No bias need be expected, even though the SES distribution differed between treatment and control groups. Donald Campbell was the leading methodologist in educational psychology, a discipline in which classical test score theory is central. There true values are unobserved but reflected in observed scores that may be preand post scores. (The formal theory is virtually identical to Milton Friedman’s Permanent Income hypothesis, with true score playing the role of permanent income, and the observed test scores playing the roles of measured income and consumption.) I wanted to formalize Cain’s argument. So I was led to address the issue in terms of classical test score theory, where normality is characteristically taken for granted. 
For simplicity I used a sharp cutoff for assignment to treatment and control – that is, a truncation selection mechanism, although I did not know that language at the time. Evidently I was unaware of the attention devoted to selection in the genetics literature. A standard textbook, Crow and
Kimura (1970, Chapter 5) considered milk production in a herd of cows, which initially follows a normal distribution. The animal breeder wants to improve milk production, so he selects only cows whose production lies above a truncation point, culling – that is, discarding – the rest. The authors derived the formula for the mean production of the selected subherd. Another standard text available at the time, Cavalli-Sforza and Bodmer (1971, pp. 606–608), dealt with the distribution of human heights under natural selection. They remarked that “It is, of course, ridiculous to assume that there is a sharp threshold in stature below which all people die before reproducing and above which all survive. A more plausible model … would assign some smooth regularly increasing curve relating fitness to height. … A convenient analytical shape for such a curve is provided by … the integral of the normal distribution.” This “directional selection” led to a selected population that has a normal density multiplied by a normal cumulative function. They tabulated the mean and variance of the resulting distribution and discussed parameter estimation. But for the fact that no covariates were used, this textbook material had the structure of the bivariate-normal sample selection model that emerged in econometrics in the immediately following years. There, truncation selection on one variable leads to directional selection on a correlated outcome. In my paper, in case (i), selection is truncation on true ability, and thus directional on the correlated variable pretest score. In case (ii) selection is truncation on pretest score. Had I been aware of those sources, I might have taken a less pedantic and tedious approach to deriving established formulas. However, it may be that my pedantic strain is genetic, or at least congenital. More embarrassing was my failure to realize that some of the calculation was to be found in the study by Tobin (1958), a reference that I did know.
I happened to use a sharp cutoff for convenience, which is apparently what led my paper to be cited over the years as an early contribution to the regression discontinuity design literature. I did soon realize that it was not the sharp cutoff that underlay the unbiasedness of the regression in case (ii), but rather the fact that the treatment variable, while correlated with true ability, provided no information about true ability once pretest was controlled for. So my analysis is best viewed as making the distinction between selection on observables and selection on unobservables. Burt Barnow, a doctoral student working with Cain, incorporated his contribution into his doctoral dissertation (Barnow, 1973). Cain (1975) set the critique of Campbell and Erlebacher into a broader context. Finally,
I am grateful to Dawn Duren and Peter Nepokroeff for expert help in bringing my paper into the electronic age.
REFERENCES
Barnow, B. S. (1972). Conditions for the presence or absence of a bias in treatment effect: Some statistical models for head start evaluation. Institute for Research on Poverty, University of Wisconsin–Madison, Discussion Paper no. 122–72.
Barnow, B. S. (1973). The Effects of Head Start and Socioeconomic Status on Cognitive Development of Disadvantaged Children. Doctoral Dissertation, Department of Economics, University of Wisconsin.
Cain, G. G. (1975). Regression and selection models to improve nonexperimental comparisons. In: Evaluation and experiment, some critical issues in assessing social programs (pp. 297–317). New York: Academic Press.
Campbell, D. T., & Erlebacher, A. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In: J. Hellmuth (Ed.), Compensatory education: A national debate of The Disadvantaged Child (Vol. III). New York: Brunner Mazel.
Cavalli-Sforza, L. L., & Bodmer, W. F. (1971). The genetics of human populations. San Francisco: W. H. Freeman.
Crow, J., & Kimura, M. (1970). An introduction to population genetics. New York: Harper & Row.
Kmenta, J. (1971). Elements of econometrics. New York: McGraw-Hill.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26(1), 24–36.
THE EVENT-HISTORY APPROACH TO PROGRAM EVALUATION$
Jaap H. Abbring
ABSTRACT
This paper studies the event-history approach to microeconometric program evaluation. We present a mixed semi-Markov event-history model, discuss its application to program evaluation, and analyze its empirical content. The results of this paper provide fundamental insights into what can be learned from longitudinal microdata about, for example, the effects of training programs for the unemployed on their unemployment durations and subsequent job stability. They can guide the choice of particular models and methods for the empirical analysis of such effects.
1. INTRODUCTION
Event-history methods are an important tool for the microeconometric evaluation of dynamic programs using longitudinal data. For example, the effects of training and counseling on unemployment durations and job stability have been analyzed by applying event-history methods to data on
$ Revised version of “The Non-Parametric Identification of Mixed Semi-Markov Event-History Models” (July 2000).
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 33–55
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00002-3
individual labor-market and training histories (Ridder, 1986; Card & Sullivan, 1988; Gritz, 1993; Ham & LaLonde, 1996; Eberwein, Ham, & LaLonde, 1997; Bonnal, Fougère, & Sérandon, 1997). Similarly, the moral hazard effects of unemployment insurance have been studied by analyzing the effects of time-varying benefits on labor-market transitions (e.g., Meyer, 1990; Abbring, Van den Berg, & Van Ours, 2005; Van den Berg, Van der Klaauw, & Van Ours, 2004). In fields like epidemiology, the use of event-history models to analyze treatment effects is widespread (see, e.g., Andersen, Borgan, Gill, & Keiding, 1993; Keiding, 1999).
In this chapter, we study the event-history approach to program evaluation.1 We present a mixed semi-Markov event-history model. We discuss its applications to program evaluation and develop some novel identification results.
The event-history approach to program evaluation is firmly rooted in the econometric literature on state dependence and heterogeneity (Heckman & Borjas, 1980; Heckman, 1981a). In the tradition of the selection-model literature, event-history models along the lines of Heckman and Singer (1984a, 1986) are used to jointly model transitions into programs and transitions into outcome states. Causal effects of programs are modeled as the dependence of individual transition rates on the individual history of program participation. Dynamic selection effects are modeled by allowing for dependent unobserved heterogeneity in both the program and outcome transition rates.
Without restrictions on the class of models considered, true state dependence and dynamic selection effects cannot be distinguished. Any history dependence of current transition rates can be explained both as true state dependence and as the result of unobserved heterogeneity that simultaneously affects the history and current transitions. This is a dynamic manifestation of the problem of causal inference from observational data.
It is the fundamental problem of distinguishing state dependence and heterogeneity. In applied work, researchers avoid this problem by imposing additional structure. One example is a mixed semi-Markov model in which the causal effects are restricted to the effects of program participation in the previous spell (e.g., Bonnal et al., 1997). There is a substantial literature that studies the structure needed to enable the identification of state dependence and heterogeneity in duration and event-history models from longitudinal microdata (see Heckman & Taber, 1994; Van den Berg, 2001, for reviews). However, little is known about the identifiability of general event-history models. The existing literature restricts attention to either single-spell
two-state models (e.g., Elbers & Ridder, 1982; Heckman & Singer, 1984b; Ridder, 1990; Kortram, Lenstra, Ridder, & Van Rooij, 1995; Abbring, 2002, 2007), multispell two-state models (Honoré, 1993), or competing-risks models (Heckman & Honoré, 1989; Abbring & Van den Berg, 2003a). None of these models handles the effect of a dynamically assigned treatment, like a training program, on event-history outcomes such as unemployment durations. Abbring and Van den Berg (2003b) develop results for a structural bivariate duration model of the effect of a single treatment time on an outcome duration. Their model can be rewritten as a particular three-state event-history model with state dependence (see Section 2). In this chapter, we discuss more general event-history models. We will focus on mixed semi-Markov models, which allow for dynamic selection and various forms of state dependence, including duration dependence and dependence on the previous state occupied (“lagged occurrence dependence”).
The chapter is organized as follows. Section 2 presents the mixed semi-Markov event-history model and discusses its relation to models that have been used in empirical work. Section 3 discusses the model’s identifiability from a random sample of censored event histories. Section 4 discusses alternative sampling schemes. Section 5 concludes with some remarks on the implications for applied empirical work.
2. A MIXED SEMI-MARKOV EVENT-HISTORY MODEL
2.1. Model
The model is set up along the lines of Heckman and Singer (1984a, 1986). Point of departure is a continuous-time stochastic process assuming values in a finite set $\mathcal{S}$ at each point in time. We will interpret realizations of this process as agents’ event histories of transitions between states in the state space $\mathcal{S}$. Suppose that event histories start at real-valued random times $T_0$ in an $\mathcal{S}$-valued random state $S_0$, and that subsequent transitions occur at random times $T_1, T_2, \ldots$ such that $T_0 < T_1 < T_2 < \cdots$. Let $S_l$ be the random destination state of the transition at $T_l$. Taking the sample paths of the event-history process to be right-continuous, we have that $S_l$ is the state occupied in the interval $[T_l, T_{l+1})$. Suppose that heterogeneity between agents is captured by vectors of time-constant observed covariates $X$ and unobserved covariates $V$.2 Then, state
dependence in the event-history process for given individual characteristics $X, V$ has a causal interpretation.3 We structure such state dependence by assuming that the event-history process conditional on $X, V$ is a time-homogeneous semi-Markov process: Conditional on $X, V$, the length of a spell in a state and the destination state of the transition ending that spell depend only on the past through the current state. In our notation,
$$(\Delta T_l, S_l) \perp\!\!\!\perp \{(T_i, S_i),\ i = 0, \ldots, l-1\} \mid S_{l-1}, X, V,$$
where $\Delta T_l := T_l - T_{l-1}$ is the length of spell $l$. Also, the distribution of $(\Delta T_l, S_l) \mid S_{l-1}, X, V$ does not depend on $l$. Note that, conditional on $X, V$, $\{S_l, l \ge 0\}$ is a time-homogeneous Markov chain under these assumptions.
Nontrivial dynamic selection effects arise because $V$ is not observed. The event-history process conditional only on observed covariates $X$ is a mixed semi-Markov process. If $V$ affects the initial state $S_0$, or transitions from there, subpopulations of agents in different states at some time $t$ typically have different distributions of the unobserved characteristics $V$. Therefore, a comparison of the subsequent transitions in two such subpopulations does not only reflect state dependence, but also sorting of agents with different unobserved characteristics into the different states they occupy at time $t$.
We model $\{(\Delta T_l, S_l), l \ge 1\} \mid T_0, S_0, X, V$ as a repeated competing-risks model. Due to the mixed semi-Markov assumption, the latent durations corresponding to transitions into the possible destination states in the $l$th spell only depend on the past through the current state $S_{l-1}$, conditional on $X, V$. This implies that we can fully specify the repeated competing-risks model by specifying a set of origin-destination-specific latent durations, with corresponding transition rates. Let $T^l_{jk}$ denote the latent duration corresponding to the transition from state $j$ to state $k$ in spell $l$.
We explicitly allow for the possibility that transitions between certain (ordered) pairs of states may be impossible. To this end, define the correspondence $\mathcal{Z}: \mathcal{S} \to \mathcal{P}(\mathcal{S})$ assigning to each $s \in \mathcal{S}$ the set of all destination states to which transitions are made from $s$ with positive probability.4 Here, $\mathcal{P}(\mathcal{S})$ is the set of all subsets of $\mathcal{S}$ (the “power set” of $\mathcal{S}$). Then, the length of spell $l$ is given by $\Delta T_l = \min_{s \in \mathcal{Z}(S_{l-1})} T^l_{S_{l-1}s}$, and the destination state by $S_l = \arg\min_{s \in \mathcal{Z}(S_{l-1})} T^l_{S_{l-1}s}$. We take the latent durations to be mutually independent, jointly independent from $T_0, S_0$, and identically distributed across spells $l$, all conditional on $X, V$. This reflects both the mixed semi-Markov assumption and the additional assumption that all dependence between the latent durations corresponding to the competing risks in a given spell $l$ is captured by the observed regressors $X$ and the unobservables $V$. This is a standard assumption in econometric duration analysis, which, with the semi-Markov
assumption, allows us to characterize the distribution of $\{(\Delta T_l, S_l), l \ge 1\} \mid T_0, S_0, X, V$ by specifying origin-destination-specific hazards $\theta_{jk}(t \mid X, V)$ for the marginal distributions of $T^l_{jk} \mid X, V$. We assume that the hazards $\theta_{jk}(t \mid X, V)$ are of the mixed proportional hazard (MPH) type:5
$$\theta_{jk}(t \mid X, V) = \begin{cases} \lambda_{jk}(t)\,\phi_{jk}(X)\,V_{jk} & \text{if } k \in \mathcal{Z}(j) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
The baseline hazards $\lambda_{jk}: \mathbb{R}_+ \to (0, \infty)$ capture duration dependence of the individual transition rates. They have integrals $\Lambda_{jk}(t) := \int_0^t \lambda_{jk}(\tau)\,d\tau < \infty$ for all $t \in \mathbb{R}_+$. The regressor functions $\phi_{jk}: \mathcal{X} \to (0, \infty)$ are assumed to be continuous, with $\mathcal{X} \subset \mathbb{R}^q$ being the support of $X$. In applications, these functions are frequently specified as $\phi_{jk}(x) = \exp(x'\beta_{jk})$ for some parameter vector $\beta_{jk}$. We will not make such parametric assumptions. Note that the fact that all regressor functions are defined on the same domain $\mathcal{X}$ is not restrictive, because each function $\phi_{jk}$ can “select” certain elements of $X$ by being a trivial function of the other elements. In the parametric example, the vector $\beta_{jk}$ would only have nonzero elements for those regressors that matter to the transition from $j$ to $k$. Finally, the $(0, \infty)$-valued random variable $V_{jk}$ is the scalar component of $V$ that affects the transition from state $j$ to state $k$. Note that we allow for general dependence between the components of $V$. This way, we can capture, for example, that agents with lower reemployment rates have higher training enrollment rates.
We normalize
$$\Lambda_{jk}(t^*) = 1 \quad \text{and} \quad \phi_{jk}(x^*) = 1, \qquad j \in \mathcal{S},\ k \in \mathcal{Z}(j),$$
for some a priori chosen $t^* \in (0, \infty)$ and $x^* \in \mathcal{X}$. These normalizations are innocuous, because $V_{jk}$ can capture the scale of $\theta_{jk}$.
This fully characterizes the distribution of the transitions $\{(\Delta T_l, S_l), l \ge 1\}$ conditional on the initial conditions $T_0, S_0$ and the agents’ characteristics $X, V$. A complete model of the event histories $\{(T_l, S_l), l \ge 0\}$ conditional on $X, V$ would, in addition, require a specification of the initial conditions $T_0, S_0$ for given $X, V$. It is important to stress here that $T_0, S_0$ are the initial conditions of the event-history process itself, and should not be confused with the initial conditions in a particular sample (which we will discuss in Section 4). In empirical work, interest in the dependence between start times $T_0$ and characteristics $X, V$ is often limited to the observation that the distribution of agents’ characteristics may vary over cohorts indexed by $T_0$. The choice of initial state $S_0$ may, in general, be of some interest, but is often trivial. For example, we could model labor-market histories from the
calendar time $T_0$ at which agents turn 15 onwards. In an economy with perfect compliance with a mandatory schooling age over 15, the initial state $S_0$ would be “(mandatory) schooling” for all. Therefore, we will not consider a model of the event history’s initial conditions, but instead focus on the conditional model of subsequent transition histories. Because of the semi-Markov assumption, the distribution of $\{(\Delta T_l, S_l), l \ge 1\} \mid T_0, S_0, X, V$ only depends on $S_0$, and not $T_0$. Thus, $T_0$ only affects observed event histories through cohort effects on the distribution of unobserved characteristics $V$. The initial state $S_0$, however, may both have causal effects on subsequent transitions and be informative on the distribution of $V$. For expositional clarity, we assume that the distribution of unobserved covariates does not vary over cohorts, or more precisely that $V \perp\!\!\!\perp T_0 \mid S_0, X$, throughout the chapter.6
An econometric model for transition histories conditional on the observed covariates $X$ can be derived from the model of $\{(\Delta T_l, S_l), l \ge 1\} \mid S_0, X, V$ by aggregating over $V$. The exact way this should be done depends on the sampling scheme used. First, in Section 3, we consider sampling from the population of event histories. We assume that we observe the covariates $X$, the initial state $S_0$, and the first $L$ transitions from there. Then, we can model these transitions for given $S_0, X$ by integrating the conditional model over the distribution of $V \mid S_0, X$. Next, in Section 4, we briefly discuss more complex, and arguably more realistic, sampling schemes. For example, when studying labor-market histories we may randomly sample from the stock of unemployed at a particular point in time. Because the unobserved factor $V$ affects the probability of being unemployed at the sampling date, the distribution of $V \mid X$ in the stock sample does not equal its population distribution. Moreover, in this case we typically do not observe an agent’s entire labor-market history from $T_0$ onwards.
Instead, we may have data on the time spent in unemployment at the sampling date and on labor-market transitions for some period after the sampling date. This “initial-conditions problem” complicates matters further (Heckman, 1981b).
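As a rough illustration of the repeated competing-risks structure described in this section, the following Python sketch simulates event histories from a mixed semi-Markov model with MPH hazards. It is not taken from the chapter. The three hypothetical states (untreated O, program P, exit E), the Weibull baselines normalized at $t^* = 1$, the exponential regressor functions normalized at $x^* = 0$, the parameter values, and the two-type distribution of $V$ are all assumptions made here for concreteness.

```python
import math
import random

# Hypothetical correspondence Z: feasible destinations from each origin state.
Z = {"O": ["P", "E"], "P": ["E"], "E": []}

# Assumed Weibull baselines lambda_jk(t) = a * t**(a - 1), so that
# Lambda_jk(t) = t**a and the normalization Lambda_jk(1) = 1 holds.
ALPHA = {("O", "P"): 1.0, ("O", "E"): 0.8, ("P", "E"): 1.2}

# Assumed regressor functions phi_jk(x) = exp(x * beta_jk), so phi_jk(0) = 1.
BETA = {("O", "P"): 0.5, ("O", "E"): -0.3, ("P", "E"): 0.2}


def draw_latent_duration(j, k, x, v, rng):
    """Draw T_jk with hazard lambda_jk(t) * phi_jk(x) * v[(j, k)].

    The survival function is exp(-t**a * phi * v), so T_jk is obtained by
    inverting the integrated hazard at a unit-exponential draw.
    """
    e = rng.expovariate(1.0)
    scale = math.exp(x * BETA[(j, k)]) * v[(j, k)]
    return (e / scale) ** (1.0 / ALPHA[(j, k)])


def draw_heterogeneity(rng):
    """Two hypothetical unobserved types with dependent components of V:
    low reemployment rates go together with high enrollment rates."""
    if rng.random() < 0.5:
        return {("O", "P"): 2.0, ("O", "E"): 0.5, ("P", "E"): 0.5}
    return {("O", "P"): 0.5, ("O", "E"): 2.0, ("P", "E"): 2.0}


def simulate_history(s0, x, v, max_spells, rng):
    """Return [(spell length, destination), ...] for at most max_spells spells."""
    history, state = [], s0
    for _ in range(max_spells):
        if not Z[state]:  # no feasible destination: absorbing state
            break
        latent = {k: draw_latent_duration(state, k, x, v, rng) for k in Z[state]}
        dest = min(latent, key=latent.get)  # the shortest latent duration wins
        history.append((latent[dest], dest))
        state = dest
    return history
```

A call such as `simulate_history("O", 0.3, draw_heterogeneity(random.Random(1)), 5, random.Random(2))` returns one history conditional on $S_0 = O$. Discarding the drawn `v` afterwards mimics the fact that only $X$ and the transitions are observed.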
2.2. Applications to Program Evaluation
Several empirical papers study the effect of a single treatment on some outcome duration or set of transitions. Two approaches can be distinguished. In the first approach, the outcome and treatment processes are explicitly and separately specified. The second approach distinguishes
treatment as a separate state in a single event-history model with state dependence.
The first approach is used in a variety of papers in labor economics. Eberwein et al. (1997) specify a model for labor-market transitions in which the transition intensities between various labor-market states (not including treatment) depend on whether someone has been assigned to a training program in the past or not. Abbring et al. (2005) and Van den Berg et al. (2004) specify a model for reemployment durations in which the reemployment hazard depends on whether a punitive benefits reduction has been imposed in the past. Similarly, Van den Berg, Holm, and Van Ours (2002) analyze the duration up to transition into medical trainee positions and the effect of an intermediate transition into a medical assistant position (a “stepping-stone job”) on this duration.
These models fit Abbring and Van den Berg’s (2003b) framework, or a multistate extension thereof. The model considered by Abbring and Van den Berg is a bivariate duration model in which realization of the outcome duration censors the treatment duration, and realization of the treatment duration changes the hazard of the outcome duration from that time onwards. We can rephrase this type of model in terms of a simple event-history model with state dependence as follows. Distinguish three states, untreated ($O$), treated ($P$), and the exit state of interest ($E$), so that $\mathcal{S} = \{O, P, E\}$. All subjects start in $O$, so that $S_0 = O$. Obviously, we do not want to allow for all possible transitions between these three states. Instead, we restrict the correspondence $\mathcal{Z}$ representing the possible transitions as follows:
$$\mathcal{Z}(s) = \begin{cases} \{P, E\} & \text{if } s = O \\ \{E\} & \text{if } s = P \\ \emptyset & \text{if } s = E \end{cases}$$
Simple state dependence of the transition rates into $E$ will already capture a treatment effect in the sense of Abbring and Van den Berg. Not all models in their paper are, however, included in the simple semi-Markov setup discussed here.
In particular, in this chapter we do not allow the transition rate from $P$ to $E$ to depend on the duration spent in $O$. This extension with “lagged duration dependence” (Heckman & Borjas, 1980) would be required to capture one variant of their model. The model for transitions from “untreated” ($O$) is a competing-risks model, with program enrollment (transition to $P$) and employment ($E$) competing to end the untreated spell. If the unobservable factor $V_{OE}$ that determines transitions to employment and the unobservable factor $V_{OP}$
affecting program enrollment are dependent, then program enrollment is selective in the sense that the distribution of $V_{OE}$ – and then typically that of $V_{PE}$ – among those who enroll at a given point in time does not equal its distribution among survivors in $O$ up to that time.7
The second approach is used by, for example, Gritz (1993) and Bonnal et al. (1997). Consider the following simplified setup. Suppose workers are either employed ($E$), unemployed ($O$), or engaged in a training program ($P$). We can now specify a transition process among these three labor-market states in which a causal effect of training on unemployment and employment durations is modeled as dependence of the various transition rates on the past occurrence of a training program in the labor-market history. Partly to avoid initial-conditions problems, Bonnal et al. (1997) restrict attention to first-order lagged occurrence dependence. So, suppose that transition rates only depend on the current and previous state occupied. Such a model is not directly covered by the semi-Markov model, but with a simple augmentation of the state space it will be. In particular, we have to include lagged states in the state space on which the transition process is defined. Because there is no lagged state in the event history’s first spell, initial states should be defined separately. So, instead of just distinguishing states in $\mathcal{S}^* = \{E, O, P\}$, we distinguish augmented states in $\mathcal{S} = \{(s, s') \in (\mathcal{S}^* \cup \{I\}) \times \mathcal{S}^* : s \neq s'\}$. Then, $(I, s)$, $s \in \mathcal{S}^*$, denote the initial states, and $(s, s') \in \mathcal{S}$ the augmented state of an agent who is currently in $s'$ and came from $s \neq s'$. In order to preserve the interpretation of the model as a model of lagged occurrence dependence, we have to exclude certain transitions by specifying
$$\mathcal{Z}(s, s') = \{(s', s''),\ s'' \in \mathcal{S}^* \setminus \{s'\}\}.$$
This excludes transitions to augmented states that are labeled with a lagged state different from the origin state. Also, it ensures that agents never return to an initial state.
For example, from the augmented state $(O, P)$ – previously unemployed and currently enrolled in a program – only transitions to augmented states $(P, s'')$ – previously enrolled in a program and currently in $s''$ – are possible. Moreover, it is not possible to be currently employed and transiting to initially unemployed, $(I, O)$. Rather, an employed worker who loses his job would transit to $(E, O)$ – currently unemployed and previously employed. The effects of, for example, training are now modeled as simple state-dependence effects. For example, the effect of training on the transition rate from unemployment to employment is simply the contrast between the
individual transition rate from $(E, O)$ to $(O, E)$ and the transition rate from $(P, O)$ to $(O, E)$. Dynamic selection into the augmented states $(E, O)$ and $(P, O)$, as specified by the transition model, confounds the empirical analysis of these training effects. Note that there are no longer-run effects of training on transition rates from unemployment to employment, because we have restricted attention to first-order lagged occurrence dependence, similar to Bonnal et al. (1997).
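The state-space augmentation used for first-order lagged occurrence dependence can be made concrete with a short Python sketch. The names `BASE_STATES`, `INITIAL`, `AUGMENTED`, and `destinations` are hypothetical labels introduced here, not notation from the chapter. The sketch only enumerates the augmented states and the feasible transitions; it does not specify hazards.

```python
from itertools import product

BASE_STATES = ("E", "O", "P")  # employed, unemployed, in a training program
INITIAL = "I"                  # marker for "no lagged state" in the first spell

# Augmented states (previous, current) with previous != current.
AUGMENTED = [
    (prev, cur)
    for prev, cur in product((INITIAL,) + BASE_STATES, BASE_STATES)
    if prev != cur
]


def destinations(state):
    """Feasible destinations from (s, s'): all (s', s'') with s'' != s'."""
    _, current = state
    return [(current, nxt) for nxt in BASE_STATES if nxt != current]


# From "previously unemployed, now in a program" only (P, s'') is reachable.
print(destinations(("O", "P")))
```

Under this construction there are nine augmented states, three of them initial, and no state of the form `(INITIAL, s)` ever appears as a destination. The training effect discussed above corresponds to comparing the rates attached to the pairs `(("E", "O"), ("O", "E"))` and `(("P", "O"), ("O", "E"))`.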
3. THE MODEL’S EMPIRICAL CONTENT
3.1. Sampling Scheme
Suppose that we randomly sample from the population of event histories and observe the first $L$ transitions, including destinations, for each sampled event history, with possibly $L = \infty$. Thus, we observe a random sample of $\{(T_l, S_l), l \in \{0, 1, \ldots, L\}\}$ and $X$. If $L < \infty$ then our data are right censored; if $L = \infty$ they are not.8
3.2. Identification from First Transitions and Variation in Initial Conditions
First, consider what can be learned from data on the first transition from the initial state $S_0$ only. Denote the support of $S_0$ by $\mathcal{S}_0$. For $j \in \mathcal{S}$, let $\#\mathcal{Z}(j)$ denote the number of elements in $\mathcal{Z}(j)$, i.e., the number of destination states that are reached with positive probability from $j$. Consider the following assumptions.
Assumption 1. $(V_{jk}, k \in \mathcal{Z}(j)) \perp\!\!\!\perp X \mid S_0$, for all $j \in \mathcal{S}_0$.
Assumption 2. $E[V_{jk}] < \infty$, for all $k \in \mathcal{Z}(j)$ and $j \in \mathcal{S}_0$.
Assumption 3. The range $\{(\phi_{jk}(x), k \in \mathcal{Z}(j));\ x \in \mathcal{X}\}$ of the regressor functions contains a nonempty open set in $(0, \infty)^{\#\mathcal{Z}(j)}$, for all $j \in \mathcal{S}_0$.
These are multivariate extensions of assumptions that are standard in the single-spell MPH literature (e.g., Elbers & Ridder, 1982). Assumption 1 requires independence of the observed covariates and the unobserved heterogeneity in the relevant subpopulations. Because we only observe a single transition from each origin state and cannot apply panel-data
techniques to deal with unobserved heterogeneity, this is necessary for the identification of the regressor functions $\phi_{jk}$. Assumption 2 is a technical, but far from innocuous, assumption (Ridder, 1990). Without it, the integrated baseline hazards $\Lambda_{jk}$ and regressor functions $\phi_{jk}$ can only be identified up to power transformations.9 Such transformations may substantially change the interpretation of $\Lambda_{jk}$ and $\phi_{jk}$. Finally, Assumption 3 ensures that there is independent variation with the regressors of the individual hazard rates $\theta_{jk}(t \mid X, V)$ corresponding to the various competing risks in state $j$. With $\phi_{jk}(x) = \exp(x'\beta_{jk})$, it would be sufficient that $(\beta_{jk}; k \in \mathcal{Z}(j))$ has full column rank and $\mathcal{X}$ contains a nonempty open set in $\mathbb{R}^q$, for all $j \in \mathcal{S}_0$. In turn, this could be achieved by imposing exclusion restrictions of the sort encountered in instrumental-variables analysis. However, such exclusion restrictions are not necessary for Assumption 3 to hold. We have the following result.
Proposition 1. If Assumptions 1–3 are satisfied, then $((\phi_{jk}, \Lambda_{jk}), j \in \mathcal{S}_0, k \in \mathcal{Z}(j))$ and the joint distributions of $(V_{jk}, k \in \mathcal{Z}(j)) \mid S_0 = j$, $j \in \mathcal{S}_0$, are identified from the distribution of $\Delta T_1, S_1 \mid S_0, X$.
Proof. For each $j \in \mathcal{S}_0$, the model of $\Delta T_1, S_1 \mid S_0 = j, X$ is an MPH competing-risks model. The result follows from repeated application of Abbring and Van den Berg (2003a, Proposition 2).10
Because the model for the first transition from the initial state is an MPH competing-risks model, the proof of Proposition 1 is a direct application of Abbring and Van den Berg’s (2003a) identification results for such models. The intuition for these results comes in two stages. First, consider the transition rate from state $j$ to state $k$ among those who have survived for some time $t$ in their initial state $j$,
$$\lambda_{jk}(t)\,\phi_{jk}(X)\,E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X].$$
This “crude” hazard rate can be computed from the distribution of $\Delta T_1, S_1 \mid S_0, X$ that Proposition 1 takes as data.
For $t > 0$, $E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X]$ typically depends on $X$, and variation of the crude hazard rate with the covariates reflects both these selection effects and the agent-level effects through $\phi_{jk}(X)$. However, because of Assumptions 1 and 2, $E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X]$ reduces to $E[V_{jk} \mid S_0 = j] < \infty$ as $t \downarrow 0$. Thus, near the start of the spell, by Assumption 1, subpopulations with different regressor values are similar in terms of their unobserved components. Therefore, we can identify
$\phi_{jk}$ by contrasting crude hazard rates between such subpopulations near the start of the spell. Second, note that, for $t > 0$ and given $\phi_{jk}(X)$, the crude hazard rate above can only depend on $X$ through the selection effects on
$$E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X] = E[V_{jk} \mid T^1_{jk'} \ge t,\ k' \in \mathcal{Z}(j),\ S_0 = j, X].$$
Now, suppose that $V_{jk} \perp\!\!\!\perp (V_{jk'}, k' \in \mathcal{Z}(j) \setminus \{k\}) \mid S_0, X$. Then, the event $\{T^1_{jk'} \ge t,\ k' \in \mathcal{Z}(j) \setminus \{k\}\}$ is not informative on $V_{jk}$ for given $S_0, X$, and
$$E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X] = E[V_{jk} \mid T^1_{jk} \ge t, S_0 = j, X].$$
Thus, in this case, $E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X]$, and therefore the crude hazard rate, does not depend on $X$ for given $\phi_{jk}(X)$. If $V_{jk} \not\perp\!\!\!\perp (V_{jk'}, k' \in \mathcal{Z}(j) \setminus \{k\}) \mid S_0, X$, then $E[V_{jk} \mid \Delta T_1 \ge t, S_0 = j, X]$ depends on $\phi_{jk'}(X)$ for given $\phi_{jk}(X)$ through the dependence of $V_{jk}$ and $V_{jk'}$. In sum, independent variation in $\phi_{jk'}(X)$, $k' \in \mathcal{Z}(j) \setminus \{k\}$, and $\phi_{jk}(X)$ can be exploited to infer whether the competing risks are dependent or not. Assumption 3 ensures that there is such independent variation in the regressor effects.
In applications, like those in Section 2, we are typically interested in contrasting the distributions of $T^1_{jk}$ and $T^1_{j'k}$ in a subpopulation with given values of $X$ and $S_0$, for some $j, j' \in \mathcal{S}_0$ and $k \in \mathcal{Z}(j) \cap \mathcal{Z}(j')$. Such a contrast can be interpreted as an average treatment effect on the given subpopulation.11 However, Proposition 1 only provides identification conditional on $S_0$. In particular, it gives conditions under which we can construct the distributions of $T^1_{jk} \mid S_0 = j, X$ and $T^1_{j'k} \mid S_0 = j', X$. The contrast between these distributions reflects both causal treatment effects and selection into the initial state. This is the standard problem of causal inference. Some standard solutions to this problem, adapted to this event-history setting, are the following.12 First, we could assume that assignment to initial states is “randomized”:
Assumption 4. $V \perp\!\!\!\perp S_0$.
This, with Assumption 1, would allow us to identify the distributions of $T^1_{jk} \mid X$ and $T^1_{jk} \mid X, S_0$ – our objects of interest – with that of $T^1_{jk} \mid S_0 = j, X$. The latter is identified by Proposition 1. Second, we could rely on instruments to generate random variation in $S_0$. Abbring and Van den Berg’s (2005) nonparametric and semiparametric results for single-spell duration outcomes apply directly to the special case that there is only one destination from the initial states. Their extension to competing-risks outcomes is required in the general case.
The MPH structure will prove key here in separating the effects on the competing risks.13
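The two-stage intuition behind Proposition 1 can be checked numerically in a stylized case. The Python sketch below is an illustration only: it assumes unit baseline hazards, two competing destinations $k$ and $k'$, and hypothetical two-point distributions for $(V_{jk}, V_{jk'})$, none of which comes from the chapter. It evaluates the crude hazard $\phi_{jk}\,E[V_{jk} \mid \Delta T_1 \ge t]$ in closed form.

```python
import math

# Hypothetical joint distributions of (V_k, V_k2): (probability, v_k, v_k2).
DEPENDENT = [(0.5, 2.0, 0.5), (0.5, 0.5, 2.0)]
# Same marginals, but independent components.
INDEPENDENT = [(0.25, 2.0, 0.5), (0.25, 2.0, 2.0),
               (0.25, 0.5, 0.5), (0.25, 0.5, 2.0)]


def crude_hazard_k(t, phi_k, phi_k2, types):
    """Crude hazard into k at duration t, assuming unit baseline hazards.

    Survival up to t for a given type is exp(-t * (phi_k*v_k + phi_k2*v_k2)),
    which reweights the types among survivors.
    """
    weights = [p * math.exp(-t * (phi_k * vk + phi_k2 * vk2))
               for p, vk, vk2 in types]
    mean_vk = sum(w * vk for w, (_, vk, _) in zip(weights, types)) / sum(weights)
    return phi_k * mean_vk


# Stage 1: at t = 0 the ratio of crude hazards recovers the ratio of phi_k.
ratio_at_zero = (crude_hazard_k(0.0, 2.0, 1.0, DEPENDENT)
                 / crude_hazard_k(0.0, 1.0, 1.0, DEPENDENT))

# Stage 2: for t > 0, varying phi_k2 moves the crude hazard into k
# only if V_k and V_k2 are dependent.
shift_independent = (crude_hazard_k(1.0, 1.0, 3.0, INDEPENDENT)
                     - crude_hazard_k(1.0, 1.0, 1.0, INDEPENDENT))
shift_dependent = (crude_hazard_k(1.0, 1.0, 3.0, DEPENDENT)
                   - crude_hazard_k(1.0, 1.0, 1.0, DEPENDENT))
```

In this example `ratio_at_zero` equals the ratio of the two values of $\phi_{jk}$, `shift_independent` is zero up to rounding, and `shift_dependent` is clearly nonzero. This mirrors the argument that independent variation in the regressor functions reveals whether the competing risks are dependent.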
3.3. Dynamic Selection
The event-history approach to program evaluation does not rely on random variation in the initial state, but instead exploits variation in the states that arises in the course of the event history due to transitions. There is a close connection to the selection-model literature: Dynamic selection into states is modeled jointly with outcomes by means of the mixed semi-Markov event-history model. This approach is particularly relevant in the important case in which there is no variation at all in the initial state, and $S_0$ is degenerate. Then, Assumption 4 is trivially satisfied, but there is no scope for contrasting transitions from different initial states. In this case, all variation in states arises dynamically according to the mixed semi-Markov transition process. Multiple spells ($L > 1$) are needed to compare transition rates from different origin states. A selection problem arises if the probability, conditional on $X, V$, that the state of interest is never occupied during the observed event history depends on $V$ for given $X$. This is typically the case if only a finite number $L$ of spells is observed, so that there is censoring.14 Without censoring, if $L = \infty$, it is often true that the state of interest is almost surely occupied at some point during the event history. Then, the sample of event histories’ first spells in the state of interest does not suffer from selection on unobservables; standard competing-risks results then give full identification. Similarly, if both the first and the second spell in a state of interest occur almost surely, the much stronger results for multispell competing-risks models of Abbring and Van den Berg (2003a) can be applied. Obviously, these results are of little empirical relevance, as we typically observe only a limited number of spells in any event history, and panel data are subject to other types of censoring.
However, they highlight that dynamic selection problems arise either because of restrictions on the event-history process that lead to selective occurrence of first (and higher) spells in given states, or because of limited observability of event histories, e.g., due to censoring. In the remainder of this section, we will analyze under what conditions the structure imposed on the mixed semi-Markov model in Section 2 suffices for identification of state dependence and treatment effects.
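The selection mechanism described above can be illustrated with a small simulation. This is a minimal sketch under assumptions of our own, not the chapter's model: a single initial state, two exponential competing risks (entry into a program with hazard V, exit to a job with hazard 1), a two-point frailty V, and an event history censored after the first transition. The function name `simulate_first_transition` and all parameter values are hypothetical.

```python
import random

def simulate_first_transition(n=100_000, seed=0):
    """Competing exponential risks out of a single initial state.

    Hypothetical setup (not from the chapter): each person has a frailty V
    equal to 0.5 or 2.0 with equal probability; the hazard of entering the
    program is V and the hazard of leaving for a job is 1.
    """
    rng = random.Random(seed)
    v_all, v_entrants = [], []
    for _ in range(n):
        v = rng.choice([0.5, 2.0])
        t_program = rng.expovariate(v)    # latent time until program entry
        t_job = rng.expovariate(1.0)      # latent time until job exit
        v_all.append(v)
        if t_program < t_job:             # program state is reached first
            v_entrants.append(v)
    return v_all, v_entrants

v_all, v_entrants = simulate_first_transition()
mean_all = sum(v_all) / len(v_all)
mean_entrants = sum(v_entrants) / len(v_entrants)
print(f"mean V in the population:      {mean_all:.3f}")
print(f"mean V among program entrants: {mean_entrants:.3f}")
```

Because the probability of ever occupying the program state is V/(1+V) in this setup, it depends on V, and the frailty distribution among entrants is tilted toward high values even though V is independent of everything else in the population.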
The Event-History Approach to Program Evaluation
3.4. Identification from Censored Event Histories

Let $j \in \mathcal{S}$ be accessible from $S_0$, so that $\Pr(S_l = j) > 0$ for some $l \ge 0$, and let $k \in \mathcal{Z}(j)$. Consider the identification of the determinants of $\theta_{jk}$. Let $L(j) := \min\{l \ge 0 : \Pr(S_l = j) > 0\}$ be the smallest number of transitions from $S_0$ through which $j$ is accessible. Then, we need data on at least $L(j)+1$ spells to identify the determinants of $\theta_{jk}$. Take some $u_0, u_1, \ldots, u_{L(j)+1}$ such that $\Pr(S_0 = u_0, \ldots, S_{L(j)+1} = u_{L(j)+1}) > 0$, $u_{L(j)} = j$, and $u_{L(j)+1} = k$. We approach the identification of the determinants of $\theta_{jk}$ by considering the identification of the determinants of

\[ \Pr(S_0 = u_0, \ldots, S_{L(j)+1} = u_{L(j)+1}, \Delta T_1 > t_1, \ldots, \Delta T_{L(j)+1} > t_{L(j)+1} \mid X, V) \]

for $(t_1, \ldots, t_{L(j)+1}) \in \mathbb{R}_+^{L(j)+1}$. In turn, these can be expressed in terms of the transition intensities corresponding to the origin–destination pairs in $\Xi := \{(s, s') \in \mathcal{S}^2 : s \in \{u_0, \ldots, u_{L(j)}\}, s' \in \mathcal{Z}(s)\}$. Note that, along with the identification of $\theta_{jk}$'s determinants, we consider identification of the selection process into the state $j$. This selection process is modeled as a repeated competing-risks model, and its identification will again exploit results from the competing-risks literature. We make the following assumptions.

Assumption 5. $(V_p, p \in \Xi) \perp\!\!\!\perp X \mid S_0$.

Assumption 6. For all $s \in \mathcal{Z}(u_0)$, $E[V_{u_0 s}] < \infty$. For all $s \in \mathcal{Z}(u_l)$, $E[V_{u_0 u_1} \cdots V_{u_{l-1} u_l} V_{u_l s}] < \infty$, $l = 1, \ldots, L(j)$.

These assumptions generalize Assumptions 1 and 2 to the case of multiple transitions, but for a single initial state $u_0$. Assumption 5 requires that the unobserved factors relevant to the chosen path from $u_0$ to $k$, including those corresponding to the pairs in $\mathcal{Z}(u_0)$, are jointly independent of $X$, given $S_0$. Assumption 6 again facilitates inference on regressor effects at short durations. The higher moments appear in probabilistic expressions involving histories with multiple short spells. Together with Assumption 5, it allows us to derive the following result.

Proposition 2. If Assumptions 5 and 6 are satisfied, then $(\phi_p, p \in \Xi)$ is identified from the distribution of $\{(\Delta T_l, S_l), l = 1, \ldots, L(j)+1\} \mid S_0, X$.

Proof. See the appendix.

With Assumption 4 and a generalization of Assumption 3 to the case of multiple transitions, we can extend this result to identification of the full
model. Let $\Xi(u_l) := \{u_l\} \times \mathcal{Z}(u_l)$ be the set of all origin–destination pairs with origin $u_l$.

Assumption 7. For $l = 0, \ldots, L(j)$, the set

\[ \{(\Lambda_p(t_{i+1}) \phi_p(x); p \in \Xi(u_i), i = 0, \ldots, l); \; x \in \mathcal{X}, \; (t_1, \ldots, t_l) \in \mathbb{R}_+^l, \; t_{l+1} = t^*\} \]

contains a nonempty open set in $(0, \infty)^{\sum_{i=0}^{l} \#\Xi(u_i)}$.

A sufficient condition for Assumption 7 is that the range $\{(\phi_p(x), p \in \Xi), x \in \mathcal{X}\}$ of the regressor functions contains a nonempty open set in $(0, \infty)^{\#\Xi}$. If $\phi_p(x) = \exp(x'\beta_p)$, it would be sufficient that $(\beta_p, p \in \Xi)$ has full column rank and $\mathcal{X}$ contains a nonempty open set in $\mathbb{R}^q$. However, Assumption 7 is substantially weaker, as it allows us to substitute variation in the durations of previous spells for regressor variation. We have the following result.

Proposition 3. If Assumptions 4–7 are satisfied, then $((\phi_p, \Lambda_p), p \in \Xi)$ and the joint distribution of $(V_p, p \in \Xi)$ are identified from the distribution of $\{(\Delta T_l, S_l), l = 1, \ldots, L(j)+1\} \mid S_0, X$.

Proof. See the appendix.

The model for the first $L(j)$ transitions is a repeated MPH competing-risks model, and the proof of Proposition 3 iteratively applies an identification strategy similar to that of Abbring and Van den Berg (2003a) for the competing-risks model. In particular, identification of the determinants of the first transition from $u_0$ is proved analogously to Proposition 1 (or, similarly, Abbring & Van den Berg, 2003a, Proposition 2). With this in hand, we can proceed to identification of the determinants of the transition from $u_1$, exploiting knowledge of the determinants of the first transition, etc. Proposition 3 establishes identifiability of the determinants of $\theta_{jk}$ from event histories that include only a single spell in state $j$. If we have data on sufficiently long event histories, we may be able to observe multiple spells in state $j$. The literature on the identifiability of multispell duration models (Honoré, 1993; Abbring & Van den Berg, 2003a, 2003b) suggests that many of our assumptions, including the proportional-hazards assumption, can be relaxed in this case. We will not further pursue this here.
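To see why short durations isolate the regressor effects, it may help to write out the first step of the argument for a single origin–destination pair $p$ with origin $u_0$. The following is a sketch under stated assumptions rather than the appendix proof itself: we assume the MPH form $\theta_p(t \mid X, V) = \lambda_p(t)\phi_p(X)V_p$ with integrated baseline hazard $\Lambda_p(t) = \int_0^t \lambda_p(\tau)\,d\tau$, a reference point $x^*$ with $\phi_p(x^*) = 1$, and we use $f_p(t \mid x)$ as our own shorthand for the sub-density of a first transition along $p$ at duration $t$. Sums over $p'$ run over all pairs with origin $u_0$.

```latex
\begin{align*}
f_p(t \mid x) &= \lambda_p(t)\,\phi_p(x)\,
  E\Big[V_p \exp\Big(-\sum_{p'} \Lambda_{p'}(t)\,\phi_{p'}(x)\,V_{p'}\Big)\Big], \\[4pt]
\frac{f_p(t \mid x)}{f_p(t \mid x^*)} &= \phi_p(x)\,
  \frac{E\big[V_p \exp\big(-\sum_{p'} \Lambda_{p'}(t)\,\phi_{p'}(x)\,V_{p'}\big)\big]}
       {E\big[V_p \exp\big(-\sum_{p'} \Lambda_{p'}(t)\,\phi_{p'}(x^*)\,V_{p'}\big)\big]}
  \;\longrightarrow\; \phi_p(x)\,\frac{E[V_p]}{E[V_p]} = \phi_p(x)
  \quad \text{as } t \downarrow 0 .
\end{align*}
```

Independence of the unobservables from $X$ is what allows the same expectation operator to appear at $x$ and $x^*$, and finiteness of $E[V_p]$ is what makes the limit well defined (by dominated convergence, since $\Lambda_{p'}(t) \to 0$). The baseline hazard cancels in the ratio, so no knowledge of $\lambda_p$ is needed at this step.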
4. ALTERNATIVE SAMPLING SCHEMES We have analyzed identifiability of the mixed semi-Markov event-history model from a random sample of censored event histories. In empirical practice, we often have to deal with alternative, more complex sampling schemes. We distinguish two cases: inflow sampling and stock sampling (e.g., Lancaster, 1990, ch. 8). With inflow sampling, we sample from the flow into a given subset of states during some time interval. Due to dynamic selection, the distribution of X,V in an inflow sample is typically not the same as its distribution in the population. With stock sampling, we sample from the stock in a given set of states at a certain moment in time. Again, the distribution of X,V in a stock sample is generally not the same as its population distribution. Moreover, conditional on X,V, the distribution of the spells ongoing at the sampling date will not be the population distribution in this case either. Various ways to model inflow and stock samples have been proposed in the literature (see, e.g., Heckman & Singer, 1984a, 1986; Lancaster, 1990). In the case of inflow sampling, we could replace Assumption 5 by an ad hoc assumption on the distribution of the covariates in the inflow. We could make similar ad hoc assumptions in the stock-sampling case, together with an assumption on the historical development of the inflow into the states from which we sample. One common assumption is that the inflow has been constant over time. As Lancaster (1990, Section 8.4.2) points out, the common ad hoc assumptions on the distribution of the covariates are likely to be mutually inconsistent between the various sampling schemes. This suggests that we only make assumptions on the population, such as Assumption 5, and derive the distributions of the various samples from the population model. In general, this is hard because the distributions of the inflow and stock samples depend on that of the full event history, including its initial conditions T0,S0. 
An elegant solution is to assume that the samples are drawn from the event histories' long-run equilibrium distributions. This is only appropriate in carefully selected applications, and requires ergodicity of the semi-Markov model for the individual event histories. The resulting models for the inflow and stock samples are easy to derive and handle (Lancaster, 1990). In general, they involve dependent observable and unobservable covariates even under Assumption 5. However, this dependence is very tightly structured. We conjecture that this structure can be exploited to prove identifiability under conditions not unlike those for the case of sampling from the population. This is a topic for future research.
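The way stock sampling distorts both the distribution of V and the distribution of ongoing spells can be illustrated in a deliberately simplified single-spell setting. The sketch below is not the chapter's semi-Markov model: it assumes a constant inflow of spells over a fixed window, exponential spell lengths with hazard V, and a two-point distribution for V. The names `simulate_stock_sample` and `mean` and all numerical values are hypothetical.

```python
import random

def simulate_stock_sample(n=200_000, horizon=100.0, sample_date=50.0, seed=1):
    """Constant inflow on [0, horizon]; spell length is exponential with rate V.

    Hypothetical setup (not from the chapter): V is 0.5 or 2.0 with equal
    probability among all spells, and the stock sample consists of the spells
    that are in progress at sample_date.
    """
    rng = random.Random(seed)
    all_spells, stock = [], []
    for _ in range(n):
        start = rng.uniform(0.0, horizon)
        v = rng.choice([0.5, 2.0])
        duration = rng.expovariate(v)
        all_spells.append((v, duration))
        if start <= sample_date < start + duration:   # ongoing at the sampling date
            stock.append((v, duration))
    return all_spells, stock

def mean(values):
    values = list(values)
    return sum(values) / len(values)

all_spells, stock = simulate_stock_sample()
mean_v_all = mean(v for v, _ in all_spells)
mean_v_stock = mean(v for v, _ in stock)
mean_dur_all = mean(d for _, d in all_spells)
mean_dur_stock = mean(d for _, d in stock)
print(f"mean V:        all spells {mean_v_all:.2f}, stock sample {mean_v_stock:.2f}")
print(f"mean duration: all spells {mean_dur_all:.2f}, stock sample {mean_dur_stock:.2f}")
```

Long spells are more likely to be in progress at any given date, so the stock sample over-represents low-hazard individuals and, even conditional on V, over-represents long spells. This is the sense in which neither distribution in a stock sample matches its population counterpart.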
5. CONCLUDING REMARKS

This paper reviews the use of event-history models to simultaneously model dynamic selection into programs and the causal effects of the participation in such programs on event-history outcomes. A leading example is the analysis of the effects of training programs for the unemployed on their unemployment durations and subsequent job stability. We have provided novel identification results for a particular class of event-history models with a mixed semi-Markov structure. In doing so, we have highlighted and exploited the central role of dependent competing-risks models. We have focused on identification of causal and selection effects from “ideal”, large data sets and have ignored sampling variation. Therefore, our results cannot be implemented directly in empirical practice. Instead, they explore the logical limits on what we can reasonably expect to learn about causal effects of dynamic programs from observational data. As such, they can guide the empirical analysis of causal program effects using appropriate event-history models.
ACKNOWLEDGMENT This research was supported by a fellowship of the Royal Netherlands Academy of Arts and Sciences (KNAW).
NOTES

1. See Heckman and Vytlacil (2007) for a review of the program evaluation literature, Abbring and Heckman (2007, 2008) for studies of the microeconometric treatment-effect and structural approaches to dynamic policy evaluation, and Abbring and Van den Berg (2004) for a discussion of the relation between the event-history approach to program evaluation and standard latent-variable and panel-data methods.
2. We restrict attention to time-invariant observed covariates for expositional convenience. The analysis can easily be adapted to more general time-varying external covariates. Restricting attention to time-constant regressors is a worst-case scenario for identification: External time variation in observed covariates aids identification (Heckman & Taber, 1994).
3. Abbring and Van den Berg (2003b) make their model's causal structure explicit in a potential-outcomes model of the causal effects of a treatment time on an outcome duration. Abbring (2003) and Abbring and Heckman (2007, 2008) present the symmetric extension of this model, a nonparametric structural bivariate duration model allowing for simultaneous causal dependence of both durations. Extending this further to the general event-history setup adds a lot of complexity, but little extra insight.
4. Throughout the chapter, we assume that $\mathcal{Z}$ is known. It is important to note, however, that $\mathcal{Z}$ can actually be identified trivially in all the cases considered.
5. The MPH model is an extension of the Cox (1972) proportional hazard model by Lancaster (1979) and Vaupel, Manton, and Stallard (1979).
6. This can easily be relaxed, but at the expense of some extra notation and technical conditions.
7. Note that, in addition, the survivors in O themselves are a selected subpopulation: Because V affects survival in O, the distribution of V among survivors in O is not equal to its population distribution.
8. Our results can be adapted to other common censoring schemes, such as censoring at some nonrandom finite time. See Andersen et al. (1993) for an overview of censoring schemes.
9. Ridder and Woutersen (2003) prove semiparametric identification of a single-spell MPH model under an alternative assumption on the baseline hazard that is equally substantial.
10. Abbring and Van den Berg study the case with two risks, but the extension to more than two risks is trivial.
11. Abbring and Van den Berg (2005) discuss the definition of treatment effects in duration models. They argue that the usual treatment effects defined in terms of the distributions of potential outcome durations confound effects on individual hazard rates and effects that operate through dynamic selection. Recursive economic models often primarily predict effects on individual hazard rates, and semiparametric structure, such as the MPH model, is needed to identify such effects. Here, we do not explicitly address this issue.
12. See, e.g., Heckman, LaLonde, and Smith (1999) for an overview.
13. In addition, semiparametric structure will be important if one is interested in treatment effects on individual hazard rates. See Footnote 11.
14. Abbring and Van den Berg (2005) relate a similar argument for the case of simple random censoring to the dynamic selection problems studied by Ham and LaLonde (1996).
REFERENCES Abbring, J. H. (2002). Stayers versus defecting movers: A note on the identification of defective duration models. Economics Letters, 74, 327–331. Abbring, J. H. (2003). Dynamic econometric program evaluation. Discussion Paper no. 804. IZA, Bonn. Paper prepared for the H. Theil Memorial Conference, Amsterdam, August 16–18, 2002. Abbring, J. H. (2007). Mixed hitting-time models. Discussion Paper no. 07–057/3. Tinbergen Institute, Amsterdam. Abbring, J. H., & Heckman, J. J. (2007). Econometric evaluation of social programs, Part III: Distributional treatment effects, dynamic treatment effects, dynamic discrete choice, and general equilibrium policy evaluation. In: J. J. Heckman & E. E. Leamer (Eds), Handbook of econometrics (Vol. 6B, Ch. 72). Amsterdam: Elsevier.
Abbring, J. H., & Heckman, J. J. (2008). Dynamic policy analysis. In: L. Mátyás & P. Sevestre (Eds), The econometrics of panel data (3rd ed., Ch. 24). Dordrecht: Springer. Abbring, J. H., & Van den Berg, G. J. (2003a). The identifiability of the mixed proportional hazards competing risks model. Journal of the Royal Statistical Society Series B, 65, 701–710. Abbring, J. H., & Van den Berg, G. J. (2003b). The non-parametric identification of treatment effects in duration models. Econometrica, 71, 1491–1517. Abbring, J. H., & Van den Berg, G. J. (2004). Analyzing the effect of dynamically assigned treatments using duration models, binary treatment models, and panel data models. Empirical Economics, 29, 5–20. Abbring, J. H., & Van den Berg, G. J. (2005). Social experiments and instrumental variables with duration outcomes. Discussion Paper no. 05–047/3. Tinbergen Institute, Amsterdam. Abbring, J. H., Van den Berg, G. J., & Van Ours, J. C. (2005). The effect of unemployment insurance sanctions on the transition rate from unemployment to employment. Economic Journal, 115, 602–630. Andersen, P. K., Borgan, Ø., Gill, R. D., & Keiding, N. (1993). Statistical models based on counting processes. New York: Springer-Verlag. Bonnal, L., Fougère, D., & Sérandon, A. (1997). Evaluating the impact of French employment policies on individual labour market histories. Review of Economic Studies, 64, 683–713. Carathéodory, C. (1918). Vorlesungen über Reelle Funktionen. Leipzig: Teubner. Card, D., & Sullivan, D. (1988). Measuring the effect of subsidized training programs on movements in and out of employment. Econometrica, 56, 497–530. Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society Series B, 34, 187–220. Eberwein, C., Ham, J. C., & LaLonde, R. J. (1997). The impact of being offered and receiving classroom training on the employment histories of disadvantaged women: Evidence from experimental data.
Review of Economic Studies, 64, 655–682. Elbers, C., & Ridder, G. (1982). True and spurious duration dependence: The identifiability of the proportional hazard model. Review of Economic Studies, 49, 403–409. Gritz, R. M. (1993). The impact of training on the frequency and duration of employment. Journal of Econometrics, 57, 21–51. Ham, J. C., & LaLonde, R. J. (1996). The effect of sample selection and initial conditions in duration models: Evidence from experimental data on training. Econometrica, 64, 175–205. Heckman, J. J. (1981a). Heterogeneity and state dependence. In: S. Rosen (Ed.), Studies in labor markets (pp. 91–139). Chicago: University of Chicago Press. Heckman, J. J. (1981b). The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete data stochastic process and some Monte Carlo evidence. In: C. Manski & D. McFadden (Eds), Structural analysis of discrete data with econometric applications (pp. 179–195). Cambridge: MIT Press. Heckman, J. J., & Borjas, G. J. (1980). Does unemployment cause future unemployment? Definitions, questions and answers from a continuous time model of heterogeneity and state dependence. Economica, 47(187), 247–283. Special issue on unemployment. Heckman, J. J., & Honoré, B. E. (1989). The identifiability of the competing risks model. Biometrika, 76(2), 325–330.
Heckman, J. J., LaLonde, R. J., & Smith, J. A. (1999). The economics and econometrics of active labor market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. 3A, Ch. 31, pp. 1865–2097). New York: North-Holland. Heckman, J. J., & Singer, B. S. (1984a). Econometric duration analysis. Journal of Econometrics, 24(1–2), 63–132. Heckman, J. J., & Singer, B. S. (1984b). The identifiability of the proportional hazard model. Review of Economic Studies, 51(2), 231–243. Heckman, J. J., & Singer, B. S. (1986). Econometric analysis of longitudinal data. In: Z. Griliches & M. D. Intriligator (Eds), Handbook of econometrics (Vol. 3, Ch. 29, pp. 1690–1763). Amsterdam: North-Holland. Heckman, J. J., & Taber, C. (1994). Econometric mixture models and more general models for unobservables in duration analysis. Statistical Methods in Medical Research, 3(3), 279–299. Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, Part I: Causal models, structural models and econometric policy evaluation. In: J. J. Heckman & E. E. Leamer (Eds), Handbook of econometrics (Vol. 6B, Ch. 70). Amsterdam: Elsevier. Honoré, B. E. (1993). Identification results for duration models with multiple spells. Review of Economic Studies, 60, 241–246. Keiding, N. (1999). Event history analysis and inference from observational epidemiology. Statistics in Medicine, 18, 2353–2363. Kortram, R., Lenstra, A., Ridder, G., & Van Rooij, A. (1995). Constructive identification of the mixed proportional hazards model. Statistica Neerlandica, 49, 269–281. Lancaster, T. (1979). Econometric methods for the duration of unemployment. Econometrica, 47, 939–956. Lancaster, T. (1990). The econometric analysis of transition data. Cambridge: Cambridge University Press. Meyer, B. (1990). Unemployment insurance and unemployment spells. Econometrica, 58, 757–782. Ridder, G. (1986). An event history approach to the evaluation of training, recruitment and employment programmes.
Journal of Applied Econometrics, 1, 109–126. Ridder, G. (1990). The non-parametric identification of generalized accelerated failure-time models. Review of Economic Studies, 57, 167–182. Ridder, G., & Woutersen, T. (2003). The singularity of the information matrix of the mixed proportional hazard model. Econometrica, 71, 1579–1589. Tsiatis, A. A. (1975). A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences, 72, 20–22. Van den Berg, G. J. (2001). Duration models: Specification, identification, and multiple durations. In: J. J. Heckman & E. E. Leamer (Eds), Handbook of econometrics (Vol. 5, Ch. 55, pp. 3381–3460). Amsterdam: North-Holland. Van den Berg, G. J., Holm, A., & Van Ours, J. C. (2002). Do stepping-stone jobs exist? Early career paths in the medical profession. Journal of Population Economics, 15, 647–665. Van den Berg, G. J., Van der Klaauw, B., & Van Ours, J. C. (2004). Punitive sanctions and the transition rate from welfare to work. Journal of Labor Economics, 22, 211–241. Vaupel, J. W., Manton, K. G., & Stallard, E. (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography, 16, 439–454. Widder, D. V. (1946). The Laplace transform. Princeton: Princeton University Press.
APPENDIX

AN AUXILIARY RESULT

The proofs use a result for completely monotone functions. Completely monotone functions are frequently encountered in statistical duration analysis in the form of (derivatives of) Laplace transforms. They are formally defined as follows.

Definition 1. Let $\Omega$ be a nonempty open set in $\mathbb{R}^n$. A function $f : \Omega \to \mathbb{R}$ is absolutely monotone if it is nonnegative and has nonnegative continuous partial derivatives of all orders. $f$ is completely monotone if $f \circ m$ is absolutely monotone, where $m : x \in \{\omega \in \mathbb{R}^n : -\omega \in \Omega\} \mapsto -x$.

Note that for $n = 1$, this definition reduces to the familiar definition in Widder (1946). Abbring and Van den Berg (2003a, Proposition 1) state the following result.

Proposition 4. Let $C$ be a nonempty open connected set in $\mathbb{R}^n$, and $f : C \to \mathbb{R}$ and $g : C \to \mathbb{R}$ be completely monotone. If $f$ and $g$ agree on a nonempty open set in $C$, then $f = g$.

The proof of Proposition 4 exploits two facts that are well known for functions on $\mathbb{R}$ and that are also true for functions on $\mathbb{R}^n$: (i) completely monotone functions are real analytic and (ii) real analytic functions are uniquely determined by their values on a nonempty open set.
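A one-dimensional example may make the definition concrete. As an illustration of our own choosing (it is not taken from the chapter), let $V$ have a unit exponential distribution and let $\mathcal{L}$ denote its Laplace transform:

```latex
\begin{align*}
\mathcal{L}(s) &= E\big[e^{-sV}\big] = \int_0^\infty e^{-sv} e^{-v}\,dv = \frac{1}{1+s},
  \qquad s > -1, \\[4pt]
(-1)^k\,\mathcal{L}^{(k)}(s) &= \frac{k!}{(1+s)^{k+1}} \;\ge\; 0,
  \qquad k = 0, 1, 2, \ldots
\end{align*}
```

The derivatives alternate in sign, so $x \mapsto \mathcal{L}(-x) = 1/(1-x)$ has nonnegative derivatives of all orders on $(-\infty, 1)$ and $\mathcal{L}$ is completely monotone in the sense of Definition 1. In line with Proposition 4, knowing $\mathcal{L}$ on any open interval, however short, pins down the function $1/(1+s)$ on its whole domain, and with it the distribution of $V$.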
PROOFS

For $l \ge 0$, let $\mathcal{S}_0^l$ denote the support of $(S_0, S_1, \ldots, S_l)$. As an extension of Tsiatis (1975), we can represent the “data” of the identification analysis of Section 3.4, the distribution of $\{(\Delta T_l, S_l), l = 1, \ldots, L(j)+1\} \mid S_0, X$, by a collection $\{Q^{L(j)+1}_{\mathbf{s}}(\cdot \mid x); \mathbf{s} \in \mathcal{S}_0^{L(j)+1}, x \in \mathcal{X}\}$ such that $Q^{L(j)+1}_{\mathbf{s}}(\mathbf{t} \mid X)$ equals

\[ \Pr(\Delta T_1 > t_1, S_1 = s_1, \ldots, \Delta T_{L(j)+1} > t_{L(j)+1}, S_{L(j)+1} = s_{L(j)+1} \mid X, S_0 = s_0) \]

almost surely, for all $\mathbf{s} := (s_0, \ldots, s_{L(j)+1}) \in \mathcal{S}_0^{L(j)+1}$ and $\mathbf{t} := (t_1, \ldots, t_{L(j)+1}) \in \mathbb{R}_+^{L(j)+1}$. From these data, we can derive $Q^l_{s_0^l}$, defined analogously to $Q^{L(j)+1}_{\mathbf{s}}$, and

\[ R^l_{s_0^{l-1}} := \sum_{s \in \mathcal{Z}(s_{l-1})} Q^l_{(s_0^{l-1}, s)}, \]

for $l = 1, 2, \ldots, L(j)+1$; $s_0^{l-1} := (s_0, \ldots, s_{l-1}) \in \mathcal{S}_0^{l-1}$, and $s_0^l \in \mathcal{S}_0^l$. Note that $R^l_{s_0^{l-1}}(t_1^l \mid X)$ equals

\[ \Pr(\Delta T_1 > t_1, S_1 = s_1, \ldots, \Delta T_{l-1} > t_{l-1}, S_{l-1} = s_{l-1}, \Delta T_l > t_l \mid X, S_0 = s_0) \]

almost surely, for all $s_0^{l-1} \in \mathcal{S}_0^{l-1}$, $t_1^l := (t_1, \ldots, t_l) \in \mathbb{R}_+^l$, and $l = 1, 2, \ldots, L(j)+1$.
Proof of Proposition 2. The proof proceeds iteratively. First, consider identification of $\phi_p$ for $p \in \Xi(u_0)$. Pick an arbitrary $x \in \mathcal{X}$. Note that $Q^1_p(\cdot \mid x)$ and $Q^1_p(\cdot \mid x^*)$ are differentiable almost everywhere, and that

\[ \phi_p(x) = \lim_{t \downarrow 0} \frac{\partial Q^1_p(t \mid x)/\partial t}{\partial Q^1_p(t \mid x^*)/\partial t} \]

because of Assumptions 5 and 6 and the normalization $\phi_p(x^*) = 1$. Because $x$ is arbitrary, this identifies $(\phi_p; p \in \Xi(u_0))$.

Next, iterate the following argument for $l = 1, \ldots, L(j)$. Supposing that $\phi_{u_0 u_1}, \ldots, \phi_{u_{l-1} u_l}$ are identified, consider the identification of $\phi_p$ for $p \in \Xi(u_l)$. Pick an arbitrary $x \in \mathcal{X}$. Note that $Q^{l+1}_{\mathbf{s}}(\cdot \mid x)$ and $Q^{l+1}_{\mathbf{s}}(\cdot \mid x^*)$ are differentiable almost everywhere, and that

\[ \phi_{u_0 u_1}(x) \cdots \phi_{u_{l-1} u_l}(x)\, \phi_{u_l s}(x) = \lim_{\mathbf{t} \downarrow 0} \frac{\partial^{l+1} Q^{l+1}_{\mathbf{s}}(\mathbf{t} \mid x)/\partial t_1 \cdots \partial t_{l+1}}{\partial^{l+1} Q^{l+1}_{\mathbf{s}}(\mathbf{t} \mid x^*)/\partial t_1 \cdots \partial t_{l+1}} \]

with $\mathbf{s} = (u_0, \ldots, u_l, s)$, $s \in \mathcal{Z}(u_l)$, and $\mathbf{t} = (t_1, \ldots, t_{l+1})$. Here, we have used Assumptions 5 and 6 and the normalizations $\phi_{u_0 u_1}(x^*) = \cdots = \phi_{u_{l-1} u_l}(x^*) = \phi_{u_l s}(x^*) = 1$. Because $x$ is arbitrary, this identifies $(\phi_p; p \in \Xi(u_l))$.

Proof of Proposition 3. $(\phi_p; p \in \Xi)$ is identified by Proposition 2. The remainder of the proof again proceeds iteratively. First, Assumptions 4 and 5 and the normalizations $\Lambda_p(t^*) = 1$, $p \in \Xi(u_0)$, imply that

\[ R^1_{u_0}(t^* \mid x) = \mathcal{L}_{\Xi(u_0)}(\phi_p(x); p \in \Xi(u_0)) \]

where $\mathcal{L}_{\Xi(u_0)}$ is the Laplace transform of the joint distribution of $(V_p; p \in \Xi(u_0))$. Note that $R^1_{u_0}(t^* \mid \cdot)$ and $(\phi_p; p \in \Xi(u_0))$ are identified at this point. So, by Assumption 7 we can trace out $\mathcal{L}_{\Xi(u_0)}$ on a nonempty open set in $(0, \infty)^{\#\Xi(u_0)}$. Because $\mathcal{L}_{\Xi(u_0)}$ is completely monotone, this identifies $\mathcal{L}_{\Xi(u_0)}$ by Proposition 4. By implication, $(D_p \mathcal{L}_{\Xi(u_0)}; p \in \Xi(u_0))$ is identified, with $D_p \mathcal{L}_{\Xi(u_0)}$ being the partial derivative of $\mathcal{L}_{\Xi(u_0)}$ with respect to the argument corresponding to $V_p$, $p \in \Xi(u_0)$. Pick an arbitrary $x$. For almost all $t \in (0, \infty)$

\[ \Lambda'_p(t) = \frac{\partial Q^1_p(t \mid x)/\partial t}{\phi_p(x)\, D_p \mathcal{L}_{\Xi(u_0)}(\Lambda_{p'}(t) \phi_{p'}(x); p' \in \Xi(u_0))}, \qquad p \in \Xi(u_0) \]

by Assumptions 4 and 5. These $\#\Xi(u_0)$ equations form a system of differential equations in $(\Lambda'_p; p \in \Xi(u_0))$, $(\Lambda_p; p \in \Xi(u_0))$, and $t$, with initial conditions $\Lambda_p(t^*) = 1$, $p \in \Xi(u_0)$, in the sense of Carathéodory (1918). Analogous to Abbring and Van den Berg (2003a, Proposition 2), this system can be shown to have a unique solution $(\Lambda_p; p \in \Xi(u_0))$ in terms of $(Q^1_p, D_p \mathcal{L}_{\Xi(u_0)}, \phi_p(x); p \in \Xi(u_0))$. Because the latter have been identified at this point, this establishes the identification of $(\Lambda_p; p \in \Xi(u_0))$.

Second, iterate the following argument for $l = 1, \ldots, L(j)$. Suppose that $(\Lambda_p; p \in \Xi(u_0^{l-1}))$ is identified, with $u_0^{l-1} := (u_0, \ldots, u_{l-1})$ and $\Xi(u_0^{l-1}) := \bigcup_{i=0}^{l-1} \Xi(u_i)$. By Assumptions 4 and 5, the $l$th partial derivative $D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}$ of $\mathcal{L}_{\Xi(u_0^l)}$ with respect to the arguments corresponding to $V_{u_0 u_1}, \ldots, V_{u_{l-1} u_l}$ satisfies

\[ D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}(\Lambda_p(t_{i+1}) \phi_p(x); p \in \Xi(u_i), i = 0, \ldots, l) = \frac{\partial^l R^{l+1}_{u_0^l}(t_1^{l+1} \mid x)/\partial t_1 \cdots \partial t_l}{\prod_{i=1}^{l} \Lambda'_{u_{i-1} u_i}(t_i)\, \phi_{u_{i-1} u_i}(x)} \]

for almost all $t_1^l \in \mathbb{R}_+^l$, all $t_{l+1} \in \mathbb{R}_+$, and all $x \in \mathcal{X}$. Because the right-hand side is known at this point, this identifies

\[ D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}(\Lambda_p(t_{i+1}) \phi_p(x); p \in \Xi(u_i), i = 0, \ldots, l) \]

for all $t_1^{l+1} \in \mathbb{R}_+^{l+1}$ and $x \in \mathcal{X}$. Moreover, with the normalizations $\Lambda_p(t^*) = 1$, $p \in \Xi(u_l)$, $(\Lambda_p(t_{i+1}) \phi_p(x); p \in \Xi(u_i), i = 0, \ldots, l)$ is identified for all $t_1^l \in \mathbb{R}_+^l$, $t_{l+1} = t^*$, and $x \in \mathcal{X}$ at this point. So, by Assumption 7, we can trace out $D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}$ on a nonempty open set in $(0, \infty)^{\#\Xi(u_0^l)}$. Because $(-1)^l D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}$ is completely monotone, this identifies $D_{u_0^l} \mathcal{L}_{\Xi(u_0^l)}$ by Proposition 4. By implication, $(D_{u_0^l s} \mathcal{L}_{\Xi(u_0^l)}; s \in \mathcal{Z}(u_l))$ is identified.

Pick an arbitrary $x$. Pick $t_i$ such that $\Lambda_{u_{i-1} u_i}$ is differentiable at $t_i$, $i = 1, \ldots, l$. For almost all $t_{l+1} \in (0, \infty)$

\[ \Lambda'_{u_l s}(t_{l+1}) = \left[ \phi_{u_l s}(x) \prod_{i=1}^{l} \Lambda'_{u_{i-1} u_i}(t_i)\, \phi_{u_{i-1} u_i}(x) \right]^{-1} \frac{\partial^{l+1} Q^{l+1}_{u_0^l s}(t_1^{l+1} \mid x)/\partial t_1 \cdots \partial t_{l+1}}{D_{u_0^l s} \mathcal{L}_{\Xi(u_0^l)}(\Lambda_p(t_{i+1}) \phi_p(x); p \in \Xi(u_i), i = 0, \ldots, l)}, \qquad s \in \mathcal{Z}(u_l) \tag{A.1} \]
by Assumptions 4 and 5. These $\#\Xi(u_l)$ equations again form a system of differential equations in the sense of Carathéodory (1918), now in $(\Lambda'_p; p \in \Xi(u_l))$, $(\Lambda_p; p \in \Xi(u_l))$, and $t_{l+1}$, with initial conditions $\Lambda_p(t^*) = 1$, $p \in \Xi(u_l)$. Standard theory can again be applied to show that this system has a unique solution $(\Lambda_p; p \in \Xi(u_l))$ in terms of

\[ \left( Q^{l+1}_{u_0^l s},\; D_{u_0^l s} \mathcal{L}_{\Xi(u_0^l)},\; \phi_{u_l s}(x) \prod_{i=1}^{l} \Lambda'_{u_{i-1} u_i}(t_i)\, \phi_{u_{i-1} u_i}(x);\; s \in \mathcal{Z}(u_l) \right). \]

Because the latter is identified at this point, this establishes the identification of $(\Lambda_p; p \in \Xi(u_l))$.

Finally, note that $\mathcal{L}_{\Xi}$ is identified by integrating $D_{u_0^{L(j)}} \mathcal{L}_{\Xi}$. In turn, $\mathcal{L}_{\Xi}$ identifies the joint distribution of $(V_p; p \in \Xi)$ by the uniqueness of the multivariate Laplace transform. $\square$
BAYESIAN ANALYSIS OF TREATMENT EFFECTS IN AN ORDERED POTENTIAL OUTCOMES MODEL

Mingliang Li and Justin L. Tobias

ABSTRACT

We describe a new Bayesian estimation algorithm for fitting a binary treatment, ordered outcome selection model in a potential outcomes framework. We show how recent advances in simulation methods, namely data augmentation, the Gibbs sampler and the Metropolis-Hastings algorithm can be used to fit this model efficiently, and also introduce a reparameterization to help accelerate the convergence of our posterior simulator. Conventional “treatment effects” such as the Average Treatment Effect (ATE), the effect of treatment on the treated (TT) and the Local Average Treatment Effect (LATE) are adapted for this specific model, and Bayesian strategies for calculating these treatment effects are introduced. Finally, we review how one can potentially learn (or at least bound) the non-identified cross-regime correlation parameter and use this learning to calculate (or bound) parameters of interest beyond mean treatment effects.
Modelling and Evaluating Treatment Effects in Econometrics Advances in Econometrics, Volume 21, 57–91 Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00003-5
MINGLIANG LI AND JUSTIN L. TOBIAS
1. INTRODUCTION

As evidenced by the vast literature dedicated to the issue, the problem of identifying and estimating the effects of “treatment” from observational data is of central importance to economics and the social sciences. As suggested by the articles appearing in this volume, there are many estimation strategies commonly employed in this literature, and the assumptions made in and issues emphasized by these various approaches can be quite distinct. For instance, some studies employ fully parametric models to conduct their analyses, arguing that the use of such models permits the estimation of a wide range of policy-relevant parameters,1 while others seek a more agnostic approach and thus pursue non-parametric or semiparametric techniques.2 Many empirical studies in this area argue that the most convincing way to surmount the problem of treatment endogeneity is to make use of cleverly chosen natural experiments or instrumental variables,3 while others are content to pursue more structural equation approaches where the role of the exclusion restriction is decidedly less important and the discussion surrounding the instrument is muted.4 Finally, as in econometrics generally, there are both Bayesian and Classical approaches for handling these types of models. In this paper we focus primarily on this last distinction and take up the case of Bayesian estimation of a particular type of treatment-response model. While Bayesian work on the analysis of treatment or causal effects has become more common in the econometrics literature (e.g., Vijverberg, 1993; Koop & Poirier, 1997; Li, 1998; Chib & Hamilton, 2000, 2002; Poirier & Tobias, 2003; Li, Poirier & Tobias, 2004), the use of such techniques continues to remain rare relative to Classical approaches. We do not aim to reduce this disparity by proselytizing at length in this paper about the merits of the Bayesian approach relative to Classical methods.
Instead, our goal is to review how a Bayesian might handle specifications, similar to the Roy (1951) model, which are commonly encountered in the treatment effect literature, to review some computational advances which should appeal to all researchers when faced with estimation of these types of models, to introduce an issue that is somewhat unique to the Bayesian literature on this topic and to provide new results on Bayesian estimation of a specific type of treatment effect model. We take up the particular case of a treatment-response model where treatment status is binary and the outcome of interest is ordered. To our knowledge, a discussion of this particular model is new to the Bayesian literature, though highly related models, including those of the binary treatment/continuous outcome and ordered treatment/binary outcome varieties, have appeared in Chib and Hamilton (2000). We present our model in a
Bayesian Analysis of Treatment Effects
potential outcomes framework and thus model both the observed outcome of the agent given her treatment choice as well as the potential or counterfactual outcome for that agent had she made a different treatment decision. We show how data augmentation (e.g., Tanner & Wong, 1987; Albert & Chib, 1993) in conjunction with the Gibbs sampler and Metropolis–Hastings algorithm (e.g., Casella & George, 1992; Tierney, 1994; Chib & Greenberg, 1995) can be used to fit this particular model efficiently, and also introduce a reparameterization to help accelerate the convergence of our posterior simulator. Several computational strategies which allow for non-Normality are also discussed, though not employed. Treatment effects similar in spirit to the Average Treatment Effect (ATE), the effect of treatment on the treated (TT) and the Local Average Treatment Effect (LATE)5 are adapted for the case of our ordered response, and Bayesian strategies for calculating these treatment effects are described. Finally, we discuss how one can potentially learn about (or at least bound) the non-identified cross-regime correlation parameter6 and use this learning to calculate (or bound) parameters of interest beyond mean treatment effects. The outline of this chapter is as follows. Section 2 presents the basic potential outcomes model and Section 3 discusses our Bayesian estimation algorithm. Often-reported treatment parameters such as ATE, TT and LATE are derived for our model in Section 4 and procedures for calculating these effects are described. A generated data experiment which illustrates the performance of our algorithm is provided in Section 5, and the paper concludes with a summary in Section 6.
2. THE MODEL

What we have in mind is the development of a parametric model that will enable researchers to investigate the impact of a binary (and potentially endogenous) treatment variable, denoted D, where D = 1 implies receipt of treatment and D = 0 implies non-receipt, on an ordered outcome of interest, denoted $y \in \{1, 2, \ldots, J\}$. There are numerous examples where such a model would be appropriate. For example, one might use this model to investigate the impact of enrolling in a supplemental learning center on attitudes toward education (measured as a categorical response) or the quantity of education ultimately received by the student. More generally, such a model is potentially of value in any situation where the outcome of interest (e.g., earnings, education, expenditure) is recorded categorically rather than continuously and the model also contains a dummy endogenous variable.$^{7}$
We cast this evaluation problem in a potential outcomes framework and thus explicitly model the counterfactual state: the ordered outcome that would have been observed had the agent made a different treatment decision. Let $y^{(1)}$ denote the outcome received by the agent in the treatment state and $y^{(0)}$ denote the outcome received without treatment. Only one outcome, denoted $y_i$, is ever observed for any agent, and thus

$$y_i = D_i y_i^{(1)} + (1 - D_i) y_i^{(0)}.$$

We suppose that the observed treatment decision D and the observed and potential ordered outcomes $y^{(1)}$ and $y^{(0)}$ are generated by an underlying latent variable representation of the model. Specifically, we write$^{8}$

$$D_i^* = w_i \beta^{(D)} + u_i \tag{1}$$

$$z_i^{(1)} = x_i \beta^{(1)} + \varepsilon_i^{(1)} \tag{2}$$

$$z_i^{(0)} = x_i \beta^{(0)} + \varepsilon_i^{(0)} \tag{3}$$

The binary treatment indicator $D_i$ is related to the latent $D_i^*$ as follows:

$$D_i = I(D_i^* > 0) = I[u_i > -w_i \beta^{(D)}] \tag{4}$$

with $I(\cdot)$ denoting the standard indicator function. When the dependence of the treatment decision D on covariates w is crucial to the discussion (as in Sections 4.2 and 4.3), we will typically express D as D(w) to make this dependence explicit. In a similar fashion the ordered responses $y_i^{(1)}$ and $y_i^{(0)}$ are related to the latent variables $z_i^{(1)}$ and $z_i^{(0)}$ as follows:

$$y_i^{(k)} = j \ \text{ iff } \ \alpha_j^{(k)} < z_i^{(k)} \le \alpha_{j+1}^{(k)}, \qquad k = 0, 1, \quad j = 1, 2, \ldots, J \tag{5}$$

The $\{\alpha_j^{(k)}\}$, $k = 0, 1$, $j = 1, 2, \ldots, J$, are cutpoints in the model, mapping the latent indices in both states into discrete values of our ordered response. We impose standard identification conditions on these cutpoints, namely, $\alpha_1^{(1)} = \alpha_1^{(0)} = -\infty$, $\alpha_2^{(1)} = \alpha_2^{(0)} = 0$ and $\alpha_{J+1}^{(1)} = \alpha_{J+1}^{(0)} = \infty$. We also let

$$\alpha^{(1)} = [\alpha_3^{(1)} \ \ \alpha_4^{(1)} \ \cdots \ \alpha_J^{(1)}]$$

denote the cutpoint vector for the treated state and define the $1 \times (J-2)$ vector $\alpha^{(0)}$ similarly. In this model we also assume the availability of an exclusion restriction: some covariate which enters w that is not contained in x. To motivate the
importance of this assumption, consider a restricted version of Eqs. (1)–(3) which consists of Eq. (1) and a latent variable equation like Eq. (2), the latter of which includes the observed $D_i$ as an element of $x_i$. This restricted model would be of the form of a "standard" treatment or causal effect model that only works with observed rather than potential outcomes. Maddala (1983, p. 122), for example, shows that the parameters of such a model are not identifiable unless the errors of the equation system are uncorrelated or such an exclusion restriction is present. The former condition often seems rather untenable in empirical practice, and thus we maintain that such an exclusion restriction is available. Finally, we fix ideas throughout the remainder of this discussion by assuming joint Normality of the error terms:$^{9}$

$$\begin{bmatrix} u_i \\ \varepsilon_i^{(1)} \\ \varepsilon_i^{(0)} \end{bmatrix} \Bigg|\, x_i, w_i \ \overset{iid}{\sim}\ N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho^{(1)} & \rho^{(0)} \\ \rho^{(1)} & 1 & \rho^{(10)} \\ \rho^{(0)} & \rho^{(10)} & 1 \end{bmatrix} \right) \equiv N(0, \Sigma) \tag{6}$$

Eqs. (1)–(6) then denote the complete specification of our ordered potential outcomes model.

2.1. The Likelihood

Given the assumed conditional independence across observations, we can write the likelihood function for this model as

$$p(y, D|\Gamma) \equiv L(\Gamma; y, D) = \left[ \prod_{i: D_i = 1} \Pr(y_i^{(1)} = y_i, D_i = 1|\Gamma) \right] \left[ \prod_{i: D_i = 0} \Pr(y_i^{(0)} = y_i, D_i = 0|\Gamma) \right]$$

where $\Gamma = [\beta^{(D)} \ \beta^{(1)} \ \beta^{(0)} \ \alpha^{(1)} \ \alpha^{(0)} \ \rho^{(0)} \ \rho^{(1)} \ \rho^{(10)}]$. The joint probabilities required in calculating this likelihood can be obtained from the bivariate Normal cdf. For example,

$$\Pr(y_i^{(1)} = y_i, D_i = 1|\Gamma) = \Pr(\alpha_{y_i}^{(1)} < z_i^{(1)} \le \alpha_{y_i+1}^{(1)},\ u_i > -w_i\beta^{(D)} \,|\, w_i, x_i, \Gamma) \tag{7}$$

$$= \Pr(\alpha_{y_i}^{(1)} - x_i\beta^{(1)} < \varepsilon_i^{(1)} \le \alpha_{y_i+1}^{(1)} - x_i\beta^{(1)},\ u_i > -w_i\beta^{(D)} \,|\, w_i, x_i, \Gamma) \tag{8}$$
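
As an illustration of Eq. (8), the following is a minimal sketch of how one such joint probability might be evaluated with a bivariate Normal cdf routine. It assumes SciPy is available; the function name `joint_prob_treated` and its arguments are hypothetical and are not part of the original chapter. The event $u_i > -w_i\beta^{(D)}$ is rewritten as $-u_i < w_i\beta^{(D)}$, which flips the sign of the correlation.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def joint_prob_treated(lower, upper, w_beta, rho):
    """Pr(lower < eps <= upper, u > -w_beta) for a standard bivariate
    Normal pair (eps, u) with correlation rho, as in Eq. (8).

    lower = alpha_{y_i} - x_i beta^(1), upper = alpha_{y_i + 1} - x_i beta^(1).
    Uses u > -w_beta  <=>  -u < w_beta, with corr(eps, -u) = -rho.
    """
    cov = np.array([[1.0, -rho], [-rho, 1.0]])
    bvn = multivariate_normal(mean=np.zeros(2), cov=cov)

    def cdf(e):
        # Handle the infinite outer cutpoints explicitly.
        if np.isneginf(e):
            return 0.0
        if np.isposinf(e):
            return float(norm.cdf(w_beta))
        return float(bvn.cdf(np.array([e, w_beta])))

    return cdf(upper) - cdf(lower)


# Illustrative values only (J = 3 categories).
cutpoints = np.array([-np.inf, 0.0, 0.8, np.inf])
xb, wb, rho1 = 0.3, 0.5, 0.4
probs = [joint_prob_treated(cutpoints[j] - xb, cutpoints[j + 1] - xb, wb, rho1)
         for j in range(3)]
```

Summing these probabilities over all outcome categories should recover $\Pr(D_i = 1) = \Phi(w_i\beta^{(D)})$, which offers a simple sanity check.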
Provided one uses a statistical package containing a routine for evaluating a bivariate Normal cdf, standard MLE can be implemented. If no such routine is available in a particular package, one could first reduce the probabilities above to univariate integration problems and then employ standard numerical approximations such as Simpson's rule or Gaussian quadrature to approximate the requisite integrals. To see this more clearly, let $P_{1,j} \equiv \Pr(D_i = 1, y_i^{(1)} = j|\Gamma)$, and note from Eq. (8)

$$P_{1,j} = \Pr(\alpha_j^{(1)} - x_i\beta^{(1)} < \varepsilon_i^{(1)} \le \alpha_{j+1}^{(1)} - x_i\beta^{(1)},\ u_i > -w_i\beta^{(D)} \,|\, x_i, w_i, \Gamma)$$

$$= \int_{\alpha_j^{(1)} - x_i\beta^{(1)}}^{\alpha_{j+1}^{(1)} - x_i\beta^{(1)}} \int_{-w_i\beta^{(D)}}^{\infty} p(\varepsilon_i^{(1)}, u_i)\, du_i\, d\varepsilon_i^{(1)}$$

$$= \int_{\alpha_j^{(1)} - x_i\beta^{(1)}}^{\alpha_{j+1}^{(1)} - x_i\beta^{(1)}} \Phi\left( \frac{w_i\beta^{(D)} + \rho^{(1)}\varepsilon_i^{(1)}}{\sqrt{1 - [\rho^{(1)}]^2}} \right) p(\varepsilon_i^{(1)})\, d\varepsilon_i^{(1)}$$
In this form a variety of approaches can be employed to approximate the required univariate integrals. In our discussion of treatment effects in Section 4, we will return to one approach to this problem based on Monte Carlo integration using truncated Normal sampling. Importantly, we also recognize that our estimation strategy via data augmentation, as described in the following section, avoids the need for any numerical integration of the above form, and therefore provides an attractive alternative to the implementation of standard MLE.
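
The univariate form of $P_{1,j}$ derived above can be approximated with any one-dimensional quadrature rule. The sketch below uses SciPy's adaptive `quad` routine in place of Simpson's rule or Gaussian quadrature; the function name `joint_prob_quad` is hypothetical and the numerical values are purely illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def joint_prob_quad(lower, upper, w_beta, rho):
    """Approximate P_{1,j}: integrate Phi((w_beta + rho*e)/sqrt(1-rho^2)) * phi(e)
    over e in (lower, upper), where lower = alpha_j - x*beta^(1) and
    upper = alpha_{j+1} - x*beta^(1)."""
    scale = np.sqrt(1.0 - rho ** 2)

    def integrand(e):
        return norm.cdf((w_beta + rho * e) / scale) * norm.pdf(e)

    value, _abs_err = quad(integrand, lower, upper)  # infinite limits are allowed
    return value


# Illustrative values: two adjacent intervals that together cover the real line.
p_low = joint_prob_quad(-np.inf, 0.2, 0.5, 0.4)
p_high = joint_prob_quad(0.2, np.inf, 0.5, 0.4)
```

Because the two intervals partition the real line, `p_low + p_high` should equal $\Phi(0.5)$ up to quadrature error.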
3. BAYESIAN ESTIMATION

To perform a Bayesian analysis, a researcher starts, as a classical econometrician might, by specifying the likelihood function for the model, as implied by Eqs. (1)–(6) and described in the preceding section. To this likelihood, the researcher adds a prior density, say $p(\Gamma)$, with $\Gamma$ denoting the parameters of the model. This prior is chosen to reflect her subjective beliefs about values of the parameters, and in most cases is chosen to be sufficiently "vague" or "flat" so that information contained in the data will dominate information introduced through the prior. The prior density $p(\Gamma)$ combined with the likelihood $p(y, D|\Gamma)$ yields the joint posterior density $p(\Gamma|y, D)$ up to proportionality via Bayes' theorem. This joint posterior completely summarizes the "output" of a Bayesian procedure: from it, one
can obtain point and interval estimates, marginal posterior densities, posterior quantiles, or other quantities of interest. While in theory this simple exercise outlines the machinery involved in Bayesian posterior calculations, in practice, extracting useful information from a given posterior $p(\Gamma|y, D)$ can be difficult. Direct calculation of a posterior mean of an element of $\Gamma$, for example, first requires that the normalizing constant of the joint posterior be known (often it is not), and even if the normalizing constant were known, the mean calculation would still require solving a high-dimensional integration problem. In models of moderate complexity, these integration problems usually have no analytic solutions. Instead of direct evaluation of this posterior, modern Bayesian empirical work makes use of recent advances in simulation methods to carry out a posterior analysis. Two simulation devices in particular, the Gibbs sampler and the Metropolis–Hastings algorithm, are widely used and have become indispensable instruments in an applied Bayesian's toolkit. Both of these algorithms solve the problem of calculating posterior moments, quantiles, marginal densities or other quantities of interest by first obtaining a set of draws from the posterior $p(\Gamma|y, D)$. Typically, one cannot draw directly from this density, but instead, one can generate a sequence of draws (by appropriately following the steps of the algorithms) that converges to this distribution. Once convergence has been "achieved," the subsequent set of simulated parameter values can be used to calculate the desired quantities (e.g., posterior means). In the Gibbs sampler, a Markov chain whose limiting distribution is $p(\Gamma|y, D)$ is produced by iteratively sampling from the complete posterior conditionals of the model. In many cases, typically in models with conditionally conjugate priors, these posterior conditionals have well-known forms and can be easily sampled.
The Metropolis–Hastings algorithm is a generalization of the Gibbs sampler and is a multivariate accept–reject algorithm. The algorithm is, again, constructed so that the limiting distribution of the Markov chain is the target density, $p(\Gamma|y, D)$.$^{10}$ In terms of the model described in this chapter, Bayesian estimation of the specification in Eqs. (1)–(6) would likely make use of data augmentation (e.g., Tanner & Wong, 1987; Albert & Chib, 1993) in conjunction with the algorithms above. When data augmentation is used, the posterior is first expanded (or, as the name suggests, augmented) to include not only the parameter vector $\Gamma$, but also the latent data $s = [D^* \ z^{(1)} \ z^{(0)}]$. Although this would seem to complicate the estimation exercise, use of data augmentation often simplifies the required posterior calculations. This is
particularly true when data augmentation is used in conjunction with the Gibbs sampler since, conditioned on the latent data, inference regarding the regression parameters proceeds as a linear regression model would, and given the regression parameters, it is often straightforward to obtain draws from the posterior conditional for the latent data. For our model, this augmented posterior is of the form:

$$p(D^*, z^{(1)}, z^{(0)}, \Gamma|y, D) \propto p(y, D, D^*, z^{(1)}, z^{(0)}, \Gamma) \tag{9}$$

$$= p(y, D|D^*, z^{(1)}, z^{(0)}, \Gamma)\, p(D^*, z^{(1)}, z^{(0)}|\Gamma)\, p(\Gamma) \tag{10}$$

with $p(\Gamma)$ denoting the prior for the parameters of our model. The middle term in the above expression is immediately known as a trivariate Normal density, given the joint Normality assumption in Eq. (6) combined with the model in Eqs. (1)–(3). The last term simply denotes the prior for our model parameters. For the first term, conditioned on the latent variables and model parameters, the observed responses D and y are known with certainty and thus the joint (conditional) distribution for y and D is degenerate. Putting these pieces together, and exploiting the assumed conditional independence across observations, we can write the augmented posterior as follows:

$$p(D^*, z^{(1)}, z^{(0)}, \Gamma|D, y) \propto p(\Gamma) \prod_{i=1}^{n} \phi_3(s_i; r_i\beta, \Sigma) \Big[ I(D_i = 1)\, I(D_i^* > 0)\, I(\alpha_{y_i}^{(1)} < z_i^{(1)} \le \alpha_{y_i+1}^{(1)}) + I(D_i = 0)\, I(D_i^* \le 0)\, I(\alpha_{y_i}^{(0)} < z_i^{(0)} \le \alpha_{y_i+1}^{(0)}) \Big] \tag{11}$$

where

$$s_i = \begin{bmatrix} D_i^* \\ z_i^{(1)} \\ z_i^{(0)} \end{bmatrix}, \qquad r_i = \begin{bmatrix} w_i & 0 & 0 \\ 0 & x_i & 0 \\ 0 & 0 & x_i \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta^{(D)} \\ \beta^{(1)} \\ \beta^{(0)} \end{bmatrix} \tag{12}$$

and $\phi_3(x; \mu, \Omega)$ denotes a trivariate Normal density with mean $\mu$ and covariance matrix $\Omega$. Finally, $\Sigma$ is defined in Eq. (6). The indicator functions added to Eq. (11) serve to capture the degenerate joint distribution of y and D given the latent data and model parameters.
3.1. A Useful Reparameterization

In theory, one could directly apply standard computational tools (namely the Gibbs sampler coupled with a few Metropolis-within-Gibbs steps) to fit the model in Eq. (11). However, it has been shown in related works (e.g., Cowles, 1996; Nandram & Chen, 1996; Li & Tobias, 2005) that use of the standard Gibbs sampler in models with ordered responses suffers from slow mixing due to high correlation between the simulated cutpoints and latent data. As discussed in the previous section, the parameter draws obtained from our estimation algorithm form a Markov chain, and when the chain mixes slowly, we observe only very small local movements from iteration to iteration. As a result, it may take a very long time for our simulator to traverse the entire parameter space. When the lagged autocorrelations between the simulated parameters are very high, estimates of posterior features may be quite inaccurate, and numerical standard errors associated with those estimates will be unacceptably large. To mitigate this slow mixing problem, and move closer to a situation where we can obtain iid samples from the posterior, we suggest below an alternate parameterization of the model, building on the suggestion of Nandram and Chen (1996). To shed some insight on this reparameterization, first separate out the largest cutpoints from the treated state, $\alpha_J^{(1)}$, and untreated state, $\alpha_J^{(0)}$, and define the transformations:

$$\sigma_1 = 1/[\alpha_J^{(1)}]^2 \qquad \text{and} \qquad \sigma_0 = 1/[\alpha_J^{(0)}]^2$$

In addition, for any variable $Q_i^{(1)}$ let $\tilde{Q}_i^{(1)} \equiv \sqrt{\sigma_1}\, Q_i^{(1)}$ and define $\tilde{Q}_i^{(0)} \equiv \sqrt{\sigma_0}\, Q_i^{(0)}$ similarly. The model in Eqs. (1)–(3) is then observationally equivalent to

$$D_i^* = w_i\beta^{(D)} + u_i \tag{13}$$

$$\tilde{z}_i^{(1)} = x_i\tilde{\beta}^{(1)} + \tilde{\varepsilon}_i^{(1)} \tag{14}$$

$$\tilde{z}_i^{(0)} = x_i\tilde{\beta}^{(0)} + \tilde{\varepsilon}_i^{(0)} \tag{15}$$

where

$$y_i^{(k)} = j \ \text{ iff } \ \tilde{\alpha}_j^{(k)} < \tilde{z}_i^{(k)} \le \tilde{\alpha}_{j+1}^{(k)}, \qquad k = 0, 1 \tag{16}$$

In other words, the likelihood function for the observed data is unchanged when multiplying Eqs. (2) and (3) by $\sqrt{\sigma_1}$ and $\sqrt{\sigma_0}$, respectively, and
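appropriately adjusting the rule in Eq. (16) which maps the latent data into the observed responses. The covariance matrix of the transformed disturbances now takes the following form:

$$\begin{bmatrix} u_i \\ \tilde{\varepsilon}_i^{(1)} \\ \tilde{\varepsilon}_i^{(0)} \end{bmatrix} \Bigg|\, x_i, w_i \ \overset{iid}{\sim}\ N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \tilde{\sigma}_{1D} & \tilde{\sigma}_{0D} \\ \tilde{\sigma}_{1D} & \sigma_1 & \tilde{\sigma}_{10} \\ \tilde{\sigma}_{0D} & \tilde{\sigma}_{10} & \sigma_0 \end{bmatrix} \right) \equiv N(0, \tilde{\Sigma}) \tag{17}$$

where $\tilde{\sigma}_{1D} \equiv \sqrt{\sigma_1}\,\rho^{(1)}$, $\tilde{\sigma}_{0D} \equiv \sqrt{\sigma_0}\,\rho^{(0)}$ and $\tilde{\sigma}_{10} = \sqrt{\sigma_1}\sqrt{\sigma_0}\,\rho^{(10)}$. The correlation parameters $\rho^{(1)}$, $\rho^{(0)}$ and $\rho^{(10)}$ are defined in Eq. (6). When the model is written as in Eqs. (13)–(17), it suggests that we can work with an augmented posterior distribution containing the latent variables $D^*, \tilde{z}^{(1)}, \tilde{z}^{(0)}$ and parameters $\tilde{\Gamma} = [\tilde{\beta} \ \tilde{\sigma}_{1D} \ \tilde{\sigma}_{0D} \ \tilde{\sigma}_{10} \ \sigma_1 \ \sigma_0 \ \tilde{\alpha}^{(1)} \ \tilde{\alpha}^{(0)}]$ instead of $D^*$, $z^{(1)}$, $z^{(0)}$ and $\Gamma$ as in Eq. (11). The transformed cutpoint and coefficient vectors contained in $\tilde{\Gamma}$ are defined as follows:$^{11}$

$$\tilde{\alpha}^{(1)} = [\tilde{\alpha}_3^{(1)} \ \ \tilde{\alpha}_4^{(1)} \ \cdots \ \tilde{\alpha}_{J-1}^{(1)}], \qquad \tilde{\alpha}^{(0)} = [\tilde{\alpha}_3^{(0)} \ \ \tilde{\alpha}_4^{(0)} \ \cdots \ \tilde{\alpha}_{J-1}^{(0)}], \qquad \text{and} \qquad \tilde{\beta} = [\beta^{(D)} \ \ \tilde{\beta}^{(1)} \ \ \tilde{\beta}^{(0)}]$$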
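
To make the mapping between the structural and transformed parameters concrete, the following sketch implements the rescaling and its inverse. The function names, dictionary keys and numerical values are hypothetical; the sketch assumes $J \ge 3$ and a strictly positive largest cutpoint, and takes the free structural cutpoints $[\alpha_3, \ldots, \alpha_J]$ as input for each state.

```python
import numpy as np


def to_transformed(beta1, beta0, alpha1, alpha0, rho1, rho0, rho10):
    """Structural -> transformed parameters.

    alpha1, alpha0 hold the free structural cutpoints [alpha_3, ..., alpha_J];
    the last entry (alpha_J) defines sigma_k = 1 / alpha_J**2.
    """
    alpha1, alpha0 = np.asarray(alpha1, float), np.asarray(alpha0, float)
    sigma1, sigma0 = 1.0 / alpha1[-1] ** 2, 1.0 / alpha0[-1] ** 2
    s1, s0 = np.sqrt(sigma1), np.sqrt(sigma0)
    return {
        "beta1": s1 * np.asarray(beta1, float),
        "beta0": s0 * np.asarray(beta0, float),
        # Transformed alpha_J equals one, so only alpha_3..alpha_{J-1} remain free.
        "alpha1": s1 * alpha1[:-1],
        "alpha0": s0 * alpha0[:-1],
        "sigma1": sigma1, "sigma0": sigma0,
        "sigma_1D": s1 * rho1, "sigma_0D": s0 * rho0,
        "sigma_10": s1 * s0 * rho10,
    }


def to_structural(t):
    """Transformed -> structural parameters (inverse of to_transformed)."""
    s1, s0 = np.sqrt(t["sigma1"]), np.sqrt(t["sigma0"])
    return {
        "beta1": t["beta1"] / s1, "beta0": t["beta0"] / s0,
        "alpha1": np.append(t["alpha1"], 1.0) / s1,
        "alpha0": np.append(t["alpha0"], 1.0) / s0,
        "rho1": t["sigma_1D"] / s1, "rho0": t["sigma_0D"] / s0,
        "rho10": t["sigma_10"] / (s1 * s0),
    }


# Illustrative round trip with J = 4 (free cutpoints alpha_3 and alpha_4).
t = to_transformed([0.5, -1.0], [0.2, 0.3], [0.7, 1.6], [0.4, 2.5], 0.3, -0.2, 0.5)
back = to_structural(t)
```

With $J = 3$ the arrays `t["alpha1"]` and `t["alpha0"]` are empty, which mirrors the observation that no unknown cutpoints remain in the transformed model in that case.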
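Following similar derivations to those leading to Eq. (11), we obtain the augmented joint posterior distribution for the transformed parameters:

$$p(D^*, \tilde{z}^{(1)}, \tilde{z}^{(0)}, \tilde{\Gamma}|D, y) \propto p(\tilde{\Gamma}) \prod_{i=1}^{n} \phi_3(\tilde{s}_i; r_i\tilde{\beta}, \tilde{\Sigma}) \Big[ I(D_i = 1)\, I(D_i^* > 0)\, I(\tilde{\alpha}_{y_i}^{(1)} < \tilde{z}_i^{(1)} \le \tilde{\alpha}_{y_i+1}^{(1)}) + I(D_i = 0)\, I(D_i^* \le 0)\, I(\tilde{\alpha}_{y_i}^{(0)} < \tilde{z}_i^{(0)} \le \tilde{\alpha}_{y_i+1}^{(0)}) \Big] \tag{18}$$

where $\tilde{s}_i \equiv [D_i^* \ \tilde{z}_i^{(1)} \ \tilde{z}_i^{(0)}]'$ and $\tilde{\Sigma}$ is defined in Eq. (17). When working with this model, we employ independent priors for the parameters of $\tilde{\Gamma}$:

$$p(\tilde{\Gamma}) = p(\tilde{\beta})\, p(\tilde{\alpha}^{(1)})\, p(\tilde{\alpha}^{(0)})\, p(\tilde{\Sigma})$$

We center the regression parameters around a prior mean of zero and specify them to be independently distributed with large prior variances: $\tilde{\beta} \sim N(b_0 = 0_{k \times 1}, V_{\tilde{\beta}} = 1000 I_k)$. The prior probability density function of $\tilde{\alpha}^{(1)}$ and $\tilde{\alpha}^{(0)}$ is assumed to be proportional to some constant: $p(\tilde{\alpha}_3^{(1)}, \ldots, \tilde{\alpha}_{J-1}^{(1)}, \tilde{\alpha}_3^{(0)}, \ldots, \tilde{\alpha}_{J-1}^{(0)}) \propto c$, and finally, an inverse Wishart prior of the form $\tilde{\Sigma} \sim IW(\rho, R)$ with $\rho = 6$, $R = I_3$ is employed subject to the restriction that the (1,1) element of $\tilde{\Sigma}$ is equal to one.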
3.2. Benefits and Costs of Reparameterization

To this point we have offered no compelling arguments why one should work with the reparameterized model instead of working directly with the original "structural" representation of the model. The first argument in support of the reparameterization, as noted in Nandram and Chen (1996), and further shown in Li and Tobias (2005), is that the rescaling helps to significantly reduce the autocorrelation among the posterior simulations, thus accelerating the convergence of the algorithm. In other words, given an equal number of posterior draws, the numerical standard errors obtained when working with the reparameterized model will be significantly smaller than those obtained when using the original parameterization of the model. Second, and quite importantly from a computational point of view, this transformation effectively "restores" the conjugacy required to simulate the parameters of the covariance matrix. That is, in the original parameterization of the model in Eq. (6), there are restrictions on all three diagonal elements of the covariance matrix. This precludes drawing the elements of the inverse covariance matrix from a Wishart distribution (as is typically the case when conjugate priors are employed), since the posterior conditional is no longer Wishart given the diagonal restrictions. On the other hand, in our reparameterized version of the model, the covariance matrix in Eq. (17) contains only one diagonal restriction, and using Algorithm 3 of Nobile (2000), one can generate draws from this restricted Wishart density. Thus, working with the reparameterized model facilitates simulating parameters of the covariance matrix, and no Metropolis–Hastings steps are required for this portion of the posterior simulator. Finally, for the specific case where there are three possible ordered outcomes (i.e., J = 3), there are effectively no unknown cutpoints in the transformed representation of this model. For this case, the cutpoints are sampled through standard sampling of the elements of the covariance matrix. For this particular model with three outcomes, posterior simulation using the reparameterized model is quite fast, and no Metropolis–Hastings steps are required at any point in the algorithm. For J > 3, however, additional Metropolis–Hastings steps are required to simulate elements of the cutpoint vectors $\tilde{\alpha}^{(1)}$ and $\tilde{\alpha}^{(0)}$.

The main, and perhaps only, drawback to working with the reparameterized model is that it requires us to place priors on the transformed parameters $\tilde{\Gamma}$. The priors we place on these parameters may seem reasonable and suitably "default," but upon closer investigation, they may imply priors for the structural parameters that are unreasonable and are at odds with our views about quantities for which we can more easily elicit our prior beliefs.
For a two-equation treatment-response model containing an ordered treatment variable and an ordered response, Li and Tobias (2005) derive some connections between priors like those employed in Eq. (18) and their consequences on the priors implied on the structural parameters in Eq. (11). With suitably chosen hyperparameters, they argue that the implied priors on the structural coefficients can be reasonable, and that any costs associated with this prior selection issue are more than outweighed by the benefits afforded by the reparameterization. We take up a more detailed view of this issue of prior selection for this particular model in our generated data experiments of Section 5.
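3.3. The Posterior Simulator

We now introduce our posterior simulator for fitting our reparameterized treatment-response model. In what follows, we adopt the notation $\tilde{\Gamma}_{-x}$ to denote all parameters other than x. We first group the joint posterior into $[D^* \ \tilde{z}^{(1)} \ \tilde{z}^{(0)} \ \tilde{\alpha}^{(1)} \ \tilde{\alpha}^{(0)} \ \tilde{\beta} \ \tilde{\Sigma}]$. The latent data and cutpoints will be sampled in blocking steps, while the regression parameters and covariance matrix will be drawn from their complete posterior conditionals.

Step 1: Draw $\tilde{\beta}$ from$^{12}$

$$\tilde{\beta}\,|\,\tilde{s}, \tilde{\Gamma}_{-\tilde{\beta}}, y, D \sim N(D_{\tilde{\beta}} d_{\tilde{\beta}}, D_{\tilde{\beta}})$$

where

$$D_{\tilde{\beta}} \equiv \left( \sum_{i=1}^{n} r_i' \tilde{\Sigma}^{-1} r_i + V_{\tilde{\beta}}^{-1} \right)^{-1} \qquad \text{and} \qquad d_{\tilde{\beta}} \equiv \sum_{i=1}^{n} r_i' \tilde{\Sigma}^{-1} \tilde{s}_i + V_{\tilde{\beta}}^{-1} b_0$$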
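
Step 1 can be sketched in a few lines of NumPy. The function name `beta_posterior_moments`, the array layout (the matrices $r_i$ stacked into an `(n, 3, k)` array) and the simulated inputs are assumptions made for illustration only.

```python
import numpy as np


def beta_posterior_moments(r, s_tilde, Sigma, b0, V_inv):
    """Mean and covariance of the Normal conditional in Step 1.

    r       : (n, 3, k) array stacking the matrices r_i
    s_tilde : (n, 3) array stacking the latent vectors s_i
    Sigma   : (3, 3) covariance matrix of the transformed errors
    b0      : (k,) prior mean;  V_inv : (k, k) prior precision
    """
    Sigma_inv = np.linalg.inv(Sigma)
    # sum_i r_i' Sigma^{-1} r_i + V^{-1}
    precision = np.einsum("nji,jl,nlm->im", r, Sigma_inv, r) + V_inv
    # sum_i r_i' Sigma^{-1} s_i + V^{-1} b0
    d = np.einsum("nji,jl,nl->i", r, Sigma_inv, s_tilde) + V_inv @ b0
    D = np.linalg.inv(precision)
    D = 0.5 * (D + D.T)  # guard against round-off asymmetry
    return D @ d, D


# Illustrative data: scalar w_i and x_i, so k = 3.
rng = np.random.default_rng(0)
n = 200
w, x = rng.normal(size=n), rng.normal(size=n)
r = np.zeros((n, 3, 3))
r[:, 0, 0], r[:, 1, 1], r[:, 2, 2] = w, x, x
s_tilde = r @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=(n, 3))

mean, cov = beta_posterior_moments(r, s_tilde, np.eye(3), np.zeros(3), np.eye(3) / 1000.0)
beta_draw = rng.multivariate_normal(mean, cov)  # one Gibbs draw
```

With an identity error covariance and the diffuse prior variance of 1000 used in the text, the conditional mean is essentially equation-by-equation least squares, which gives a quick check on the implementation.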
Step 2: Draw $\tilde{\Sigma}$ from

$$\tilde{\Sigma}\,|\,\tilde{s}, \tilde{\Gamma}_{-\tilde{\Sigma}}, y, D \sim IW\left( n + \rho, \left[ \sum_{i=1}^{n} (\tilde{s}_i - r_i\tilde{\beta})(\tilde{s}_i - r_i\tilde{\beta})' + \rho R \right] \right) I(\tilde{\Sigma}_{11} = 1)$$

Algorithm 3 in Nobile (2000) is used to generate variates from this inverted Wishart distribution, conditioned on the value of the (1,1) element. The remaining steps in the posterior simulator involve joint sampling of the latent data $\tilde{s} = [D^* \ \tilde{z}^{(1)} \ \tilde{z}^{(0)}]$ and cutpoint vectors $\tilde{\alpha}^{(1)}$ and $\tilde{\alpha}^{(0)}$. We attempt to mitigate autocorrelation in our parameter chains by blocking or grouping the cutpoints from a given equation together with the latent data appearing
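in that equation. Specifically, we proceed by sampling from the following densities:

$$\tilde{\alpha}^{(1)}, \tilde{z}^{(1)}\,|\,\tilde{z}^{(0)}, D^*, \tilde{\Gamma}_{-\tilde{\alpha}^{(1)}}, y, D \tag{19}$$

$$\tilde{\alpha}^{(0)}, \tilde{z}^{(0)}\,|\,\tilde{z}^{(1)}, D^*, \tilde{\Gamma}_{-\tilde{\alpha}^{(0)}}, y, D \tag{20}$$

and

$$D^*\,|\,\tilde{z}^{(1)}, \tilde{z}^{(0)}, \tilde{\Gamma}, y, D \tag{21}$$

Taking a closer look at the first of these three densities, we find from Eq. (18)

$$\tilde{\alpha}^{(1)}, \tilde{z}^{(1)}\,|\,\tilde{z}^{(0)}, D^*, \tilde{\Gamma}_{-\tilde{\alpha}^{(1)}}, y, D \propto \prod_{i=1}^{n} \phi_3(\tilde{s}_i; r_i\tilde{\beta}, \tilde{\Sigma}) \left[ I(\tilde{\alpha}_{y_i}^{(1)} < \tilde{z}_i^{(1)} \le \tilde{\alpha}_{y_i+1}^{(1)})\, I(D_i = 1) + I(D_i = 0) \right]$$

$$\propto \prod_{i=1}^{n} \phi(\tilde{z}_i^{(1)}; \tilde{\mu}_{i1}^{c}, \tilde{\sigma}_1^{c}) \left[ I(\tilde{\alpha}_{y_i}^{(1)} < \tilde{z}_i^{(1)} \le \tilde{\alpha}_{y_i+1}^{(1)})\, I(D_i = 1) + I(D_i = 0) \right] \tag{22}$$

Note that the indicator functions involving $D_i^*$ and $\tilde{z}_i^{(0)}$ in Eq. (18) have disappeared completely simply because we are now conditioning on these latent parameters. In the last line of Eq. (22), we have broken the trivariate Normal density for $\tilde{s}_i$ into a conditional for $\tilde{z}_i^{(1)}$ times the joint for $\tilde{z}_i^{(0)}$ and $D_i^*$. The latter joint density is then absorbed into the normalizing constant of Eq. (22), as it does not involve $\tilde{\alpha}^{(1)}$ or $\tilde{z}^{(1)}$. It follows that the conditional mean $\tilde{\mu}_{i1}^{c}$ and conditional variance $\tilde{\sigma}_1^{c}$ are defined as

$$\tilde{\mu}_{i1}^{c} \equiv x_i\tilde{\beta}^{(1)} + [\tilde{\sigma}_{1D} \ \ \tilde{\sigma}_{10}] \begin{bmatrix} 1 & \tilde{\sigma}_{0D} \\ \tilde{\sigma}_{0D} & \sigma_0 \end{bmatrix}^{-1} \begin{bmatrix} D_i^* - w_i\beta^{(D)} \\ \tilde{z}_i^{(0)} - x_i\tilde{\beta}^{(0)} \end{bmatrix}$$

and

$$\tilde{\sigma}_1^{c} \equiv \sigma_1 - [\tilde{\sigma}_{1D} \ \ \tilde{\sigma}_{10}] \begin{bmatrix} 1 & \tilde{\sigma}_{0D} \\ \tilde{\sigma}_{0D} & \sigma_0 \end{bmatrix}^{-1} \begin{bmatrix} \tilde{\sigma}_{1D} \\ \tilde{\sigma}_{10} \end{bmatrix}$$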
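
The conditional moments above follow from the usual partitioned-Normal formulas, and the same calculation recurs for each of the three latent variables. A minimal sketch is given below, assuming the ordering $(D^*, \tilde{z}^{(1)}, \tilde{z}^{(0)})$ for the rows of $\tilde{\Sigma}$; the function name `conditional_moments` and the numerical inputs are hypothetical.

```python
import numpy as np


def conditional_moments(Sigma, target, resid):
    """Conditional mean shift and variance for one component of a zero-mean
    trivariate Normal given the other two components.

    Sigma  : (3, 3) covariance, ordered as (D*, z1, z0)
    target : index of the component being conditioned (1 gives the z1 case)
    resid  : (n, 3) residuals s_i - r_i beta (the target column is ignored)
    Returns (shift, var); the conditional mean is x_i beta^(target) + shift_i.
    """
    others = [k for k in range(3) if k != target]
    cov_to = Sigma[target, others]           # e.g. [sigma_1D, sigma_10]
    cov_oo = Sigma[np.ix_(others, others)]   # e.g. [[1, sigma_0D], [sigma_0D, sigma_0]]
    weights = np.linalg.solve(cov_oo, cov_to)
    shift = resid[:, others] @ weights
    var = Sigma[target, target] - cov_to @ weights
    return shift, var


# Illustrative covariance matrix and residuals.
Sigma = np.array([[1.0, 0.3, -0.2],
                  [0.3, 0.8, 0.25],
                  [-0.2, 0.25, 1.5]])
resid = np.array([[0.5, -1.0, 2.0],
                  [0.0, 0.3, -0.7]])
shift1, var1 = conditional_moments(Sigma, 1, resid)
```

Calling the same function with `target=2` or `target=0` yields the analogous quantities for $\tilde{z}^{(0)}$ and $D^*$ used later in the sampler.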
To obtain a draw from Eq. (19), we proceed in two steps and use the method of composition (see, e.g., Chib, 2001). First, we marginalize Eq. (19) over $\tilde{z}^{(1)}$ and describe a procedure for drawing $\tilde{\alpha}^{(1)}$ from this density. In the second
step, we draw $\tilde{z}^{(1)}$ from $\tilde{z}^{(1)}\,|\,\tilde{z}^{(0)}, D^*, \tilde{\Gamma}, y, D$. The realized values of $\tilde{\alpha}^{(1)}$ and $\tilde{z}^{(1)}$ then form a draw from Eq. (19). After integrating Eq. (19) over $\tilde{z}^{(1)}$, we obtain

$$\tilde{\alpha}^{(1)}\,|\,\tilde{z}^{(0)}, D^*, \tilde{\Gamma}_{-\tilde{\alpha}^{(1)}}, y, D \propto \prod_{i: D_i = 1} \left[ \Phi\left( \frac{\tilde{\alpha}_{y_i+1}^{(1)} - \tilde{\mu}_{i1}^{c}}{\sqrt{\tilde{\sigma}_1^{c}}} \right) - \Phi\left( \frac{\tilde{\alpha}_{y_i}^{(1)} - \tilde{\mu}_{i1}^{c}}{\sqrt{\tilde{\sigma}_1^{c}}} \right) \right] \tag{23}$$

Step 3: To sample from the density in Eq. (23), we follow Nandram and Chen (1996), who suggest using a Dirichlet proposal density to sample differences$^{13}$ between the cutpoint values, $q_j^{(1)} = \tilde{\alpha}_{j+1}^{(1)} - \tilde{\alpha}_j^{(1)}$, $j = 3, \ldots, J-1$. Given that the largest cutpoint takes the value of unity, we can then solve back to obtain the values of the cutpoints themselves. Specifically, we sample

$$\{q_j^{(1)}\}_{j=3}^{J-1} \sim \text{Dirichlet}\left( \{\delta_j^{(1)} n_j^{(1)} + 1\}_{j=3}^{J-1} \right),$$

where $\{\delta_j^{(1)}\}_{j=3}^{J-1} = 0.1$ are tuning parameters, and $n_j^{(1)} \equiv \sum_{i=1}^{n} I(y_i = j)\, I(D_i = 1)$, $j = 3, \ldots, J-1$, are the numbers of individuals falling into each category of the outcome variable in the treated state. The probability of accepting the candidate draw is the standard Metropolis–Hastings probability, min(R, 1), where

$$R = \prod_{i: D_i = 1} \left[ \frac{ \Phi\left( [\tilde{\alpha}_{y_i+1,\text{can}}^{(1)} - \tilde{\mu}_{i1}^{c}] / \sqrt{\tilde{\sigma}_1^{c}} \right) - \Phi\left( [\tilde{\alpha}_{y_i,\text{can}}^{(1)} - \tilde{\mu}_{i1}^{c}] / \sqrt{\tilde{\sigma}_1^{c}} \right) }{ \Phi\left( [\tilde{\alpha}_{y_i+1,l-1}^{(1)} - \tilde{\mu}_{i1}^{c}] / \sqrt{\tilde{\sigma}_1^{c}} \right) - \Phi\left( [\tilde{\alpha}_{y_i,l-1}^{(1)} - \tilde{\mu}_{i1}^{c}] / \sqrt{\tilde{\sigma}_1^{c}} \right) } \right] \times \prod_{j=3}^{J-1} \left( \frac{q_{j,l-1}^{(1)}}{q_{j,\text{can}}^{(1)}} \right)^{\delta_j^{(1)} n_j^{(1)}}$$

Here "$l-1$" denotes the current value in the algorithm and "can" denotes the candidate draw from the Dirichlet proposal density.

Step 4: Sample $\tilde{z}_i^{(1)}$ independently from the conditional

$$\tilde{z}_i^{(1)}\,|\,\tilde{z}^{(0)}, D^*, \tilde{\Gamma}, y, D \ \overset{ind}{\sim}\ \begin{cases} TN_{(\tilde{\alpha}_{y_i}^{(1)},\, \tilde{\alpha}_{y_i+1}^{(1)})}(\tilde{\mu}_{i1}^{c}, \tilde{\sigma}_1^{c}) & \text{if } D_i = 1 \\ N(\tilde{\mu}_{i1}^{c}, \tilde{\sigma}_1^{c}) & \text{if } D_i = 0 \end{cases}, \qquad i = 1, 2, \ldots, n$$

This is a Normal density with mean $\tilde{\mu}_{i1}^{c}$ and variance $\tilde{\sigma}_1^{c}$, and is truncated to the interval $(\tilde{\alpha}_{y_i}^{(1)}, \tilde{\alpha}_{y_i+1}^{(1)})$ if individual i is observed to be in the treatment group. When $D_i = 0$, no restrictions arise regarding the latent data $\tilde{z}_i^{(1)}$, and thus the draw is obtained from the untruncated Normal density. To generate draws from a univariate truncated Normal density, one can use standard inversion methods. That is, to generate $x \sim TN_{(a,b)}(\mu, \sigma^2)$, first
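draw U uniformly on (0,1) and then set

$$x = \mu + \sigma \Phi^{-1}\left( \Phi\left( \frac{a - \mu}{\sigma} \right) + U \left[ \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right) \right] \right)$$

Step 5: By similar arguments as those leading up to step 3, one can show that

$$\tilde{\alpha}^{(0)}\,|\,\tilde{z}^{(1)}, D^*, \tilde{\Gamma}_{-\tilde{\alpha}^{(0)}}, y, D \propto \prod_{i: D_i = 0} \left[ \Phi\left( \frac{\tilde{\alpha}_{y_i+1}^{(0)} - \tilde{\mu}_{i0}^{c}}{\sqrt{\tilde{\sigma}_0^{c}}} \right) - \Phi\left( \frac{\tilde{\alpha}_{y_i}^{(0)} - \tilde{\mu}_{i0}^{c}}{\sqrt{\tilde{\sigma}_0^{c}}} \right) \right] \tag{24}$$

where

$$\tilde{\mu}_{i0}^{c} \equiv x_i\tilde{\beta}^{(0)} + [\tilde{\sigma}_{0D} \ \ \tilde{\sigma}_{10}] \begin{bmatrix} 1 & \tilde{\sigma}_{1D} \\ \tilde{\sigma}_{1D} & \sigma_1 \end{bmatrix}^{-1} \begin{bmatrix} D_i^* - w_i\beta^{(D)} \\ \tilde{z}_i^{(1)} - x_i\tilde{\beta}^{(1)} \end{bmatrix}$$

and

$$\tilde{\sigma}_0^{c} \equiv \sigma_0 - [\tilde{\sigma}_{0D} \ \ \tilde{\sigma}_{10}] \begin{bmatrix} 1 & \tilde{\sigma}_{1D} \\ \tilde{\sigma}_{1D} & \sigma_1 \end{bmatrix}^{-1} \begin{bmatrix} \tilde{\sigma}_{0D} \\ \tilde{\sigma}_{10} \end{bmatrix}$$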
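A strategy identical to that described in step 3 can be used to simulate the cutpoints from this density.

Step 6: Sample $\tilde{z}_i^{(0)}$ independently from the conditional

$$\tilde{z}_i^{(0)}\,|\,\tilde{z}^{(1)}, D^*, \tilde{\Gamma}, y, D \ \overset{ind}{\sim}\ \begin{cases} TN_{(\tilde{\alpha}_{y_i}^{(0)},\, \tilde{\alpha}_{y_i+1}^{(0)})}(\tilde{\mu}_{i0}^{c}, \tilde{\sigma}_0^{c}) & \text{if } D_i = 0 \\ N(\tilde{\mu}_{i0}^{c}, \tilde{\sigma}_0^{c}) & \text{if } D_i = 1 \end{cases}, \qquad i = 1, 2, \ldots, n$$

Step 7: Sample $D_i^*$ independently from the conditional

$$D_i^*\,|\,\tilde{z}^{(0)}, \tilde{z}^{(1)}, \tilde{\Gamma}, y, D \ \overset{ind}{\sim}\ \begin{cases} TN_{(0,\infty)}(\tilde{\mu}_{iD}^{c}, \tilde{\sigma}_D^{c}) & \text{if } D_i = 1 \\ TN_{(-\infty,0]}(\tilde{\mu}_{iD}^{c}, \tilde{\sigma}_D^{c}) & \text{if } D_i = 0 \end{cases}, \qquad i = 1, 2, \ldots, n$$

where

$$\tilde{\mu}_{iD}^{c} \equiv w_i\beta^{(D)} + [\tilde{\sigma}_{1D} \ \ \tilde{\sigma}_{0D}] \begin{bmatrix} \sigma_1 & \tilde{\sigma}_{10} \\ \tilde{\sigma}_{10} & \sigma_0 \end{bmatrix}^{-1} \begin{bmatrix} \tilde{z}_i^{(1)} - x_i\tilde{\beta}^{(1)} \\ \tilde{z}_i^{(0)} - x_i\tilde{\beta}^{(0)} \end{bmatrix}$$

and

$$\tilde{\sigma}_D^{c} \equiv 1 - [\tilde{\sigma}_{1D} \ \ \tilde{\sigma}_{0D}] \begin{bmatrix} \sigma_1 & \tilde{\sigma}_{10} \\ \tilde{\sigma}_{10} & \sigma_0 \end{bmatrix}^{-1} \begin{bmatrix} \tilde{\sigma}_{1D} \\ \tilde{\sigma}_{0D} \end{bmatrix}$$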
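
Steps 4, 6 and 7 all require draws from a univariate truncated Normal density. The inversion method described in Step 4 can be sketched as follows; the function name `draw_truncated_normal` is hypothetical, and note that it takes the standard deviation $\sigma$ rather than the variance as its scale argument.

```python
import numpy as np
from scipy.stats import norm


def draw_truncated_normal(mu, sigma, a, b, rng, size=None):
    """Draw from N(mu, sigma^2) truncated to (a, b) by inverting the cdf.

    sigma is a standard deviation; a and b may be -inf / +inf.
    """
    u = rng.uniform(size=size)
    cdf_a = norm.cdf((a - mu) / sigma)
    cdf_b = norm.cdf((b - mu) / sigma)
    return mu + sigma * norm.ppf(cdf_a + u * (cdf_b - cdf_a))


# Illustrative use: the D_i = 1 case of Step 7, truncated to (0, inf).
rng = np.random.default_rng(42)
mu, sigma = -0.5, 1.2
draws = draw_truncated_normal(mu, sigma, 0.0, np.inf, rng, size=100_000)
```

Inversion is simple and vectorizes well, but it can lose accuracy when the truncation region lies far in the tail of the untruncated density; in that situation a specialized rejection sampler may be preferable.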
Iterating through steps 1–7 produces a draw from the augmented joint posterior distribution. To recover the structural coefficients of interest, we simply "invert" the mappings described above Eq. (13) and below Eq. (17).

3.4. Extending the Model: Allowing for Non-Normality

A limitation of the model described thus far in this paper is its reliance on joint Normality. For some applications, such as log wage outcomes (e.g., Heckman & Sedlacek, 1985; Heckman, 2004), the Normality assumption may be a reasonable approximation, and if the model passes a selection of diagnostic tests$^{14}$ no further refinements would be required. For other models, researchers may worry about heavy tails, asymmetry or possibly bimodality in the disturbance variance. Below we outline simple computational tricks for capturing these features of the data and generalizing the Normality assumption. The most straightforward extension of the model is to expand to Student's t-errors by simply adding the appropriate mixing variables to the disturbance variance (see, e.g., Carlin & Polson, 1991; Geweke, 1993; Albert & Chib, 1993; Chib & Hamilton, 2000; Li et al., 2004). For example, if we generalize the Normality assumption in Eq. (6) to

$$\begin{bmatrix} u_i \\ \varepsilon_i^{(1)} \\ \varepsilon_i^{(0)} \end{bmatrix} \Bigg|\, \lambda_i, x_i, w_i, \Sigma \ \overset{ind}{\sim}\ N(0, \lambda_i\Sigma), \qquad \text{where } \Sigma \equiv \begin{bmatrix} 1 & \rho^{(1)} & \rho^{(0)} \\ \rho^{(1)} & 1 & \rho^{(10)} \\ \rho^{(0)} & \rho^{(10)} & 1 \end{bmatrix} \tag{25}$$

and specify a prior for $\lambda_i$ of the form$^{15}$

$$\lambda_i \overset{iid}{\sim} IG(\nu/2, 2/\nu)$$

it follows that (marginalized over the prior for $\lambda_i$):

$$\begin{bmatrix} u_i \\ \varepsilon_i^{(1)} \\ \varepsilon_i^{(0)} \end{bmatrix} \Bigg|\, x_i, w_i, \Sigma \sim t_\nu(0, \Sigma) \tag{26}$$
a multivariate Student's t-density with mean zero, scale matrix $\Sigma$ and $\nu$ degrees of freedom. This device is particularly useful for modeling symmetric error densities whose tails are heavier than those implied by the Normal density. In addition, such an extension to the model comes at little computational cost since, conditioned on $\{\lambda_i\}$, sampling the regression parameters and covariance matrix is straightforward, and each $\lambda_i$ can be drawn independently from its complete posterior conditional, which is of an inverse Gamma form. An analogous and potentially more flexible extension of the model is to suppose that the errors were drawn from a mixture of Normal densities. Like Eq. (25), we might write

$$\begin{bmatrix} u_i \\ \varepsilon_i^{(1)} \\ \varepsilon_i^{(0)} \end{bmatrix} \Bigg|\, \lambda_i, x_i, w_i, \Sigma \ \overset{ind}{\sim}\ \lambda_i N(0, \Sigma_1) + (1 - \lambda_i) N(0, \Sigma_2) \tag{27}$$

So, conditioned on $\lambda_i$ (which is unobserved), each observation is ascribed to one component of the mixture model with covariance matrix equal to either $\Sigma_1$ or $\Sigma_2$. Since the component assignment is known given $\lambda_i$, it is, again, straightforward to obtain draws from the regression parameters and component-specific covariance matrices. The $\lambda_i$ are then simulated independently from a two-point distribution.$^{16}$ Finally, one can generalize this mixture model even further by allowing the regression parameters to vary across the mixture components. To do this, we write

$$s_i\,|\,\lambda_i, \Gamma \ \overset{ind}{\sim}\ \lambda_i N(r_i\beta_1, \Sigma_1) + (1 - \lambda_i) N(r_i\beta_2, \Sigma_2)$$

where $s_i$ and $r_i$ are as defined in Eq. (12), and $\beta_j$, $\Sigma_j$ represent the regression parameter vector and covariance matrix from the jth component of the mixture. Generalization to more than two components is also straightforward, and the component indicators and component probabilities can be simulated from multinomial and Dirichlet densities, respectively (see, e.g., Li et al., 2004).
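
The scale-mixture representation of the Student's t-errors in Eqs. (25) and (26) is easy to verify by simulation. The sketch below assumes the inverse Gamma prior is parameterized so that $1/\lambda_i$ follows a Gamma distribution with shape $\nu/2$ and scale $2/\nu$ (this reading of $IG(\nu/2, 2/\nu)$ is an assumption about the chapter's notation); the function name `draw_t_errors` and the numerical values are hypothetical.

```python
import numpy as np


def draw_t_errors(n, Sigma, nu, rng):
    """Simulate n error vectors from t_nu(0, Sigma) as a scale mixture of Normals.

    Assumes 1/lambda_i ~ Gamma(shape=nu/2, scale=2/nu), so E[1/lambda_i] = 1,
    and errors | lambda_i ~ N(0, lambda_i * Sigma).
    """
    lam = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    return np.sqrt(lam)[:, None] * z, lam


# Illustrative correlation matrix and degrees of freedom.
rng = np.random.default_rng(7)
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
errors, lam = draw_t_errors(200_000, Sigma, nu=10.0, rng=rng)
sample_cov = np.cov(errors, rowvar=False)
```

For $\nu > 2$ the covariance of a $t_\nu(0, \Sigma)$ vector is $\frac{\nu}{\nu - 2}\Sigma$, so with $\nu = 10$ the sample covariance should be close to $1.25\,\Sigma$.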
4. TREATMENT EFFECTS

In this section we derive expressions for conventional "treatment effects" in our ordered outcome treatment-response model. In particular, we adapt
conventional treatment parameters, including the ATE, the TT and the LATE, to our ordered response model, and describe how these can be calculated within this framework.
4.1. The Average Treatment Effect

We begin with a discussion of the ATE. This parameter typically quantifies the expected outcome gain for a randomly chosen individual. Since our response is ordered, this parameter may not be of direct relevance, as it demands a cardinal representation of an ordinal variable.$^{17}$ In light of this issue, we choose to adapt the ATE parameter to describe across-regime changes in probabilities associated with various categories. To fix ideas, then, we consider the impact of the treatment on increasing (or decreasing) the probability that the outcome exceeds the "lowest" category:

$$ATE(x, \Gamma) \equiv \Pr(y^{(1)} \ge 2|x, \Gamma) - \Pr(y^{(0)} \ge 2|x, \Gamma) \tag{28}$$

$$= \sum_{j=2}^{J} \left[ \Pr(y^{(1)} = j|x, \Gamma) - \Pr(y^{(0)} = j|x, \Gamma) \right] \tag{29}$$

$$= \sum_{j=2}^{J} \left( \left[ \Phi(\alpha_{j+1}^{(1)} - x\beta^{(1)}) - \Phi(\alpha_j^{(1)} - x\beta^{(1)}) \right] - \left[ \Phi(\alpha_{j+1}^{(0)} - x\beta^{(0)}) - \Phi(\alpha_j^{(0)} - x\beta^{(0)}) \right] \right)$$

The choice of the lowest category is without loss of generality; other probabilities can be obtained in similar ways. We relate this quantity to ATE since it corresponds to a probability increase (or decrease) for a randomly chosen individual. A point estimate of this treatment impact is readily obtained using our simulated set of parameters drawn from the joint posterior:

$$\widehat{ATE}(x) \equiv E_{\Gamma|y,D}[ATE(x, \Gamma)] \approx \frac{1}{M} \sum_{m=1}^{M} ATE(x, \Gamma_m) \tag{30}$$

where $\Gamma_m \sim p(\Gamma|y, D)$ and is obtained from the algorithm described in Section 3.3.
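
Eqs. (28)–(30) translate directly into code once posterior draws of the structural parameters are available. The sketch below is illustrative only: the function names, the dictionary layout of a posterior draw, and the numerical values are all hypothetical, and each cutpoint vector is assumed to be stored in full as $[\alpha_1 = -\infty, \alpha_2 = 0, \ldots, \alpha_{J+1} = \infty]$.

```python
import numpy as np
from scipy.stats import norm


def prob_at_least_2(xb, cutpoints):
    """Pr(y >= 2 | x) = sum_{j=2}^{J} [Phi(alpha_{j+1} - xb) - Phi(alpha_j - xb)].

    cutpoints is the full vector [alpha_1, ..., alpha_{J+1}].
    """
    upper = norm.cdf(cutpoints[2:] - xb)     # alpha_{j+1}, j = 2..J
    lower = norm.cdf(cutpoints[1:-1] - xb)   # alpha_j,     j = 2..J
    return float(np.sum(upper - lower))


def ate(x, draw):
    """ATE(x, Gamma) of Eq. (28) for a single parameter draw."""
    return (prob_at_least_2(x @ draw["beta1"], draw["alpha1"])
            - prob_at_least_2(x @ draw["beta0"], draw["alpha0"]))


def posterior_mean_ate(x, draws):
    """Monte Carlo estimate of Eq. (30): average ATE over posterior draws."""
    return float(np.mean([ate(x, d) for d in draws]))


# Illustrative single "draw" with J = 3 categories.
draw = {
    "beta1": np.array([0.4, 0.8]), "beta0": np.array([0.1, 0.2]),
    "alpha1": np.array([-np.inf, 0.0, 0.9, np.inf]),
    "alpha0": np.array([-np.inf, 0.0, 1.3, np.inf]),
}
x = np.array([1.0, 0.5])
value = ate(x, draw)
```

Because $\alpha_2 = 0$ and the sum telescopes, `value` reduces to $\Phi(x\beta^{(1)}) - \Phi(x\beta^{(0)})$; the general sum is kept so that other category thresholds can be handled by changing the slices.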
4.2. The Effect of Treatment on the Treated The effect of TT is a conceptually different parameter and describes the outcome gain (or loss) from treatment for those actually selecting into treatment. Again, we examine the treatment effect on the probability that the outcome variable does not fall into the lowest category: TTðx; w; DðwÞ ¼ 1; GÞ Prðyð1Þ 2jx; w; DðwÞ ¼ 1; GÞ Prðyð0Þ 2jx; w; DðwÞ ¼ 1; GÞ J
X Prðyð1Þ ¼ jjx; w; DðwÞ ¼ 1; GÞ
¼
j¼2
Prðyð0Þ ¼ jjx; w; DðwÞ ¼ 1; GÞ
ð31Þ
where, as stated in the introduction to Section 2, $D(w)=1$ is equivalent to $D=1$, but simply emphasizes the dependence of the treatment decision on values of the covariates $w$. To economize on notation, let us define

$$P^{TT}_{1,j}(\Gamma) \equiv \Pr(y^{(1)} = j | x, w, D(w)=1, \Gamma) \quad \text{and} \quad P^{TT}_{0,j}(\Gamma) \equiv \Pr(y^{(0)} = j | x, w, D(w)=1, \Gamma) \qquad (32)$$

keeping the conditioning on $x$, $w$ and $D(w)=1$ implicit. Given these definitions, it follows that $TT(x, w, D(w)=1, \Gamma) = \sum_{j=2}^{J} [P^{TT}_{1,j}(\Gamma) - P^{TT}_{0,j}(\Gamma)]$. Recalling our description of the likelihood in Section 2.1, we can write the probabilities in Eq. (32) in more computationally convenient forms. For example,

$$\begin{aligned}
P^{TT}_{1,j}(\Gamma) &\equiv \Pr(\alpha_j^{(1)} < z^{(1)} \leq \alpha_{j+1}^{(1)} | u > -w\beta^{(D)}) \\
&= \Pr(\alpha_j^{(1)} - x\beta^{(1)} < \epsilon^{(1)} \leq \alpha_{j+1}^{(1)} - x\beta^{(1)} | u > -w\beta^{(D)}) \\
&= \int_{\alpha_j^{(1)} - x\beta^{(1)}}^{\alpha_{j+1}^{(1)} - x\beta^{(1)}} p(\epsilon^{(1)} | u > -w\beta^{(D)}) \, d\epsilon^{(1)} \\
&= \int_{\alpha_j^{(1)} - x\beta^{(1)}}^{\alpha_{j+1}^{(1)} - x\beta^{(1)}} \int_{-w\beta^{(D)}}^{\infty} \frac{p(\epsilon^{(1)}, u)}{\Pr(u > -w\beta^{(D)})} \, du \, d\epsilon^{(1)} \\
&= \int_{-w\beta^{(D)}}^{\infty} \left[ \int_{\alpha_j^{(1)} - x\beta^{(1)}}^{\alpha_{j+1}^{(1)} - x\beta^{(1)}} p(\epsilon^{(1)} | u) \, d\epsilon^{(1)} \right] \frac{p(u)}{\Pr(u > -w\beta^{(D)})} \, du \\
&= \int_{-w\beta^{(D)}}^{\infty} \left[ \Phi\left( \frac{\alpha_{j+1}^{(1)} - x\beta^{(1)} - \rho^{(1)} u}{\sqrt{1 - [\rho^{(1)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(1)} - x\beta^{(1)} - \rho^{(1)} u}{\sqrt{1 - [\rho^{(1)}]^2}} \right) \right] \frac{p(u)}{\Pr(u > -w\beta^{(D)})} \, du
\end{aligned}$$

The integral above is simply

$$E_u \left[ \Phi\left( \frac{\alpha_{j+1}^{(1)} - x\beta^{(1)} - \rho^{(1)} u}{\sqrt{1 - [\rho^{(1)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(1)} - x\beta^{(1)} - \rho^{(1)} u}{\sqrt{1 - [\rho^{(1)}]^2}} \right) \right]$$

where $u \sim TN_{(-w\beta^{(D)}, \infty)}(0, 1)$. Thus, the strong law of large numbers guarantees that

$$\hat{P}^{TT}_{1,j}(\Gamma) \equiv \frac{1}{L} \sum_{l=1}^{L} \left[ \Phi\left( \frac{\alpha_{j+1}^{(1)} - x\beta^{(1)} - \rho^{(1)} u^{(l)}}{\sqrt{1 - [\rho^{(1)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(1)} - x\beta^{(1)} - \rho^{(1)} u^{(l)}}{\sqrt{1 - [\rho^{(1)}]^2}} \right) \right] \xrightarrow{p} P^{TT}_{1,j}(\Gamma)$$

where $\{u^{(l)}\}_{l=1}^{L}$ denotes an iid sample from the standard Normal distribution truncated to $(-w\beta^{(D)}, \infty)$.18 Following similar arguments, one can show

$$\hat{P}^{TT}_{0,j}(\Gamma) \equiv \frac{1}{L} \sum_{l=1}^{L} \left[ \Phi\left( \frac{\alpha_{j+1}^{(0)} - x\beta^{(0)} - \rho^{(0)} u^{(l)}}{\sqrt{1 - [\rho^{(0)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(0)} - x\beta^{(0)} - \rho^{(0)} u^{(l)}}{\sqrt{1 - [\rho^{(0)}]^2}} \right) \right] \xrightarrow{p} P^{TT}_{0,j}(\Gamma)$$

Putting these results together, and, of course, exploiting the availability of draws from our joint posterior, we can calculate the following point estimate of TT:

$$\widehat{TT}(x, w, D(w)=1) = E_{\Gamma|y,D}[TT(x, w, D(w)=1, \Gamma)] \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \sum_{j=2}^{J} \left( \hat{P}^{TT}_{1,j}(\Gamma_m) - \hat{P}^{TT}_{0,j}(\Gamma_m) \right) \right] \qquad (33)$$

with $\Gamma_m \sim p(\Gamma|y,D)$.
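The Monte Carlo approximation of $P^{TT}_{1,j}(\Gamma)$ can be sketched as follows. The sketch is illustrative only: it draws the truncated Normal variates by inverse-cdf sampling, which is an assumed substitute for the routine referenced in Section 3.3, and the function and argument names (`p_tt_hat`, `xb`, `wb_d`) are hypothetical. Here `xb` stands for $x\beta^{(1)}$ and `wb_d` for $w\beta^{(D)}$.

```python
import random
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard Normal distribution


def draw_truncated_normal(lower, rng):
    """One N(0,1) draw truncated to (lower, inf), via inverse-cdf sampling."""
    p_low = STD.cdf(lower)
    p = p_low + rng.random() * (1.0 - p_low)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return STD.inv_cdf(p)


def p_tt_hat(a_lo, a_hi, xb, rho, wb_d, L=2000, seed=0):
    """Monte Carlo estimate of Pr(a_lo < z <= a_hi | u > -wb_d) for one regime."""
    rng = random.Random(seed)
    s = sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(L):
        u = draw_truncated_normal(-wb_d, rng)
        upper = STD.cdf((a_hi - xb - rho * u) / s)
        lower = STD.cdf((a_lo - xb - rho * u) / s)
        total += upper - lower
    return total / L


# Probability of the top category in the treated state, given selection at wb_d = 0.
print(p_tt_hat(0.913, float("inf"), 0.913, 0.9, 0.0))
```

Because the same seed reproduces the same truncated draws, the category probabilities computed this way sum to one across a full set of cutpoints, which is a convenient sanity check before averaging over posterior draws as in Eq. (33).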
4.3. The Local Average Treatment Effect

The Local Average Treatment Effect can be interpreted as measuring the outcome gain (or loss) from treatment for a group of “compliers.” This corresponds to the effect of treatment on a subgroup of the population who would choose to receive treatment at a particular value of the instrument, say $w$, but would not choose treatment at some $\tilde{w}$.19 Consistent with our discussions in the previous subsections, our parameter of interest is the increased (or decreased) likelihood that the outcome variable exceeds the lowest category:

$$\begin{aligned}
LATE(x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0, \Gamma) &= \Pr(y^{(1)} \geq 2 | x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0, \Gamma) \\
&\quad - \Pr(y^{(0)} \geq 2 | x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0, \Gamma) \\
&= \sum_{j=2}^{J} \left( P^{LATE}_{1,j}(\Gamma) - P^{LATE}_{0,j}(\Gamma) \right)
\end{aligned}$$

where

$$P^{LATE}_{k,j}(\Gamma) \equiv \Pr(y^{(k)} = j | x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0, \Gamma), \quad k = 0, 1$$

To calculate LATE, we follow a similar strategy to that outlined for calculating the TT effect. It follows that

$$\hat{P}^{LATE}_{1,j}(\Gamma) \equiv \frac{1}{L} \sum_{l=1}^{L} \left[ \Phi\left( \frac{\alpha_{j+1}^{(1)} - x\beta^{(1)} - \rho^{(1)} u^{(l)}}{\sqrt{1 - [\rho^{(1)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(1)} - x\beta^{(1)} - \rho^{(1)} u^{(l)}}{\sqrt{1 - [\rho^{(1)}]^2}} \right) \right] \xrightarrow{p} P^{LATE}_{1,j}(\Gamma)$$

$$\hat{P}^{LATE}_{0,j}(\Gamma) \equiv \frac{1}{L} \sum_{l=1}^{L} \left[ \Phi\left( \frac{\alpha_{j+1}^{(0)} - x\beta^{(0)} - \rho^{(0)} u^{(l)}}{\sqrt{1 - [\rho^{(0)}]^2}} \right) - \Phi\left( \frac{\alpha_j^{(0)} - x\beta^{(0)} - \rho^{(0)} u^{(l)}}{\sqrt{1 - [\rho^{(0)}]^2}} \right) \right] \xrightarrow{p} P^{LATE}_{0,j}(\Gamma)$$

where $u^{(l)} \overset{iid}{\sim} TN_{(-w\beta^{(D)}, -\tilde{w}\beta^{(D)})}(0, 1)$. We can then proceed to obtain a point estimate of LATE:

$$\widehat{LATE}[x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0] = E_{\Gamma|y,D}[LATE(x, w, \tilde{w}, D(w)=1, D(\tilde{w})=0, \Gamma)] \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \sum_{j=2}^{J} \left( \hat{P}^{LATE}_{1,j}(\Gamma_m) - \hat{P}^{LATE}_{0,j}(\Gamma_m) \right) \right]$$

with $\Gamma_m \sim p(\Gamma|y,D)$.
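For a single parameter vector, the LATE calculation differs from the TT calculation only in the two-sided truncation of $u$. A minimal Python sketch follows. It assumes inverse-cdf sampling for the truncated Normal draws and hypothetical argument names (`cuts1`, `cuts0`, `wb`, `wb_tilde`, standing for the two cutpoint vectors, $w\beta^{(D)}$ and $\tilde{w}\beta^{(D)}$); it is not the authors' code.

```python
import random
from math import inf, sqrt
from statistics import NormalDist

STD = NormalDist()


def draw_tn(lower, upper, rng):
    """One N(0,1) draw truncated to (lower, upper), via inverse-cdf sampling."""
    p_lo, p_hi = STD.cdf(lower), STD.cdf(upper)
    p = p_lo + rng.random() * (p_hi - p_lo)
    return STD.inv_cdf(min(max(p, 1e-12), 1.0 - 1e-12))


def late_hat(cuts1, cuts0, xb1, xb0, rho1, rho0, wb, wb_tilde, L=20000, seed=0):
    """LATE on Pr(y >= 2) for one parameter vector.

    cuts1 and cuts0 hold [alpha_1, ..., alpha_{J+1}], with alpha_1 = -inf
    and alpha_{J+1} = +inf, for the treated and untreated regimes.
    """
    if not wb > wb_tilde:
        raise ValueError("compliers need -wb < -wb_tilde, i.e. wb > wb_tilde")
    rng = random.Random(seed)
    us = [draw_tn(-wb, -wb_tilde, rng) for _ in range(L)]

    def p_hat(cuts, xb, rho, j):
        # Category j (1-based) lies between cuts[j - 1] and cuts[j].
        s = sqrt(1.0 - rho ** 2)
        return sum(
            STD.cdf((cuts[j] - xb - rho * u) / s)
            - STD.cdf((cuts[j - 1] - xb - rho * u) / s)
            for u in us
        ) / L

    J = len(cuts1) - 1
    return sum(
        p_hat(cuts1, xb1, rho1, j) - p_hat(cuts0, xb0, rho0, j)
        for j in range(2, J + 1)
    )


cuts1 = [-inf, 0.0, 0.304, 0.609, 0.913, inf]
cuts0 = [-inf, 0.0, 0.318, 0.636, 0.953, inf]
print(late_hat(cuts1, cuts0, 0.913, 0.477, 0.9, 0.7, wb=0.0, wb_tilde=-1.0))
```

The second cutpoint is set to zero here as an assumed normalization. A full point estimate would repeat this calculation for each posterior draw $\Gamma_m$ and average the results.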
4.4. Beyond Mean Treatment Parameters: Learning about $\rho^{(10)}$

The treatment parameters discussed in the previous subsections are typical of the mean treatment effects considered in the literature. To see this, note that an equivalent expression of the parameter of interest $\Pr(y^{(1)} \geq 2|x) - \Pr(y^{(0)} \geq 2|x)$ is $E[I(y^{(1)} \geq 2) - I(y^{(0)} \geq 2)|x]$, where $I(\cdot)$ denotes the indicator function. Part of the appeal of these mean treatment parameters is that they enable researchers to quantify a feature of the treatment impact – the average gains or losses under various conditioning scenarios – even though $y^{(0)}$ and $y^{(1)}$ are not jointly observed. Unlike the mean treatment effects described in the previous subsections, however, other quantities of significant policy relevance such as the probability of a positive treatment effect $\Pr(y^{(1)} - y^{(0)} > 0|x)$ will depend on the correlation parameter $\rho^{(10)}$. This correlation parameter does not enter the likelihood for the observed data (see, e.g., Section 2.1) and thus is not identifiable. This fact has, perhaps, limited the scope of most research to the estimation of mean treatment impacts.20

For the Bayesian, this non-identifiability issue raises the question of what can and should be done about our treatment of the correlation parameter $\rho^{(10)}$.21 One approach, which was used by Chib and Hamilton (2000), is to simply set $\rho^{(10)}=0$, fit the model subject to this restriction and then impose that the restricted covariance matrix (subject to $\rho^{(10)}=0$) is positive definite. While in most cases this will be an innocuous restriction, in some cases, this approach may have unanticipated consequences. For example, if we set $\rho^{(10)}=0$ in Eq. (6), it follows that $\Sigma$ is positive definite if and only if

$$[\rho^{(1)}]^2 + [\rho^{(0)}]^2 < 1$$
This restriction thus forces the identified correlation parameters $\rho^{(1)}$ and $\rho^{(0)}$ to lie within the unit circle rather than the unit square. To illustrate what this restriction means, suppose that we performed a generated data experiment and set $\rho^{(1)}=\rho^{(0)}=0.8$. If we proceeded to fit this model subject to $\rho^{(10)}=0$, and enforced that the restricted covariance matrix was positive definite, then our posterior mode must be inconsistent – the joint posterior $\rho^{(1)}, \rho^{(0)}|y, D$ could never place any mass over the actual values used to generate the data regardless of the size of the generated data set. This problem manifests itself for rather extreme cases of correlation among the unobservables; if the correlations are more moderate, then this is not likely to be a significant issue.22

An alternate approach, which we have advocated in previous works (e.g., Koop & Poirier, 1997; Poirier & Tobias, 2003; Li et al., 2004), is to simply work with the “full” covariance matrix, as described in Section 3.3, without restricting $\rho^{(10)}$ a priori. As shown in Poirier and Tobias (2003), this does not induce an inconsistency regarding the identified model parameters, and moreover, one can potentially learn about the non-identified correlation parameter. Intuitively, information arising through the likelihood function will enable us to pin down all of the correlation parameters in Eq. (6) that are identifiable, leaving only $\rho^{(10)}$ unknown. An additional source of information then arises from the fact that $\Sigma$ must be positive definite. In particular, the p.d. restriction imposes that $\rho^{(10)}$ must have the following conditional support:

$$\rho^{(1)}\rho^{(0)} - \sqrt{(1 - [\rho^{(1)}]^2)(1 - [\rho^{(0)}]^2)} < \rho^{(10)} < \rho^{(1)}\rho^{(0)} + \sqrt{(1 - [\rho^{(1)}]^2)(1 - [\rho^{(0)}]^2)} \qquad (34)$$

This equation provides identifiable bounds on $\rho^{(10)}$ as a function of the identified correlation parameters $\rho^{(1)}$ and $\rho^{(0)}$. Somewhat surprisingly, these bounds also suggest that selection bias may, in a particular sense, be a good thing. When $\rho^{(1)}$ and $\rho^{(0)}$ are large, the bounds given in Eq. (34) become increasingly informative. Intuitively, the presence of selection bias provides a vehicle for learning about $\rho^{(10)}$ – if the errors in the outcome equations are correlated sufficiently with the error in the treatment equation, then to some extent, they must also be correlated with one another. Eq. (34) shows that beliefs regarding $\rho^{(10)}$ will generally be revised from the data – as we learn about $\rho^{(1)}$ and $\rho^{(0)}$, this information spills over and restricts the conditional support of $\rho^{(10)}$. This is, unfortunately, as far as the data will take us – the shape of the marginal posterior density of $\rho^{(10)}$ within the bounds in Eq. (34) is not updated from the data. Poirier and Tobias (2003) show that in sufficiently large samples where $\rho^{(1)}$ and $\rho^{(0)}$ are estimated precisely and are approximately equal to, say, $\rho^{(1,*)}$ and $\rho^{(0,*)}$:

$$p(\rho^{(10)}|y, D) \approx p(\rho^{(10)}|\rho^{(1)} = \rho^{(1,*)}, \rho^{(0)} = \rho^{(0,*)}) \qquad (35)$$

That is, the marginal posterior for the non-identified correlation parameter is approximately equal to the conditional prior for that correlation parameter evaluated at the given values $\rho^{(1,*)}$ and $\rho^{(0,*)}$. The support bounds in Eq. (34) are updated from the data, but within the bounds, the shape of the posterior is completely determined by the shape of the conditional prior. For the Bayesian, this is a natural result; in the absence of information arising from the data, one resorts to the use of prior information.23

The results of these studies suggest that there is, in one sense, a limited opportunity for expanding the focus of research beyond mean effects. One could at least bound $\rho^{(10)}$ and then use these bounds to bound other parameters of interest. If one is comfortable with insinuating prior information, however, one could obtain point estimates of any parameter of interest under a particular prior. “Default” priors yielding marginal posteriors that are uniform over the conditional support bounds may appeal to many researchers when carrying out these calculations. Use of such priors, however, typically makes the problem more challenging from a computational point of view as they often break the inherent conjugacy of the model.
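The support bounds in Eq. (34) and the positive definiteness requirement are easy to check numerically. The short Python sketch below uses hypothetical function names and assumes the correlation matrix is ordered as $[u, \epsilon^{(1)}, \epsilon^{(0)}]$ with unit variances.

```python
from math import sqrt


def rho10_bounds(rho1, rho0):
    """Conditional support of rho10 implied by Eq. (34)."""
    centre = rho1 * rho0
    half_width = sqrt((1.0 - rho1 ** 2) * (1.0 - rho0 ** 2))
    return centre - half_width, centre + half_width


def is_positive_definite(rho1, rho0, rho10):
    """Check the leading principal minors of
    [[1, rho1, rho0], [rho1, 1, rho10], [rho0, rho10, 1]]."""
    minor2 = 1.0 - rho1 ** 2
    det = (1.0 - rho1 ** 2 - rho0 ** 2 - rho10 ** 2
           + 2.0 * rho1 * rho0 * rho10)
    return minor2 > 0.0 and det > 0.0


lo, hi = rho10_bounds(0.9, 0.7)
print(round(lo, 2), round(hi, 2))
print(is_positive_definite(0.8, 0.8, 0.0))
```

With $\rho^{(1)}=\rho^{(0)}=0.8$ the restriction $\rho^{(10)}=0$ falls outside the bounds, which is the unit-circle problem described above.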
5. A GENERATED DATA EXPERIMENT

In this section we conduct a generated data experiment to demonstrate the performance of our posterior simulator and address a potential concern regarding choice of prior. A sample of 5,000 observations is generated from the following ordered potential outcome model:

$$D_i^* = \beta_0^{(D)} + w_i \beta_1^{(D)} + u_i$$
$$z_i^{(1)} = \beta^{(1)} + \epsilon_i^{(1)}$$
$$z_i^{(0)} = \beta^{(0)} + \epsilon_i^{(0)}$$

where $w_i$ is drawn independently from a $N(0,1)$ distribution and the error terms $[u_i \; \epsilon_i^{(1)} \; \epsilon_i^{(0)}]'$ are drawn jointly from the trivariate Normal distribution:

$$\begin{bmatrix} u_i \\ \epsilon_i^{(1)} \\ \epsilon_i^{(0)} \end{bmatrix} \Bigg| \, w_i \overset{iid}{\sim} N \left[ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.6 \\ 0.7 & 0.6 & 1 \end{pmatrix} \right]$$
We consider this specific design with a high degree of unobservable correlation to reveal how our algorithm performs when selection bias is a significant problem.24 The non-identified correlation $\rho^{(10)}$ is set to 0.6, and thus from Eq. (34) the covariance matrix is positive definite. Finally, the regression parameters $\beta_0^{(D)}$, $\beta_1^{(D)}$, $\beta^{(1)}$ and $\beta^{(0)}$ and cutpoint values $\alpha_j^{(k)}$, $j = 3, 4, 5$; $k = 0, 1$ are enumerated in the first column of Table 1, and the observables $D_i$, $y_i^{(1)}$ and $y_i^{(0)}$ are generated as follows:

$$D_i = I(D_i^* > 0)$$
$$y_i^{(1)} = j \quad \text{if } \alpha_j^{(1)} < z_i^{(1)} \leq \alpha_{j+1}^{(1)}, \quad j = 1, 2, 3, 4, 5$$
$$y_i^{(0)} = j \quad \text{if } \alpha_j^{(0)} < z_i^{(0)} \leq \alpha_{j+1}^{(0)}, \quad j = 1, 2, 3, 4, 5$$
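The design above can be sketched in Python as follows. The sketch assumes the usual ordered-probit normalization $\alpha_1 = -\infty$, $\alpha_2 = 0$ and $\alpha_6 = \infty$ (only $\alpha_3$, $\alpha_4$ and $\alpha_5$ are listed in Table 1), uses a hand-written Cholesky factorization to draw the correlated errors, and relies on an arbitrary seed, so it will not reproduce the authors' sample exactly. All function names are hypothetical.

```python
import random
from math import inf, sqrt

SIGMA = [[1.0, 0.9, 0.7], [0.9, 1.0, 0.6], [0.7, 0.6, 1.0]]
CUTS1 = [-inf, 0.0, 0.304, 0.609, 0.913, inf]  # alpha^(1); alpha_2 = 0 is assumed
CUTS0 = [-inf, 0.0, 0.318, 0.636, 0.953, inf]  # alpha^(0); alpha_2 = 0 is assumed


def cholesky(S):
    """Lower-triangular L with L L' = S, for a small positive definite matrix."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def categorize(z, cuts):
    """Return the category j with cuts[j - 1] < z <= cuts[j]."""
    for j in range(1, len(cuts)):
        if cuts[j - 1] < z <= cuts[j]:
            return j
    raise ValueError("z falls outside the cutpoints")


def simulate(n=5000, seed=1, b0_d=0.0, b1_d=1.0, b1=0.913, b0=0.477):
    """Generate (D, y1, y0) triples from the ordered potential outcome model."""
    rng = random.Random(seed)
    L = cholesky(SIGMA)
    data = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        e = [rng.gauss(0.0, 1.0) for _ in range(3)]
        u, eps1, eps0 = (sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(3))
        d = 1 if b0_d + w * b1_d + u > 0 else 0
        data.append((d, categorize(b1 + eps1, CUTS1), categorize(b0 + eps0, CUTS0)))
    return data


data = simulate()
treated = [row for row in data if row[0] == 1]
print(len(treated) / len(data))
```

Only one of the two potential outcomes would be retained for estimation: $y_i^{(1)}$ when $D_i = 1$ and $y_i^{(0)}$ otherwise.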
With this experimental design, the number of treated versus untreated observations is well-balanced, as 51% of the sample points are assigned to the treatment group. Of those sample points that are assigned to the treatment group, 5%, 5%, 8%, 10% and 71% are associated with ordered outcomes of $y^{(1)} = 1, 2, 3, 4$ and 5, respectively. Likewise, for those observations that do not receive treatment, 46%, 14%, 13%, 10% and 16% of them fall into the categories of $y^{(0)} = 1, 2, 3, 4$ and 5, respectively. We consider this design to be reasonably typical of actual empirical situations, where the outcome variables are not uniformly distributed over the set of possible choices.

We fit our model using the posterior simulator described in Section 3.3, run the algorithm for 3,000 iterations and discard the first 600 draws as the burn-in period. To illustrate the performance of the algorithm, we plot in Fig. 1 the lagged autocorrelations up to order 100 for several selected parameters: $\beta_0^{(D)}$, $\beta^{(1)}$, $\alpha_3^{(0)}$ and $\rho^{(1)}$. The lagged autocorrelation plots are a useful way to assess the mixing of the parameter chains – if the lagged autocorrelations remain close to unity, for example, then the posterior simulator only makes small local movements from iteration to iteration, resulting in inaccurate posterior estimates. As shown in Fig. 1, the lagged autocorrelations drop away reasonably quickly for all the selected parameters, suggesting that posterior quantities can be approximated reasonably accurately with only a moderate number of posterior simulations.

As discussed in Section 3.2, one potential concern about working with the reparameterized model is that we need to impose priors directly on the transformed parameters instead of the structural parameters. This is an important issue because priors that look suitable for the transformed parameters may turn out to imply rather unreasonable (and possibly quite informative) priors for the structural parameters.
For this generated data experiment, we employ the priors described in Section 3.1. We can calculate the implied priors for the structural parameters by first sampling from the priors for the transformed parameters, inverting to obtain the values of the structural parameters and then smoothing the collection of structural parameter values to obtain their approximate marginal prior densities. To demonstrate this process, we plot in Fig. 2 the marginal priors and posteriors for the selected parameters $\beta_0^{(D)}$, $\beta^{(1)}$, $\alpha_3^{(0)}$ and $\rho^{(1)}$. As can be seen clearly from the graphs, the prior densities for all the parameters are almost completely “flat” over the regions where the posterior densities have substantial mass; it is almost impossible for us to visually distinguish between the prior densities and horizontal axes when they are plotted together against the posterior densities. This evidence clearly suggests that the data has provided sufficient information for us to substantially revise our prior beliefs, and that the implied priors for the structural coefficients are sufficiently vague to warrant working with the reparameterized model specification.

Table 1. True Values and Posterior Estimates of the Parameters.

                                                                      True      Posterior Estimates
Parameter                                                             Value     E(·|D)    Std(·|D)   P(·>0|D)

Regression parameters
  $\beta_0^{(D)}$                                                     0         0.0039    0.02       0.576
  $\beta_1^{(D)}$                                                     1         1         0.0252     1
  $\beta^{(1)}$                                                       0.913     0.951     0.0375     1
  $\beta^{(0)}$                                                       0.477     0.468     0.0284     1

Cutpoint values
  $\alpha_3^{(1)}$                                                    0.304     0.315     0.0226     1
  $\alpha_4^{(1)}$                                                    0.609     0.626     0.0267     1
  $\alpha_5^{(1)}$                                                    0.913     0.924     0.0298     1
  $\alpha_3^{(0)}$                                                    0.318     0.33      0.0172     1
  $\alpha_4^{(0)}$                                                    0.636     0.666     0.0242     1
  $\alpha_5^{(0)}$                                                    0.953     0.987     0.0308     1

Correlation parameters
  $\rho^{(1)}$                                                        0.9       0.863     0.0143     1
  $\rho^{(0)}$                                                        0.7       0.668     0.0314     1
  $\rho^{(10)}$                                                       0.6       0.635     0.0665     1

Mean treatment effects (a)
  ATE                                                                 0.136     0.149     0.0131     1
  TT                                                                  0.102     0.113     0.0166     1
  LATE                                                                0.141     0.152     0.0193     1

Probabilities of positive treatment effects
  $\Pr(y^{(1)} > y^{(0)})$                                            0.428     0.452     0.0164     1
  $\Pr(y^{(1)} > y^{(0)} | D(w)=1)$                                   0.439     0.461     0.0283     1
  $\Pr(y^{(1)} > y^{(0)} | D(w)=1, D(\tilde{w})=0)$                   0.535     0.552     0.0236     1

Probabilities of zero treatment effects
  $\Pr(y^{(1)} = y^{(0)})$                                            0.42      0.418     0.0248     1
  $\Pr(y^{(1)} = y^{(0)} | D(w)=1)$                                   0.492     0.479     0.0272     1
  $\Pr(y^{(1)} = y^{(0)} | D(w)=1, D(\tilde{w})=0)$                   0.364     0.363     0.0253     1

Probabilities of negative treatment effects
  $\Pr(y^{(1)} < y^{(0)})$                                            0.152     0.13      0.0171     1
  $\Pr(y^{(1)} < y^{(0)} | D(w)=1)$                                   0.0694    0.0601    0.0145     1
  $\Pr(y^{(1)} < y^{(0)} | D(w)=1, D(\tilde{w})=0)$                   0.102     0.0856    0.0204     1

(a) The mean treatment effects are those defined in Section 4 and capture the treatment impacts on the probability that the outcome variable exceeds the lowest category.

Fig. 1. Lagged Autocorrelations of Simulated Posterior Draws for $\beta_0^{(D)}$, $\beta^{(1)}$, $\alpha_3^{(0)}$ and $\rho^{(1)}$. [Four panels, one per parameter; vertical axis: lag correlation; horizontal axis: lag, up to 100.]

Fig. 2. Marginal Prior (Dashed Lines) and Posterior (Solid Lines) Density Functions for $\beta_0^{(D)}$, $\beta^{(1)}$, $\alpha_3^{(0)}$ and $\rho^{(1)}$. [Four panels, one per parameter; vertical axis: density.]

In Table 1, we report posterior estimates of all the parameters along with their true generated data values. As is evident from the table, all the parameters have been estimated with reasonable accuracy and posterior estimates are quite close to their actual values. As for the non-identified
correlation parameter $\rho^{(10)}$, its marginal posterior places considerable mass over the actual value that was used to generate the data, 0.6. We interpret this finding with substantial caution, however, as our point estimate (e.g., posterior mode) of this parameter is not consistent; there is absolutely no way that we can recover the “true” value of this parameter even in the largest data sets. What is true, however, is that as we learn about the identified $\rho^{(1)}$ and $\rho^{(0)}$ correlation parameters, Eq. (34) restricts the conditional support of $\rho^{(10)}$. For this experimental design, the conditional support bounds for $\rho^{(10)}$ are 0.32 and 0.94, suggesting that there is significant potential for learning about $\rho^{(10)}$. This learning is manifested in the marginal posterior for $\rho^{(10)}$, as the posterior simulations automatically incorporate this support restriction. The fact that most of our posterior mass is centered around the approximate midpoint of this region (0.6) is not informative, and is simply a consequence of the shape of our particular prior density. All the data can do is to reveal the support bounds; beyond this, the prior takes over and affects the shape of the posterior within the support bounds.

To examine if our results are sensitive to alternate prior specifications for the non-identified correlation parameter, we re-estimated our model by specifying a different value of $R$ used in the inverted Wishart prior for the covariance matrix $\tilde{\Sigma}$. Specifically, to reflect the potential prior preference for positive $\rho^{(10)}$, we changed the (2,3) and (3,2) elements of $R$ to 0.5 (which were formerly zero). Interestingly, and as expected from a theoretical point of view, with this many observations, the identified correlation parameters are not affected by this change in prior, and thus the bounds described in Eq. (34) continue to be pinned down rather precisely. The shape of the prior within these bounds does change, however, as the shift toward positive values increases the marginal posterior mean to 0.742 instead of 0.635, as described in the previous table.

In the second part of Table 1 we also list the true values and posterior estimates of the mean treatment effects ATE, TT and LATE. As discussed in Section 4, these quantities summarize the impact of the treatment in increasing (or decreasing) the probability that the outcome exceeds the lowest category. When calculating TT and LATE (which are functions of covariates in the treatment decision), we set $w=0$ for TT, and for LATE, set $w=0$ and $\tilde{w}=-1$. In the bottom portion of Table 1 we also illustrate how parameters beyond mean effects, such as the probabilities of positive, negative or zero treatment impacts for various subpopulations, can be calculated. The fact that the conventional mean treatment parameters are pinned down quite accurately in Table 1 is not surprising – these effects
are purely functions of identified model parameters which are precisely estimated by our simulation algorithm. For the treatment parameters involving the probabilities of positive, negative or zero treatment impacts, we are able to derive reasonably tight bounds around $\rho^{(10)}$, and therefore are able to obtain reasonably accurate estimates of the non-identified quantities of interest. We continue to stress, however, that it is not possible to consistently estimate these quantities, though one could potentially bound them, or as done in this section, one can use prior information to fill the gaps created by the absence of data information.
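To see how the probabilities of positive, zero and negative treatment effects depend on $\rho^{(10)}$, one can simulate the two latent outcomes jointly for a fixed parameter vector. The Python sketch below is a simple simulation under assumptions, not the authors' procedure: it uses the unconditional case only, takes the true values from Table 1 as inputs, and assumes the normalization $\alpha_1 = -\infty$, $\alpha_2 = 0$, $\alpha_6 = \infty$. The function names are hypothetical.

```python
import random
from math import inf, sqrt


def categorize(z, cuts):
    """Return the category j with cuts[j - 1] < z <= cuts[j]."""
    for j in range(1, len(cuts)):
        if cuts[j - 1] < z <= cuts[j]:
            return j
    raise ValueError("z falls outside the cutpoints")


def effect_probs(b1, b0, cuts1, cuts0, rho10, n=100000, seed=0):
    """Simulated Pr(y1 > y0), Pr(y1 = y0) and Pr(y1 < y0) for a given rho10."""
    rng = random.Random(seed)
    s = sqrt(1.0 - rho10 ** 2)
    pos = zero = neg = 0
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        e0 = rho10 * e1 + s * rng.gauss(0.0, 1.0)  # corr(e1, e0) = rho10
        y1 = categorize(b1 + e1, cuts1)
        y0 = categorize(b0 + e0, cuts0)
        if y1 > y0:
            pos += 1
        elif y1 == y0:
            zero += 1
        else:
            neg += 1
    return pos / n, zero / n, neg / n


cuts1 = [-inf, 0.0, 0.304, 0.609, 0.913, inf]
cuts0 = [-inf, 0.0, 0.318, 0.636, 0.953, inf]
print(effect_probs(0.913, 0.477, cuts1, cuts0, rho10=0.6))
```

Re-running the function with `rho10` set to values near the support bounds of Eq. (34), roughly 0.32 and 0.94 in this design, gives a sense of how wide the implied bounds on these probabilities are.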
6. CONCLUSION

In this chapter we introduced a new Bayesian estimation algorithm for fitting a binary treatment, ordered outcome selection model in a potential outcomes framework. Our particular algorithm made use of a reparameterization, building on the suggestion of Nandram and Chen (1996), to accelerate the convergence of our posterior simulator and mitigate problems of slow mixing. Several computational strategies which allowed for non-Normality were also discussed and conventional “treatment effects” such as the ATE, the effect of TT and the LATE were derived for this specific model. We also reviewed how a Bayesian might attempt to expand the focus of her research beyond mean treatment impacts by exploiting a limited degree of learning that takes place about the non-identified cross-regime correlation parameter.
NOTES

1. Heckman, Tobias, and Vytlacil (2003), for example, discuss parametric approaches for estimating a variety of popular treatment effects under various distributional assumptions.
2. Manski’s (1990, 1994) non-parametric bounding is a leading example.
3. See Angrist and Krueger (2001) for a review.
4. Gould (2002, 2005), for example, argues that having strong predictors for treatment status is more important for practical identification purposes than requiring that some set of covariates are excluded from the outcome equation. In applied Bayesian work (e.g., Poirier & Tobias, 2003; Munkin & Trivedi, 2003; Li et al., 2004; Koop & Tobias, 2006), the instrument tends to receive decidedly less discussion. In empirical practice, however, such exclusion restrictions should be, and typically are, used when available.
5. See, for example, Imbens and Angrist (1994) for a discussion of LATE and Heckman and Vytlacil (1999, 2000) for detailed discussions of these and other treatment effects.
6. For related discussions on this topic, see Vijverberg (1993), Koop and Poirier (1997), Poirier (1998), Poirier and Tobias (2003), and Li et al. (2004).
7. One can also conceive of situations where the modeling of count outcomes is desired (e.g., Munkin & Trivedi, 2003). Clearly, approaches to modeling ordered and count outcomes impose different parametric assumptions on the response (e.g., ordered probit versus Poisson or negative binomial distribution), invoke different interpretations of the outcomes of interest (ordinal versus cardinal) and involve different assessments of the censoring feature of the outcomes (censored versus unbounded). Which approach is more appropriate depends critically on the type of application that is considered.
8. In this paper, we assume that the same set of covariates appear in the treated and untreated states. If desired, this assumption could be relaxed and this extension incorporated into the derivations which follow.
9. We discuss how this requirement can be relaxed in Section 3.4 of this chapter. The variances of the errors in all the equations have been normalized to unity for identification purposes.
10. A detailed review of these simulation methods is beyond the scope of this paper; the interested reader is invited to see Casella and George (1992), Tierney (1994), Chib and Greenberg (1995), Gilks, Richardson, and Spiegelhalter (1998), Geweke (1999), Chen, Shao, and Ibrahim (2000), Carlin and Louis (2000), Geweke and Keane (2001), Chib (2001), Koop (2003), Lancaster (2004), Gelman, Carlin, Stern, and Rubin (2004), Poirier and Tobias (2006), and Koop, Poirier, and Tobias (2007) (among others) for detailed and comprehensive descriptions of these and other methods.
11. Note that the largest cutpoints have been taken out of each cutpoint vector and these largest cutpoints are replaced by $\sigma_1$ and $\sigma_0$ in this alternate parameterization.
12. It is useful to note that, conditional on the latent variables, our model is essentially a seemingly unrelated regressions (SUR) model except for the restriction that one diagonal element of the covariance matrix is fixed at one.
13. Note that sampling the cutpoints in this way enforces the ordering restriction on the cutpoint values.
14. For example, one can calculate posterior predictive p-values (Gelman et al., 2004), QQ plots and other standard diagnostic criteria (e.g., Lancaster, 2004; Koop et al., 2007) to evaluate the appropriateness of the Normality assumption. For more on the performance of related models under non-Normality, see, for example, Goldberger (1983) or Paarsch (1984).
15. The inverted Gamma (IG) random variable is parameterized as follows (see Poirier, 1995, p. 111): $p(x) \propto x^{-(a+1)} \exp[-1/(bx)]$.
16. See, e.g., McLachlan and Peel (2000). There is an important issue about local non-identifiability of the mixture model parameters; the parameters are not identified up to a permutation of the mixture components. To aid in identification, priors can be used that impose an ordering restriction on the variance parameters, regression parameters or component probabilities. In some cases, there is little concern for “component switching,” but in other cases, this issue may be a significant concern.
17. In some cases, however, it may be. For example, one could use an ordered model to analyze, say, years of schooling completed, and thus remain true to the integer-valued nature of the education data. In this case, the ordered variable has a natural cardinal interpretation, and thus the conventional ATE parameter would be of interest.
18. In practice, these integrals can be approximated quite accurately (and quickly) using relatively few draws from the truncated Normal distribution. A routine for drawing from such a distribution was provided in step 4 of Section 3.3.
19. Heckman et al. (2003) provide a similar definition of LATE in a parametric latent variable selection model.
20. Related work has sought to expand the focus beyond mean effects and identify outcome gain distributions. See, for example, Heckman and Honoré (1990), Heckman, Smith, and Clements (1997), Heckman and Smith (1998), and Carneiro, Hansen, and Heckman (2003).
21. See Poirier (1998) for more on learning about non-identifiable parameters through prior information. Poirier and Tobias (2000) contain related material describing the implications of prior restrictions on $\rho^{(10)}$.
22. This is particularly true, as Chib and Hamilton (2000) point out, in, say, panel models where most of the variation is captured through fixed or random effects, and one would suspect that any remaining correlation among the unobservables was minimal.
23. Heckman et al. (1997), for example, informally discuss plausible prior beliefs for $\rho^{(10)}$. They write (p. 510) “In considering outcomes like employment and earnings, many plausible models of program participation suggest that outcomes in the treatment state are ‘positively related’ to outcomes in the non-treatment state. … There is a widely held belief that good persons are good at whatever they do.”
24. We do not address the “weak instruments” problem here, but to fix ideas, we consider the case where the instrument plays a significant role in the treatment decision.
ACKNOWLEDGMENTS

We would like to thank two anonymous referees, the editor Ed Vytlacil and participants at the 4th Annual Advances in Econometrics Conference for helpful comments and suggestions. All errors are our own.
REFERENCES Angrist, J. D., & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15, 69–85. Albert, J., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Carlin, B., & Polson, N. (1991). Inference for nonconjugate Bayesian models using the Gibbs sampler. Canadian Journal of Statistics, 19, 399–405. Carlin, B. P., & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). Chapman & Hall/CRC. Carneiro, P., Hansen, K., & Heckman, J. (2003). Estimating distributions of treatment effects with an application to the returns to schooling and measurement of the effects of uncertainty on college choice. International Economic Review, 44(2), 361–422. Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46, 167–174. Chen, M.-H., Shao, Q.-M., & Ibrahim, J. G. (2000). Monte Carlo methods in Bayesian computation. Springer-Verlag. Chib, S. (2001). Markov chain Monte Carlo methods: Computation and inference. In: J. J. Heckman & E. Leamer (Eds), Handbook of econometrics (Vol. 5, pp. 3569–3649). North-Holland: Elsevier Science. Chib, S., & Greenberg, E. (1995). Understanding the Metropolis–Hastings algorithm. The American Statistician, 49, 327–335. Chib, S., & Hamilton, B. H. (2000). Bayesian analysis of cross-section and clustered data treatment models. Journal of Econometrics, 97(1), 25–50. Chib, S., & Hamilton, B. H. (2002). Semiparametric Bayes analysis of longitudinal data treatment models. Journal of Econometrics, 110, 67–89. Cowles, M. (1996). Accelerating monte carlo markov chain convergence for cumulative-link generalized linear models. Statistics and Computing, 6, 101–111. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Chapman & Hall/CRC. Geweke, J. (1993). Bayesian treatment of the independent student-t linear model. Journal of Applied Econometrics, S19–S40. Geweke, J. (1999). Using simulation methods for Bayesian econometric models: Inference, development and communication (with discussion and reply). Econometric Reviews, 18, 1–127. Geweke, J., & Keane, M. (2001). 
Computationally intensive methods for integration in econometrics. In: J. J. Heckman & E. Leamer (Eds), Handbook of econometrics (Vol. 5, pp. 3463–3568). North-Holland: Elsevier Science. Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds). (1998). Markov Chain Monte Carlo in practice. Boca Raton: Chapman & Hall/CRC. Goldberger, A. S. (1983). Abnormal selection bias. In: S. Karlin, T. Amemiya & L. Goodman (Eds), Studies in econometrics, time series and multivariate statistics (pp. 67–84). New York: Academic Press. Gould, E. D. (2002). Rising wage inequality, comparative advantage, and the growing importance of general skills in the United States. Journal of Labor Economics, 20(1), 105–147. Gould, E. D. (2005). Inequality and ability. Labour Economics, 12, 169–189. Heckman, J. (2004). Micro data, heterogeneity and the evaluation of public policy, part 1. The American Economist, 48(2), 3–25. Heckman, J., & Honoré, B. (1990). The empirical content of the Roy model. Econometrica, 50, 1121–1149. Heckman, J., & Sedlacek, G. L. (1985). Heterogeneity, aggregation and market wage functions: An empirical model of self-selection in the labor market. Journal of Political Economy, 93(6), 1077–1125.
Heckman, J., & Smith, J. (1998). Evaluating the welfare state. NBER Working Paper no. 6542, National Bureau of Economic Research, Inc. Heckman, J., Smith, J., & Clements, N. (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. Review of Economic Studies, 64, 487–535. Heckman, J., Tobias, J. L., & Vytlacil, E. (2003). Simple estimators for treatment parameters in a latent-variable framework. Review of Economics and Statistics, 85(3), 748–755. Heckman, J., & Vytlacil, E. (1999). Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences, 96, 4730–4734. Heckman, J., & Vytlacil, E. (2000). The relationship between treatment parameters within a latent variable framework. Economics Letters, 66, 33–39. Imbens, G., & Angrist, J. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–475. Koop, G. (2003). Bayesian econometrics. Wiley. Koop, G., & Poirier, D. J. (1997). Learning about the across-regime correlation in switching regression models. Journal of Econometrics, 78, 217–227. Koop, G., Poirier, D. J., & Tobias, J. L. (2007). Bayesian econometrics. Cambridge University Press. Koop, G., & Tobias, J. L. (2006). Semiparametric Bayesian inference in smooth coefficient models. Journal of Econometrics, 134, 283–315. Lancaster, T. (2004). An introduction to modern Bayesian econometrics. Blackwell. Li, K. (1998). Bayesian inference in a simultaneous equation model with limited dependent variables. Journal of Econometrics, 85(2), 387–400. Li, M., Poirier, D. J., & Tobias, J. L. (2004). Do dropouts suffer from dropping out? Estimation and prediction of outcome gains in generalized selection models. Journal of Applied Econometrics, 19, 203–225. Li, M., & Tobias, J. L. (2006). Bayesian analysis of structural effects in an ordered equation system. 
Studies in Nonlinear Dynamics and Econometrics, 10(4). Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press. Manski, C. (1990). Nonparametric bounds on treatment effects. American Economic Review (Papers and Proceedings), 80, 319–323. Manski, C. (1994). The selection problem. In: C. Sims (Ed.), Advances in econometrics: Sixth World Congress (pp. 143–170). Cambridge: Cambridge University Press. McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley. Munkin, M. K., & Trivedi, P. K. (2003). Bayesian analysis of a self-selection model with multiple outcomes using simulation-based estimation: An application to the demand for healthcare. Journal of Econometrics, 114(2), 197–220. Nandram, B., & Chen, M.-H. (1996). Reparameterizing the generalized linear model to accelerate Gibbs sampler convergence. Journal of Statistical Computation and Simulation, 54, 129–144. Nobile, A. (2000). Comment: Bayesian multinomial probit models with a normalization constraint. Journal of Econometrics, 99(2), 335–345. Paarsch, H. J. (1984). A monte carlo comparison of estimators for censored regression models. Journal of Econometrics, 24, 197–213.
Bayesian Analysis of Treatment Effects
Poirier, D. J. (1995). Intermediate statistics and econometrics. Cambridge: MIT Press. Poirier, D. J. (1998). Revising beliefs in non-identified models. Econometric Theory, 14, 483–509. Poirier, D. J., & Tobias, J. L. (2000). Across-regime covariance restrictions in treatment response models. Working Paper. Department of Economics, University of California-Irvine. Poirier, D. J., & Tobias, J. L. (2003). On the predictive distributions of outcome gains in the presence of an unidentified parameter. Journal of Business and Economic Statistics, 21(2), 258–268. Poirier, D. J., & Tobias, J. L. (2006). Bayesian econometrics. In: K. Patterson & T. C. Mills (Eds), Palgrave handbook of econometrics: Theoretical econometrics (Vol. 1, pp. 841–870). New York: Palgrave Macmillan. Roy, A. D. (1951). Some thoughts on the distribution of earnings. Oxford Economic Papers, 3, 135–146. Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540. Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Annals of Statistics, 22, 1701–1762. Vijverberg, W. P. M. (1993). Measuring the unidentified parameter of the extended Roy model of selectivity. Journal of Econometrics, 57, 69–89.
INSTRUMENTAL VARIABLES ESTIMATION OF THE AVERAGE TREATMENT EFFECT IN THE CORRELATED RANDOM COEFFICIENT MODEL

Jeffrey M. Wooldridge

ABSTRACT

I propose a general framework for instrumental variables estimation of the average treatment effect in the correlated random coefficient model, focusing on the case where the treatment variable has some discreteness. The approach involves adding a particular function of the exogenous variables to a linear model containing interactions in observables, and then using instrumental variables for the endogenous explanatory variable. I show how the general approach applies to binary and Tobit treatment variables, including the case of multiple treatments.
Modelling and Evaluating Treatment Effects in Econometrics. Advances in Econometrics, Volume 21, 93–116. Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved. ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00004-7

1. INTRODUCTION

Recent work has focused on estimating the population average of the random slope coefficient in the correlated random coefficient (CRC) model.
In the CRC model with a single endogenous explanatory variable, the explanatory variable of interest – which I call the “treatment” variable for brevity – interacts with unobserved heterogeneity, and the treatment variable and unobserved heterogeneity are generally correlated. In models where unobserved heterogeneity interacts with an endogenous explanatory variable, the leading approach to estimation is the so-called control function (CF) approach. In the CF approach, the expected value of the response (y) given the exogenous covariates (x), the exogenous instrumental variables (z), and the endogenous treatment variable (w) is found. This leads to an estimating equation that contains a CF. Regression-based methods, requiring two-step estimation, are typically applied.

The two leading examples of the CF approach are with a continuous treatment and a binary treatment. Garen (1984) considered the case where the treatment variable is continuous and, in particular, has a conditionally homoskedastic normal distribution with linear conditional expectation. Wooldridge (1997) demonstrated that, under weaker assumptions than those used by Garen, the standard instrumental variable (IV) estimator that ignores the interaction between unobservables and the treatment variable consistently estimates the average treatment effect (ATE). Heckman and Vytlacil (1998), in a framework that extends Garen’s (1984) by allowing multiple treatments, used assumptions weaker than those in Wooldridge (1997) to derive estimators based on inserting first-stage fitted values as alternatives to standard IV. Wooldridge (2003) further considered the continuous, multiple treatment case and relaxed the assumptions under which standard IV estimators identify the ATEs. As things currently stand, the usual IV estimator is known to be consistent under assumptions weaker than those assumed by Heckman and Vytlacil (1998) for their plug-in estimators.
(Card (2001) reviews various methods for estimating the average return to education in a model where the return is individual-specific.)

When the treatment is binary – in which case the CRC model becomes the switching regression model – the assumptions used by Wooldridge (1997, 2003) and Heckman and Vytlacil (1998) are known not to hold, and the usual IV estimator is generally inconsistent for the ATE. Nevertheless, via simulations, Angrist (1991) found that, in a model without additional covariates, the usual IV estimator can provide a good estimate of the ATE, even when sufficient conditions for consistency of IV do not hold. But Angrist’s simulations are special: in addition to considering only a handful of possible generating mechanisms, Angrist considers only a framework without exogenous covariates. (In Section 4.1, I shed some light on Angrist’s simulation findings.)
A more common approach to estimation in the binary treatment case is to find a suitable CF – a function of the exogenous variables and the endogenous treatment – and add it to the regression. Under joint normality assumptions on the unobservables, the CF can be found (and is a function of the inverse Mills ratio). The two-step method proposed by Heckman (1976) produces a consistent, asymptotically normal estimator of the ATE. Heckman, Tobias, and Vytlacil (2003) derive the CF under less restrictive assumptions. Semiparametric approaches have also been developed; see Heckman and Navarro-Lozano (2004) for references.

In this chapter, I develop a middle ground between the usual IV approach and the traditional CF approaches. In particular, I show how the usual equation estimated by IV can be augmented by a “correction function,” which depends only on exogenous variables, in order to produce a consistent estimator of the ATE. (I use the name “correction function” so as to avoid confusion between the new approach and the traditional CF approach.) In a parametric setting, obtaining the new correction function requires making a distributional assumption on the reduced form of the treatment variable. Nevertheless, the new IV estimator is simple and leads to a straightforward specification test for the usual IV estimator. Plus, the auxiliary distribution can be made flexible; in fact, the form of the estimating equation suggests the possibility of semiparametric methods that would require fewer assumptions.

As a special case, I derive a new IV estimator for the ATE in the switching regression model when the treatment follows a probit reduced form. As I show, the assumptions needed to establish consistency of the new IV estimator are weaker than those for the usual CF approach. But this comes at a price: if the available IV has too little variation, the new approach loses identification, whereas the traditional CF approach would still identify the ATE.
As a new example, I develop a simple, new IV estimator of the ATE when the reduced form of the treatment is a standard censored Tobit model. An important benefit of the current approach is that it more easily applies in situations with multiple endogenous treatments. In particular, only a reduced form for each treatment separately needs to be specified. One need not compute expectations conditional on the entire set of treatments – often an intractable problem. The rest of the chapter is organized as follows. Section 2 introduces the setup for the case of a single treatment, and derives estimating equations under fairly general assumptions. I look at the issue of obtaining the correction function in Section 3. Section 4 considers the special cases of treatments with probit and Tobit reduced form distributions. I take up
multiple treatments in Section 5. Section 6 contains concluding remarks, including some conjectures on how the approach can be made less parametric.
2. MODEL, ASSUMPTIONS, AND ESTIMATING EQUATIONS FOR A SINGLE TREATMENT

I begin with the case of a univariate treatment variable, $w$, and a univariate response, $y$. I adopt the setup in Wooldridge (2003), where I considered the continuous treatment case. The structural model is

$$E(y \mid a, b, w) = a + bw \tag{1}$$

where the intercept, $a$, and the slope, $b$, can depend on both observed and unobserved heterogeneity. The interesting case is when $b$ actually depends on unobserved heterogeneity; otherwise, standard IV methods can be applied. Given Eq. (1), one can always write the equation in error form as

$$y = a + bw + e, \qquad E(e \mid a, b, w) = 0 \tag{2}$$

For a random draw $i$ from the population, one can write

$$y_i = a_i + b_i w_i + e_i \tag{3}$$

which emphasizes that the intercept and slope are individual-specific. Except when $w$ is binary, Eq. (1) imposes a functional form restriction: the expectation is linear in $w$. Nevertheless, Eq. (1) encompasses the usual random coefficient model; it is more general because I have not (yet) specified how $a$ and $b$ depend on observed and unobserved heterogeneity. I call Eq. (1) the CRC model because $b$ and $w$ are allowed to be arbitrarily correlated. Card (2001) shows how a theoretical model for wage determination and schooling decisions leads to a CRC model relating log earnings to schooling.

Because $b$ is generally a function of unobserved heterogeneity, one cannot hope to estimate the slope, $b_i$, for any particular cross-sectional unit, $i$. Instead, I focus on the average partial effect or ATE,

$$\beta \equiv E(b) = E(b_i) \tag{4}$$
where the average is over the relevant population. (As will become clear below, one can also estimate the ATE conditional on the observed
covariates, $x$: $\mathrm{ATE}(x) \equiv E(b \mid x)$.) In Wooldridge (2004), I studied estimation of $\beta$ under ignorability assumptions, which effectively means one has good predictors of treatment or good proxies for the unobserved heterogeneity. Here my approach to estimating $\beta$ is based on the availability of IVs, $z$, in addition to covariates, $x$. My first assumption concerning the covariates and the IVs is a standard exclusion restriction:

Assumption 1. The covariates, $x$, and the IVs, $z$, are redundant in Eq. (1):

$$E(y \mid a, b, w, x, z) = E(y \mid a, b, w) \tag{5}$$
While it may not look especially familiar, Assumption 1 is implied by all other assumptions used for IV estimation of the CRC model. That the covariates $x$ are excluded from Eq. (5) is really no assumption at all; I will let $a$ and $b$ both be functions of $x$. The real restriction in Eq. (5) is that the instruments are assumed to be redundant in the structural conditional expectation. This is entirely analogous to the usual kinds of exclusion restrictions that are routinely made for IV estimation.

Assumption 1 does not distinguish between the covariates, $x$, and the IVs, $z$. The next assumption clearly separates the two and defines the important sense in which $z$ is exogenous: conditional on observed covariates, the unobserved heterogeneity is mean independent of $z$. I also impose linear functional form assumptions.

Assumption 2. $z$ is redundant for $a$ and $b$, conditional on $x$. Further, the conditional expectations are linear in $x$:

$$E(a \mid x, z) = E(a \mid x) = \gamma_0 + x\gamma \tag{6}$$

$$E(b \mid x, z) = E(b \mid x) = \beta + (x - \mu_x)\delta \tag{7}$$

where $\mu_x \equiv E(x)$ is the population mean of $x$.

The first equalities in Eqs. (6) and (7) show that while the intercept and slope may depend on the observed covariates, each is mean independent of $z$, conditional on $x$. Therefore, determining the observables that belong in $x$ and those that can be included in $z$ amounts to the usual reasoning about exogeneity and exclusion restrictions. If $y$ is an earnings measure and $w$ is years of schooling, $x$ typically includes experience, tenure, demographic variables such as marital status and race, and possibly geographic controls. (Several of these controls are arguably endogenous; I discuss the case of multiple endogenous treatments in Section 5.) Important unobserved factors
contained in $a$ and $b$ might include cognitive ability and motivation. Therefore, one needs instruments such that the unobserved heterogeneity is mean independent of the instruments once observed productivity and demographic characteristics are controlled for. Angrist and Krueger (1991) suggest quarter of birth indicators (although they do not have a very rich set of controls) while Card (1995), who has a good set of controls, proposes college proximity indicators as instruments.

Eqs. (6) and (7) show expectations that are linear in $x$, but this is for convenience. Any expectations that are linear in parameters lead to convenient estimators. I could allow for expectations that are nonlinear in parameters but, as will become clear, that would complicate the estimation procedure. Under Eqs. (6) and (7), one can write

$$a = \gamma_0 + x\gamma + c, \qquad E(c \mid x, z) = 0 \tag{8}$$

$$b = \beta + (x - \mu_x)\delta + v, \qquad E(v \mid x, z) = 0 \tag{9}$$
Further, with $e$ defined as in Eq. (2), $E(e \mid a, b, w, x, z) = 0$ under Assumption 1, which, of course, implies $E(e \mid x, z) = 0$. Plugging Eqs. (8) and (9) into Eq. (2) gives

$$y = \gamma_0 + x\gamma + \beta w + w(x - \mu_x)\delta + c + wv + e \tag{10}$$
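The substitution behind Eq. (10) can be written out term by term. The following is a short worked sketch that uses only Eqs. (2), (8), and (9); the last line records why demeaning $x$ makes the coefficient on $w$ equal to the ATE.

```latex
\begin{aligned}
y &= a + bw + e \\
  &= (\gamma_0 + x\gamma + c) + \bigl[\beta + (x-\mu_x)\delta + v\bigr]\,w + e \\
  &= \gamma_0 + x\gamma + \beta w + w(x-\mu_x)\delta + c + wv + e, \\[4pt]
E(b) &= \beta + \bigl[E(x)-\mu_x\bigr]\delta + E(v) = \beta .
\end{aligned}
```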
By demeaning the covariates $x$ before interacting them with $w$, I ensure that $\beta$ is the ATE. The composite error in Eq. (10) is $c + wv + e$. Under the assumptions I have made, $E(c \mid x, z) = E(e \mid x, z) = 0$. Standard IV estimation of Eq. (10) is complicated by the interaction term $wv$ because, generally, $E(wv \mid x, z) \neq 0$. When $w$ is a continuous treatment, it is possible – see Wooldridge (1997, 2003) and Heckman and Vytlacil (1998) – that the conditional covariance does not depend on $(x, z)$. Wooldridge (2003) used Assumptions 1, 2, and

$$\mathrm{Cov}(w, b \mid x, z) = \alpha_0 \tag{11}$$
to show that a standard IV estimator consistently estimates $\beta$. In other words, $w$ and $b$ are allowed to be arbitrarily correlated, but the conditional covariance cannot depend on $(x, z)$. Card (2001) argues that, when $y = \log(\text{earnings})$ and $w = \text{schooling}$, Eq. (11) does not appear to be a good assumption. Using a binary indicator for proximity to a four-year college as the instrument, and using IQ score as a proxy for unobserved ability (contained in $b$), Card rejects Eq. (11).
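To see what Eq. (11) buys, the following is a minimal simulation sketch, not taken from the chapter. The data-generating process, the parameter values, and the function names (`simulate_crc`, `iv_slope`, `ols_slope`) are all assumptions chosen for illustration: the treatment is continuous, there are no covariates, and $\mathrm{Cov}(w, b \mid z)$ is the constant 0.8, so the simple IV ratio should settle near $\beta = 2$ while OLS should not.

```python
import random

def simulate_crc(n, seed=0):
    """Draw (y, w, z) from an illustrative CRC model that satisfies Eq. (11)."""
    rng = random.Random(seed)
    ys, ws, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                  # instrument
        u = rng.gauss(0.0, 1.0)                  # reduced-form error of the treatment
        v = 0.8 * u + 0.6 * rng.gauss(0.0, 1.0)  # slope heterogeneity, Cov(u, v) = 0.8
        c = 0.5 * u + 0.5 * rng.gauss(0.0, 1.0)  # intercept heterogeneity
        w = 1.0 + z + u                          # continuous treatment
        b = 2.0 + v                              # random slope with E(b) = beta = 2
        ys.append((1.0 + c) + b * w + rng.gauss(0.0, 1.0))
        ws.append(w)
        zs.append(z)
    return ys, ws, zs

def cov(p, q):
    """Sample covariance (divisor n) of two equal-length lists."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    return sum((a - mp) * (b - mq) for a, b in zip(p, q)) / n

def iv_slope(y, w, z):
    """Simple IV estimator with one instrument: Cov(z, y) / Cov(z, w)."""
    return cov(z, y) / cov(z, w)

def ols_slope(y, w):
    """OLS slope from regressing y on w and a constant."""
    return cov(w, y) / cov(w, w)
```

In this design the term $wv$ has conditional mean 0.8 for every $z$, so it is absorbed by the intercept and the instrument remains uncorrelated with the composite error. OLS, by contrast, picks up the correlation of $w$ with both $c$ and $wv$.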
So, even with a (roughly) continuous treatment variable, Eq. (11) can be restrictive. Using the CF approach, Wooldridge (2005) shows how Eq. (11) can be relaxed when $w$ is continuous, but the approach does not work when $w$ has notable discreteness.

How might one relax Eq. (11) to allow discrete or corner solution treatments? Eq. (10) suggests a general strategy. Let

$$g(x, z) \equiv E(wv \mid x, z) \tag{12}$$

$$r \equiv wv - g(x, z) \tag{13}$$

and plug into Eq. (10):

$$y = \gamma_0 + x\gamma + \beta w + w(x - \mu_x)\delta + g(x, z) + q \tag{14}$$

where

$$q = c + r + e \tag{15}$$
Because each component of $q$ has a zero mean conditional on $(x, z)$, Eq. (14) can be estimated by IV, where $g(x, z)$ serves as its own instrument. The problem is that $g(x, z)$ is usually unknown. This motivates the following assumption.

Assumption 3. For a known parametric function $h(x, z; \theta)$ and parameter $\rho$, $E(wv \mid x, z) = \rho\, h(x, z; \theta)$.

Under Assumptions 1–3, one can write, in the population,

$$y = \gamma_0 + x\gamma + \beta w + w(x - \mu_x)\delta + \rho\, h(x, z; \theta) + q \tag{16}$$

$$E(q \mid x, z) = 0 \tag{17}$$
This means that any function of $(x, z)$ is valid as an IV in Eq. (16), including $x$ and $h(x, z; \theta)$. Therefore, assuming that $\hat\theta$ is a consistent estimator of $\theta$, one can consistently estimate $\beta$ by applying 2SLS to the equation

$$y_i = \gamma_0 + x_i\gamma + \beta w_i + w_i(x_i - \bar{x})\delta + \rho\, h(x_i, z_i; \hat\theta) + \text{error}_i \tag{18}$$

where, for emphasis, I include $i$ to denote a generic cross-section observation. Natural IVs are

$$[1,\; x_i,\; z_i,\; x_{i1}x_i, \ldots, x_{iK}x_i,\; z_{i1}x_i, \ldots, z_{iL}x_i,\; h(x_i, z_i; \hat\theta)] \tag{19}$$
or the smaller set

$$[1,\; x_i,\; \hat{w}_i,\; \hat{w}_i x_i,\; h(x_i, z_i; \hat\theta)] \tag{20}$$

where $\hat{w}_i$ is an estimate – and not necessarily a consistent one – of $E(w_i \mid x_i, z_i)$. If $\hat\theta$ is $\sqrt{N}$-asymptotically normal, so is the IV estimator under standard regularity conditions.

I call $h(x, z; \theta)$ a “correction function” because adding it to the estimating equation corrects for the omitted variable bias that plagues the usual IV estimator. Even if we treat $\theta$ as known, Eqs. (16) and (17) will sometimes fail to identify $\beta$. For example, if $E(w \mid x, z)$ is a multiple of $h(x, z; \theta)$ then the rank condition for IV estimation fails. This implies that a fully nonparametric approach to $g(x, z)$ does not allow identification of $\beta$. In this chapter, I rely heavily on parametric assumptions. In the conclusion I suggest some ways that identification of $\beta$ may hold when $g(x, z)$ is restricted to a certain class of functions – motivated by features of $h(x, z; \theta)$ in parametric cases – but, at this stage, the discussion is just conjecture.

It is important to see that the “correction function” $h(x, z; \theta)$ is not what is traditionally called a “control function” in the context of the CRC model. The key difference is that here I compute the expected value of $wv$ conditional only on $(x, z)$, whereas the usual CF approach finds $E(wv \mid w, x, z)$; see, for example, Heckman (1976), Lee (1978), Garen (1984), Heckman and Robb (1986), Vella and Verbeek (1999), and Heckman et al. (2003). Consequently, adding $h(x, z; \theta)$ to the original equation does not produce an estimating equation in which $w$ and $w(x - \mu_x)$ are exogenous. IVs are still needed for the terms involving $w$ in Eq. (18).

The approach here is also distinct from Smith and Blundell (1986) for Tobit responses and Rivers and Vuong (1988) for binary responses, for a few reasons. First, Eq. (2) along with linear conditional expectations for the heterogeneity typically rules out cases where $y$ is discrete. Second, Smith and Blundell (1986) and Rivers and Vuong (1988) assume a continuous treatment variable; here I focus on treatments with some discreteness.
Finally, Smith and Blundell (1986) and Rivers and Vuong (1988) adopt CF approaches under essentially full distributional assumptions. Several more recent semiparametric and nonparametric approaches to endogenous explanatory variables in nonlinear models also take the CF approach. In effect, they seek to relax the distributional assumptions in Smith and Blundell (1986) and Rivers and Vuong (1988). These include Blundell and Powell (2004) and Imbens and Newey (2002). Importantly, such approaches are not generally applicable to treatments with discreteness
because the treatment is assumed to be a strictly monotonic function of the underlying latent variable. (In standard binary response and corner solution models, monotonicity fails.) Chesher (2003) studies identification under quantile restrictions, but he also assumes a strictly monotonic relationship between the treatment and latent variable. Chesher (2005) explicitly allows discrete treatments, but he is only able to identify intervals for the ATE. Besides, Chesher (2005) restricts attention to discrete responses. The nonparametric IV framework in Chernozhukov, Imbens, and Newey (2004) allows for continuous responses and discrete treatments, but the authors assume that the response is a strictly monotonic function of a scalar unobserved heterogeneity, something violated by the random coefficient model I study here (which has two unobserved heterogeneity terms). In the context of quantile treatment effects, Chernozhukov and Hansen (2005) explicitly consider a binary treatment without the strict monotonicity assumption in the unobserved heterogeneity. But Chernozhukov and Hansen assume either that there is only one unobserved heterogeneity term or that the heterogeneity terms in the two states (treated and untreated) are identically distributed conditional on exogenous variables and unobservables affecting treatment. I do not require such an assumption.

In some early literature on the self-selection problem – see, in particular, Barnow, Cain, and Goldberger (1980) – a “control function” was defined differently from today’s usage. Rather than being a function of the endogenous treatment, the “control function” was a function of exogenous variables, and adding it to a regression was assumed to render the treatment variable exogenous in the estimating equation. In other words, treatment was assumed to be “ignorable” conditional on observed covariates. Here, I am not asserting that adding $h(x, z; \theta)$ renders $w$ exogenous in Eq. (16): my “correction function” is a function only of exogenous variables, and its inclusion in Eq. (16) does not make $w$ exogenous.

In addition to estimating $\beta$, the unconditional ATE, one might be interested in the elements of $\delta$. This is because the ATE conditional on $x$ is $E(b \mid x) = \beta + (x - \mu_x)\delta$, which means one can estimate the ATE for any value of $x$. For example, one can estimate the effect of job training by race or gender, or find the effect for people with pre-training earnings below a certain amount. Similarly, one might want to estimate the effect of class size by race, gender, or family income.

There are two sources of estimation error in Eq. (18). The first is due to replacing the population mean, $E(x)$, by the sample average, $\bar{x}$. As discussed in Wooldridge (2002, Chapter 6), this estimation error is typically swamped
by variation in the error $q$ in Eq. (16); therefore, one can safely ignore having estimated $E(x)$. More relevant is estimation of $\theta$ in the first stage. If $\rho \neq 0$, the asymptotic variance of $\sqrt{N}(\hat\theta - \theta)$ generally affects the limiting distributions of the IV estimators from Eq. (18). However, when $\rho = 0$, preliminary estimation of $\theta$ can be ignored, at least as far as first-order asymptotics are concerned (see Wooldridge, 2002, Chapter 6). This means testing $H_0\colon \rho = 0$ is simple: one just uses the usual t-statistic, or the heteroskedasticity-robust form, from IV estimation. If $\rho \neq 0$ is suspected, a correction should be made for the generated regressor problem. It is now well known how to make such corrections. See Newey and McFadden (1994) and Wooldridge (2002, Chapter 6) for the delta method approach. The bootstrap can also be used.
3. A GENERAL APPROACH TO FINDING THE CORRECTION FUNCTION

By making a parametric assumption about the distribution of $w$ given $(x, z)$, and by making a particular conditional mean assumption, one can find the correction function $h(x, z; \theta)$ in Assumption 3 quite generally. While this entails making more assumptions than desirable, it does allow identification of the ATE and covers some popular and new setups. I assume throughout that Assumptions 1 and 2 hold and seek sufficient conditions for Assumption 3. In particular, I assume

$$w = f(x, z, u; \alpha) \tag{21}$$

$$u \mid x, z \sim G(\cdot\,; \eta) \tag{22}$$

$$E(v \mid u, x, z) = E(v \mid u) = \rho u \tag{23}$$

where $f(\cdot\,; \alpha)$ is a known function and $G(\cdot\,; \eta)$ is a known distribution with density $g(\cdot\,; \eta)$. In other words, $w$ is a known function of the observed exogenous variables ($x$ and $z$), an unobservable ($u$), and a vector of parameters ($\alpha$). For concreteness, I assume that $u$ is a continuous, univariate random variable, but nothing important hinges on this assumption.

Eqs. (21) and (22) imply a parametric density for $D(w \mid x, z)$, the distribution of $w$ given $(x, z)$, and so the assumptions are restrictive. Nevertheless, the functions and distributions can be chosen to be flexible. As I show in Section 4, under standard models for the treatment variable new IV
estimators are available that have distinct advantages over existing approaches.

It may seem odd to impose a distribution on $D(w \mid x, z)$ while allowing the distributions $D(a \mid x, z)$ and $D(b \mid x, z)$ to be unspecified aside from their first moments. But there is an important difference between $w$ and $(a, b)$: $w$ is observed but $(a, b)$ is not. Because $w$ is observed, one knows whether it is binary or a corner solution, and therefore good models for $D(w \mid x, z)$ suggest themselves. By contrast, $a$ and $b$ are unobserved. My approach here is similar in spirit to propensity-score based treatment effect estimators under ignorable selection, where the probability of treatment is modeled without taking a stand on the distribution of unobserved heterogeneity.

Applying the IV method outlined in Section 2 requires finding $E(wv \mid x, z)$. Now, by iterated expectations,

$$
\begin{aligned}
E(wv \mid x, z) &= E[E(wv \mid u, x, z) \mid x, z] = E[w\,E(v \mid u, x, z) \mid x, z] = \rho\, E(wu \mid x, z) \\
&= \rho\, E[f(x, z, u; \alpha)\,u \mid x, z] = \rho \int_{\mathbb{R}} f(x, z, m; \alpha)\, m\, g(m; \eta)\, dm \\
&\equiv \rho\, h(x, z; \theta)
\end{aligned}
\tag{24}
$$

where $h(x, z; \theta)$ is the integral in Eq. (24). In principle, $h(x, z; \theta)$ can be obtained because $f(\cdot\,; \alpha)$ and $g(\cdot\,; \eta)$ are specified. As mentioned above, Eqs. (21) and (22) imply that one can obtain the density of $w$ given $(x, z)$, which depends on $\alpha$ and $\eta$. Assuming identification, $\alpha$ and $\eta$ can be estimated by maximum likelihood.

For concreteness, I have assumed a linear conditional expectation, $E(v \mid u) = \rho u$. Linearity holds when $(u, v)$ is bivariate normal, but it can be a good approximation in other cases as well. It is fairly straightforward to allow $E(v \mid u)$ to be a low-order polynomial in $u$ (with known order) – for example, a quadratic – under the restriction that $E(v) = 0$.
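When the integral in Eq. (24) has no convenient closed form, it can be approximated numerically at each value of $(x, z)$. The sketch below is an illustration under stated assumptions rather than part of the chapter: `moment_integral` and `threshold_treatment` are hypothetical names, $f$ is passed as a function of the integration variable only (with $x$, $z$, and $\alpha$ fixed by closure), the density is taken to be standard normal, and a plain midpoint rule on $[-10, 10]$ stands in for the integral over the real line. The `power` argument covers the higher moments $E(w u^j \mid x, z)$ that a polynomial $E(v \mid u)$ would require.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal, used here as the assumed density g

def moment_integral(f, g, power=1, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of f(m) * m**power * g(m) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        m = lo + (k + 0.5) * width  # midpoint of the k-th cell
        total += f(m) * m ** power * g(m)
    return total * width

def threshold_treatment(index):
    """Hypothetical f: a binary treatment equal to one when index + m >= 0."""
    return lambda m: 1.0 if index + m >= 0 else 0.0

index = 0.7  # illustrative value of the linear index in (x, z)
h1 = moment_integral(threshold_treatment(index), N01.pdf, power=1)
h2 = moment_integral(threshold_treatment(index), N01.pdf, power=2)
print(h1, N01.pdf(index))                                # should nearly agree
print(h2, N01.cdf(index) - index * N01.pdf(index))       # should nearly agree
```

For this threshold treatment with a standard normal $u$, direct integration gives the standard normal density at the index for `power=1`, and the standard normal distribution function at the index minus the index times the density for `power=2`, so the printed pairs provide a quick accuracy check on the quadrature.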
4. APPLICATIONS

I now cover two cases where the function $h(x, z; \theta)$ can be found (and is not constant).

4.1. Probit Treatment Variable

Suppose that $w$ is a binary treatment, in which case the setup is the switching regression model with endogenous switching variable, $w$. In addition to
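Assumptions 1 and 2 and Eq. (23), I make particular choices in Eqs. (21) and (22):

$$w = 1[\theta_0 + x\theta_1 + z\theta_2 + u \geq 0] \tag{25}$$

$$u \mid x, z \sim \text{Normal}(0, 1) \tag{26}$$

Eqs. (25) and (26) imply that $w$ given $(x, z)$ follows a standard probit model. Now Eq. (24) can be used to find the function $h(x, z; \theta)$ in Assumption 3. Let $\mathbf{r} = (1, x, z)$, so that $\mathbf{r}\theta = \theta_0 + x\theta_1 + z\theta_2$. Then

$$
\begin{aligned}
E(wu \mid x, z) &= \int_{-\infty}^{\infty} 1[\mathbf{r}\theta + m \geq 0]\, m\, \phi(m)\, dm = \int_{-\mathbf{r}\theta}^{\infty} m\, \phi(m)\, dm \\
&= -\phi(m)\Big|_{-\mathbf{r}\theta}^{\infty} = \phi(-\mathbf{r}\theta) = \phi(\mathbf{r}\theta)
\end{aligned}
\tag{27}
$$

where $\phi(\cdot)$ is the standard normal density. Therefore, the correction function $h(x, z; \theta)$ in Assumption 3 is simply $h(x, z; \theta) = \phi(\mathbf{r}\theta)$, and Eq. (16) can be written as

$$y = \gamma_0 + x\gamma + \beta w + w(x - \mu_x)\delta + \rho\,\phi(\mathbf{r}\theta) + q \tag{28}$$

Estimation of $\theta$ is straightforward from a first-stage probit of $w$ on $(x, z)$. The first-stage probit also suggests a natural instrument for $w$; see Eq. (20).

Procedure 1.

(i) Estimate $\theta_0$, $\theta_1$, and $\theta_2$ from a probit of $w_i$ on $(1, x_i, z_i)$. Form the predicted probabilities, $\hat\Phi_i$, along with $\hat\phi_i = \phi(\hat\theta_0 + x_i\hat\theta_1 + z_i\hat\theta_2)$, $i = 1, 2, \ldots, N$.

(ii) Estimate the equation

$$y_i = \gamma_0 + x_i\gamma + \beta w_i + w_i(x_i - \bar{x})\delta + \rho\,\hat\phi_i + \text{error}_i \tag{29}$$

by IV, using instruments $[1,\; x_i,\; \hat\Phi_i,\; \hat\Phi_i(x_i - \bar{x}),\; \hat\phi_i]$.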
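The two steps of Procedure 1 can be sketched in code. What follows is a hedged, standard-library illustration on simulated data, not the chapter's own implementation: the function names (`solve`, `probit_fit`, `procedure1`, `simulate`), the Fisher-scoring probit routine, the restriction to one scalar covariate and one scalar instrument, and every parameter value in `simulate` are assumptions made for the example. With a scalar $x$ the instrument list in step (ii) has as many elements as there are regressors, so the IV estimator reduces to solving $Z'X\,b = Z'y$. Standard errors, and the generated-regressor adjustment discussed in Section 2, are not computed.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal: N01.pdf is phi, N01.cdf is Phi

def solve(A, b):
    """Solve the square linear system A t = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                fac = M[r][col] / M[col][col]
                M[r] = [mr - fac * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def probit_fit(rows, w, iters=10):
    """Step (i): probit MLE of w on rows (1, x_i, z_i), computed by Fisher scoring."""
    k = len(rows[0])
    theta = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, wi in zip(rows, w):
            idx = sum(t * ri for t, ri in zip(theta, r))
            p = min(max(N01.cdf(idx), 1e-10), 1.0 - 1e-10)
            d = N01.pdf(idx)
            g = d * (wi - p) / (p * (1.0 - p))  # score weight for this observation
            h = d * d / (p * (1.0 - p))         # expected-information weight
            for j in range(k):
                score[j] += g * r[j]
                for m in range(k):
                    info[j][m] += h * r[j] * r[m]
        theta = [t + s for t, s in zip(theta, solve(info, score))]
    return theta

def procedure1(y, w, x, z):
    """Step (ii): IV estimation of Eq. (29) with scalar x and z (exactly identified)."""
    n = len(y)
    theta = probit_fit([[1.0, xi, zi] for xi, zi in zip(x, z)], w)
    xbar = sum(x) / n
    k = 5
    ZX = [[0.0] * k for _ in range(k)]
    Zy = [0.0] * k
    for yi, wi, xi, zi in zip(y, w, x, z):
        idx = theta[0] + theta[1] * xi + theta[2] * zi
        Phi, phi = N01.cdf(idx), N01.pdf(idx)
        regs = [1.0, xi, wi, wi * (xi - xbar), phi]   # regressors in Eq. (29)
        ivs = [1.0, xi, Phi, Phi * (xi - xbar), phi]  # instruments listed in step (ii)
        for j in range(k):
            Zy[j] += ivs[j] * yi
            for m in range(k):
                ZX[j][m] += ivs[j] * regs[m]
    g0, g1, beta, delta, rho = solve(ZX, Zy)
    return {"theta": theta, "beta": beta, "delta": delta, "rho": rho}

def simulate(n, seed=0):
    """Illustrative DGP: beta = 1, delta = 0.5, rho = 0.6, probit index 0.5 + 0.3x + 1.5z."""
    rng = random.Random(seed)
    y, w, x, z = [], [], [], []
    for _ in range(n):
        xi, zi, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        v = 0.6 * u + 0.5 * rng.gauss(0, 1)  # E(v | u) = 0.6 u, as in Eq. (23)
        c = 0.5 * u + 0.5 * rng.gauss(0, 1)  # intercept heterogeneity, correlated with u
        wi = 1.0 if 0.5 + 0.3 * xi + 1.5 * zi + u >= 0 else 0.0
        a = 2.0 + 0.3 * xi + c
        b = 1.0 + 0.5 * xi + v               # mu_x = 0, so E(b) = 1
        y.append(a + b * wi + 0.5 * rng.gauss(0, 1))
        w.append(wi); x.append(xi); z.append(zi)
    return y, w, x, z
```

Calling `procedure1(*simulate(20000))` returns a dictionary with the probit estimates and the IV estimates of $\beta$, $\delta$, and $\rho$; in this design the $\beta$ estimate should land near 1. The instrument here is continuous, which matters because identification requires the instrument to take more than two values.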
Once one has the IV candidates, $z_i$, Procedure 1 is very simple to implement. An important feature of Eq. (29) is that it simply adds $\hat\phi_i$ to the usual equation estimated by IV – that is, the equation that ignores the interaction term $w[b - E(b \mid x, z)]$. (Eq. (29) also includes the interaction term $w_i(x_i - \bar{x})$, which is often omitted from analyses of linear models estimated by IV.) If one ignores the estimation error in $\bar{x}$, which, as discussed in Section 2, seems reasonable, then the heteroskedasticity-robust t-statistic on $\hat\phi_i$ is an asymptotically valid test of $H_0\colon \rho = 0$. Therefore, one can test whether $b$ and $w$ are uncorrelated, conditional on $(x, z)$, very simply. Importantly,
when $\rho = 0$, consistency of Procedure 1 does not hinge on $w$ given $(x, z)$ following a probit, even though the IVs were chosen under this assumption. If $\rho \neq 0$ then the probit assumption is generally necessary and, as mentioned in Section 2, the standard error of $\hat\beta$ should be adjusted for the first-stage estimation of $\theta$.

For the estimation to be convincing, one should have IVs $z$ that satisfy the exclusion and exogeneity restrictions in Assumptions 1 and 2. While the procedure does go through without $z$ – provided $x_i$ varies sufficiently in the sample – identification of $\beta$ hinges on the nonlinearity (in $x$) of $\Phi(\theta_0 + x\theta_1)$ and $\phi(\theta_0 + x\theta_1)$. This kind of identification is tenuous and is usually discouraged.

While $w$ given $(x, z)$ is assumed to follow a probit model, there are no other full distributional assumptions used in Procedure 1. I do assume $E(v \mid u)$ is linear, although that could be relaxed to allow a general polynomial: one would compute $E(w u^j \mid x, z)$ for various values of $j$, which is tractable because of the normality of $u$. Except for linearity of some conditional expectations, the distributions of $a$ and $b$ – the unobserved heterogeneity – have not been restricted. In particular, $D(y \mid x, z)$ can be fairly general.

The usual two-step approach to estimating the switching regression model derives $E(y \mid x, z, w)$ for $w = 1$ and $w = 0$ (see, for example, Heckman, 1976; Lee, 1978; Vella, 1993; Vella & Verbeek, 1999). This requires restricting the joint distribution of $(u, a, b)$ conditional on $(x, z)$; typically, joint normality is assumed. Then, a generalized residual – which plays the role of the CF – is added to the usual equation, which is then estimated by OLS. Procedure 1 offers a correction function approach that has some advantages over the usual CF approach. First, as mentioned above, while I explicitly restrict $D(w \mid x, z)$, the distribution of the unobserved heterogeneity is not restricted beyond linearity of some conditional means.
Second, Procedure 1 allows one to separate the effects of endogeneity of $w$ from the effects of a nonconstant treatment effect.

Another benefit of Procedure 1 is that it shows how the usual IV estimator could consistently estimate $\beta$ even when $b \neq \beta$ and $b$ depends on unobserved heterogeneity in an arbitrary way. To see how, suppose there are no covariates, $x$, and just a single IV, $z$. (This imposes a strong form of exogeneity on $z$: $a$ and $b$ must be mean independent of $z$ without conditioning on any observables.) Then Eq. (28) becomes

$$y = \gamma_0 + \beta w + \rho\,\phi(\theta_0 + \theta_1 z) + q, \qquad E(q \mid z) = 0 \tag{30}$$

The usual IV approach drops $\phi(\theta_0 + \theta_1 z)$ and uses $z$ as an IV for $w$. As is well known, if the IV is uncorrelated with an omitted variable, the IV estimator is
still consistent. Therefore, if $z$ and $\phi(\theta_0 + \theta_1 z)$ are uncorrelated, the usual IV estimator consistently estimates $\beta$. It may seem unlikely that $z$ and $\phi(\theta_0 + \theta_1 z)$ are uncorrelated, but in practically relevant situations the correlation could be small. For example, suppose that $z$ is symmetrically distributed about zero and $\theta_0 = 0$. Then $z$ and $\phi(\theta_1 z)$ are uncorrelated, and ignoring the term $(b - \beta)w$ would consistently estimate $\beta$.

Interestingly, this framework helps to explain some of Angrist’s (1991) simulation findings for the usual IV estimator. One of Angrist’s designs assumes that $w$ given $z$ follows a probit with $\theta_0 = 0$, where $z$ is a discrete uniform random variable taking on eight values from $-3.5$ to $3.5$. In this setup, $z$ and $\phi(\theta_1 z)$ are uncorrelated because $z$ is symmetrically distributed about zero, and so the IV estimator that omits the correction function is consistent for the ATE. This explains why Angrist finds in simulations that the IV estimator is almost unbiased when the sample size is 800.

Eq. (30) illustrates a shortcoming of the new correction function approach compared with the traditional CF approach. Namely, Eq. (30) does not identify $\beta$ if $b \neq \beta$ and $z$ is binary. Identification fails because, if $z$ takes only two values, then $z$ and $\phi(\theta_0 + \theta_1 z)$ are perfectly collinear, and so the rank condition fails in Eq. (30). If the IV takes on more than two values then, under the maintained assumptions, the ATE is identified.

Adding the correction function to the IV estimation should provide some indication of how ATE and LATE (Imbens & Angrist, 1994) relate to each other if the parametric restrictions are satisfied. Plus, while Procedure 1 allows for exogenous covariates, analysis of LATE is more difficult when additional covariates, $x$, are required so that $a$ and $b$ are mean independent of $z$, conditional on $x$. (Abadie (2003) makes recent headway on estimating local treatment effects when covariates are needed for the ignorability conditions on the instruments to hold.)
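The two facts about Eq. (30) used above, zero correlation under a symmetric instrument with $\theta_0 = 0$ and perfect collinearity under a binary instrument, can be checked directly on the support points. The sketch below is illustrative only. It assumes the eight-point design is equally spaced ($-3.5, -2.5, \ldots, 3.5$), which the text does not state explicitly, and the values of `theta0` and `theta1` are hypothetical.

```python
from statistics import NormalDist

phi = NormalDist().pdf  # standard normal density

def covariance(xs, ys):
    """Population covariance over equally weighted support points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n

def correlation(xs, ys):
    """Population correlation over equally weighted support points."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

theta0, theta1 = 0.5, 0.4  # hypothetical probit coefficients

# Symmetric eight-point instrument (assumed equally spaced).
z_sym = [k - 3.5 for k in range(8)]
cov_symmetric = covariance(z_sym, [phi(theta1 * z) for z in z_sym])        # theta0 = 0
cov_shifted = covariance(z_sym, [phi(theta0 + theta1 * z) for z in z_sym])  # theta0 != 0

# Binary instrument: phi(theta0 + theta1 * z) is an exact linear function of z.
z_bin = [0.0, 1.0]
corr_binary = correlation(z_bin, [phi(theta0 + theta1 * z) for z in z_bin])

print(cov_symmetric, cov_shifted, corr_binary)
```

With $\theta_0 = 0$ the covariance vanishes up to rounding, because $z\,\phi(\theta_1 z)$ is an odd function on a symmetric support. With a nonzero intercept it does not vanish, and with a two-point instrument the correlation is exactly $\pm 1$, which is the rank failure described above.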
A good example of this is the return to a college degree for the population of working, high school graduates. One might propose distance to the nearest college as an IV for a binary college indicator, as in Card (1995). Unfortunately, distance to the nearest college is correlated with IQ score (an indicator of unobserved ability) unless, at a minimum, geographic location is controlled for. As a brief illustration of Procedure 1, I use the data from Card (1995), which is one of the datasets that comes with Wooldridge (2002). The dependent variable is log of monthly earnings. I restrict attention to high school graduates and study the return to having a four-year college degree. The data, for men born in 1950, are for 1976. I dropped observations with missing data on parents’ education, giving a final sample size of 1,938.
Instrumental Variables Estimation of the Average Treatment Effect
For simplicity, I use the special case $\boldsymbol{\delta} = \mathbf{0}$ in Eq. (29); that is, I do not allow for interactions between having a college degree and the observed covariates. The dependent variable is log earnings and, following Card (1995), the elements of $\mathbf{x}$ are a constant, a quadratic in experience, indicators for being black, living in the south, and living in an SMSA, and regional and SMSA indicators for 1966, corresponding to each man's place of residence at age 16. For my purposes, the indicator for living near a four-year college at age 16, nearc4, is not satisfactory as a lone IV for the college dummy, college, for two reasons. First, as I discussed above, without other covariates the ATE would not be identified. Second, with the other covariates included in a probit, college has a positive relationship with nearc4 but the t-statistic on nearc4 is only 1.81. Therefore, I add mother's and father's years of schooling to the instrument list. Each has a t-statistic over 4.5.
After estimating the probit model in the first stage, I obtain the $\hat{\Phi}_i$ and $\hat{\phi}_i$. For comparison, I first estimate Eq. (29) without the term $\hat{\phi}_i$, using $\hat{\Phi}_i$ as an IV for $college_i$. The estimate of $\beta$ is 0.535 (standard error = 0.089). When $\hat{\phi}_i$ is added to the equation (and acts as its own instrument), the ATE estimate falls to 0.448 (standard error = 0.085). The coefficient on $\hat{\phi}_i$ is $\hat{\rho} = 0.409$ (standard error = 0.122). Therefore, the correction term is highly significant.
The correlation between $\hat{\phi}_i$ and $\hat{\Phi}_i$ is about 0.380. Further, the R-squared from the regression of $\hat{\phi}_i$ on $\hat{\Phi}_i$ and the exogenous variables in the wage equation is about 0.524. In other words, including $\hat{\phi}_i$ in Eq. (29) does not introduce intractable multicollinearity in this example. Having almost 2,000 observations helps, too. As an analysis of estimating the return to a college degree, this exercise clearly has its limitations.
Assuming parents’ education is exogenous to the wage equation is highly questionable (although it had been done before). My main purpose is to show that the method can produce sensible results when used as an extension to standard IV estimation.
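The mechanics of adding the correction function to an IV regression can be sketched on simulated data. The sketch below is only an illustration under stated assumptions, not the estimation reported above: it uses the no-covariate case, treats the probit index coefficients as known instead of estimating them by a first-stage probit, and all parameter values, the function names `solve_3x3` and `simulate_and_estimate`, and the error structure are hypothetical. The regressors are $[1, w, \phi(\theta_0 + \theta_1 z)]$ and the instruments are $[1, \Phi(\theta_0 + \theta_1 z), \phi(\theta_0 + \theta_1 z)]$, so the system is just identified.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .pdf is phi, .cdf is Phi


def solve_3x3(a, b):
    """Solve the 3x3 linear system a @ x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def simulate_and_estimate(n=20000, seed=12345, gamma0=1.0, beta=1.0,
                          rho=1.0, theta0=0.5, theta1=1.0):
    """Simulate y = a + b*w with a binary probit treatment and estimate by IV.

    Hypothetical design: z takes eight equally likely values -3.5, ..., 3.5;
    theta0 and theta1 are treated as known (an assumption of this sketch).
    """
    rng = random.Random(seed)
    z_values = [-3.5 + k for k in range(8)]
    zx = [[0.0] * 3 for _ in range(3)]  # sum over i of instruments_i' regressors_i
    zy = [0.0] * 3                      # sum over i of instruments_i' y_i
    for _ in range(n):
        z = rng.choice(z_values)
        u = rng.gauss(0.0, 1.0)
        index = theta0 + theta1 * z
        w = 1.0 if index + u >= 0 else 0.0
        v = rho * u + rng.gauss(0.0, 0.5)   # slope heterogeneity, E(v|u) = rho*u
        c = 0.5 * u + rng.gauss(0.0, 1.0)   # intercept heterogeneity, correlated with u
        y = gamma0 + c + (beta + v) * w
        regressors = [1.0, w, STD_NORMAL.pdf(index)]
        instruments = [1.0, STD_NORMAL.cdf(index), STD_NORMAL.pdf(index)]
        for j in range(3):
            zy[j] += instruments[j] * y
            for k in range(3):
                zx[j][k] += instruments[j] * regressors[k]
    gamma0_hat, beta_hat, rho_hat = solve_3x3(zx, zy)
    return gamma0_hat, beta_hat, rho_hat


gamma0_hat, beta_hat, rho_hat = simulate_and_estimate()
print(gamma0_hat, beta_hat, rho_hat)
```

With the default settings the estimate of the ATE should be close to the value of `beta` used to generate the data; a practical implementation would replace the known index with first-stage probit estimates and would report heteroskedasticity-robust standard errors.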
4.2. Tobit Treatment Variable
I now turn to estimation of $\beta$ when the treatment, $w$, follows a reduced form Tobit model. This allows, say, the number of hours spent in a job-training program to be the treatment (as opposed to just a participation indicator).
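I maintain Assumptions 1 and 2, and Eq. (23), but replace Eqs. (25) and (26) with
$$w = \max(0,\ \alpha_0 + \mathbf{x}\boldsymbol{\alpha}_1 + \mathbf{z}\boldsymbol{\alpha}_2 + u) \tag{31}$$
$$u \mid \mathbf{x}, \mathbf{z} \sim \mathrm{Normal}(0, \sigma^2) \tag{32}$$
which imply that $w$ given $(\mathbf{x}, \mathbf{z})$ follows a standard Tobit model. Again, these assumptions could be relaxed, along with Eq. (23), but they are standard and are useful to illustrate the approach. Again, the key is in deriving the correction function $h(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ now consists of $\alpha_0$, $\boldsymbol{\alpha}_1$, $\boldsymbol{\alpha}_2$, and $\sigma^2$. The algebra is more tedious than in the probit case. One must evaluate
$$\int_{-\infty}^{\infty} \max(0,\ \alpha_0 + \mathbf{x}\boldsymbol{\alpha}_1 + \mathbf{z}\boldsymbol{\alpha}_2 + m)\, m\, \frac{1}{\sigma}\,\phi\!\left(\frac{m}{\sigma}\right) dm \tag{33}$$
Remarkably, after several steps, Eq. (33) reduces to
$$E(wu \mid \mathbf{x}, \mathbf{z}) = \sigma^2\, \Phi\!\left(\frac{\mathbf{r}\boldsymbol{\alpha}}{\sigma}\right) \tag{34}$$
where $\mathbf{r} = (1, \mathbf{x}, \mathbf{z})$, as before. Therefore, Eq. (16) becomes
$$y = \gamma_0 + \mathbf{x}\boldsymbol{\gamma} + \beta w + w(\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta} + \rho\left[\sigma^2\, \Phi\!\left(\frac{\mathbf{r}\boldsymbol{\alpha}}{\sigma}\right)\right] + q \tag{35}$$
Further, because
$$E(w \mid \mathbf{x}, \mathbf{z}) = \Phi\!\left(\frac{\mathbf{r}\boldsymbol{\alpha}}{\sigma}\right)(\mathbf{r}\boldsymbol{\alpha}) + \sigma\, \phi\!\left(\frac{\mathbf{r}\boldsymbol{\alpha}}{\sigma}\right)$$
the following is a natural IV procedure.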
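Procedure 2.
(i) Estimate $\alpha_0$, $\boldsymbol{\alpha}_1$, $\boldsymbol{\alpha}_2$, and $\sigma^2$ from a Tobit of $w_i$ on $(1, \mathbf{x}_i, \mathbf{z}_i)$. Form $\hat{\Phi}_i \equiv \Phi(\mathbf{r}_i\hat{\boldsymbol{\alpha}}/\hat{\sigma})$, $\hat{\phi}_i \equiv \phi(\mathbf{r}_i\hat{\boldsymbol{\alpha}}/\hat{\sigma})$, and $\hat{w}_i \equiv \hat{\Phi}_i(\mathbf{r}_i\hat{\boldsymbol{\alpha}}) + \hat{\sigma}\hat{\phi}_i$, $i = 1, 2, \ldots, N$.
(ii) Estimate the equation
$$y_i = \gamma_0 + \mathbf{x}_i\boldsymbol{\gamma} + \beta w_i + w_i(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta} + \rho(\hat{\sigma}^2 \hat{\Phi}_i) + error_i \tag{36}$$
by IV, using instruments $[1, \hat{w}_i, \mathbf{x}_i, \hat{w}_i(\mathbf{x}_i - \bar{\mathbf{x}}), \hat{\Phi}_i]$.
Procedure 2 gives a way to consistently estimate the ATE in the general CRC model when the treatment follows a Tobit reduced form. Most of the remarks following Procedure 1 apply to Procedure 2. In particular, a simple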
t-test, probably made robust to heteroskedasticity, is valid for testing $H_0\colon \rho = 0$. When $b = \beta$, so that the partial effect of $w$ is constant, Vella (1993) proposed a consistent estimation method when $w$ follows a Tobit model. Vella uses the CF approach by finding $E(y \mid w, \mathbf{x}, \mathbf{z})$, just as in the standard approach to switching regression models for binary $w$. Implementing Vella's approach involves adding a generalized residual, which is obtained under joint normality of all unobserved errors, to the usual equation and using ordinary least squares. Procedure 2 applied to the case of constant treatment effect is different from Vella's estimator: because $\rho = 0$, one simply drops the correction function, $\hat{\sigma}^2\hat{\Phi}_i$, and uses standard IV. Therefore, even when restricted to the constant treatment effect case, Procedure 2 is more robust than Vella's approach. While the IVs for $w_i$ [and $w_i(\mathbf{x}_i - \bar{\mathbf{x}})$] are obtained under a Tobit reduced form for $w_i$, the resulting IV estimator is consistent even if the Tobit assumption fails. Nevertheless, if $\mathrm{Var}(q \mid \mathbf{x}, \mathbf{z})$ is constant and $w$ given $(\mathbf{x}, \mathbf{z})$ follows a Tobit, Procedure 2 produces the optimal IV estimator, and inference is standard. In Vella's approach, one must account for the generated regressor.
As in the binary treatment case, a convincing analysis requires an exclusion restriction in the structural equation. That is, at least one element of $\mathbf{z}$ should be significant in the reduced form Tobit (and, by assumption, $\mathbf{z}$ is excluded from the structural equation). There is one difference between the probit and Tobit cases that may be practically important. As discussed earlier, when $w$ follows a probit model, the instrument is a strictly increasing function in the index, $\mathbf{r}\boldsymbol{\alpha}$, while the correction function is not. When $w$ follows a Tobit model, $E(wu \mid \mathbf{x}, \mathbf{z})$ and $E(w \mid \mathbf{x}, \mathbf{z})$ are both strictly increasing functions of $\mathbf{r}\boldsymbol{\alpha}$, although the former has an asymptote of $\sigma^2$ while the latter is unbounded with an increasing slope.
Hopefully, the estimated index $\mathbf{r}_i\hat{\boldsymbol{\alpha}}$ has sufficient variation so that multicollinearity is not a problem.
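As a rough numerical check on the two Tobit expressions used in this subsection, the sketch below compares the closed forms $E(wu \mid \mathbf{x}, \mathbf{z}) = \sigma^2\Phi(\mathbf{r}\boldsymbol{\alpha}/\sigma)$ and $E(w \mid \mathbf{x}, \mathbf{z}) = \Phi(\mathbf{r}\boldsymbol{\alpha}/\sigma)(\mathbf{r}\boldsymbol{\alpha}) + \sigma\phi(\mathbf{r}\boldsymbol{\alpha}/\sigma)$ with a midpoint-rule integration over $u \sim \mathrm{Normal}(0, \sigma^2)$. The function names, the integration grid, and the index values are illustrative assumptions; `index` stands for a given value of $\mathbf{r}\boldsymbol{\alpha}$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .pdf is phi, .cdf is Phi


def tobit_moments_closed_form(index, sigma):
    """Return (E[w*u], E[w]) for w = max(0, index + u), u ~ Normal(0, sigma^2)."""
    k = index / sigma
    e_wu = sigma ** 2 * STD_NORMAL.cdf(k)
    e_w = STD_NORMAL.cdf(k) * index + sigma * STD_NORMAL.pdf(k)
    return e_wu, e_w


def tobit_moments_numeric(index, sigma, half_width=10.0, steps=40000):
    """Approximate the same two moments by midpoint-rule integration over u."""
    density = NormalDist(0.0, sigma)
    lo = -half_width * sigma
    h = (2.0 * half_width * sigma) / steps
    e_wu = 0.0
    e_w = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * h
        w = max(0.0, index + u)
        weight = density.pdf(u) * h
        e_wu += w * u * weight
        e_w += w * weight
    return e_wu, e_w


for index in (-1.0, 0.0, 2.0):
    print(index, tobit_moments_closed_form(index, 1.5), tobit_moments_numeric(index, 1.5))
```

The closed form for $E(wu \mid \mathbf{x}, \mathbf{z})$ is bounded above by $\sigma^2$, whereas $E(w \mid \mathbf{x}, \mathbf{z})$ grows without bound in the index, which is the contrast the text relies on when discussing multicollinearity.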
5. MULTIPLE TREATMENTS
An important advantage of the current approach over the usual CF approach is that the new approach allows any number of treatments, which can have different distributional characteristics (continuous, binary, Tobit), with very little additional work. Provided one is willing to assume a reduced form for each treatment, one needs to make assumptions only about univariate distributions and compute univariate integrals.
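Let $\mathbf{w}$ be the $1 \times M$ row vector of treatments and let $\mathbf{b}$ be the column vector of slopes. Write
$$E(y \mid a, \mathbf{b}, \mathbf{w}) = a + \mathbf{w}\mathbf{b} \tag{37}$$
The first two assumptions are straightforward extensions of Assumptions 1 and 2:
Assumption 4. $E(y \mid a, \mathbf{b}, \mathbf{w}, \mathbf{x}, \mathbf{z}) = E(y \mid a, \mathbf{b}, \mathbf{w})$.
Assumption 5. $\mathbf{z}$ is redundant for $a$ and $\mathbf{b}$, conditional on $\mathbf{x}$. Further, the conditional expectations are linear in $\mathbf{x}$:
$$E(a \mid \mathbf{x}, \mathbf{z}) = E(a \mid \mathbf{x}) = \gamma_0 + \mathbf{x}\boldsymbol{\gamma} \tag{38}$$
and, for $j = 1, \ldots, M$,
$$E(b_j \mid \mathbf{x}, \mathbf{z}) = E(b_j \mid \mathbf{x}) = \beta_j + (\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta}_j \qquad \square \tag{39}$$
Now, with multiple treatments, write
$$y = \gamma_0 + \mathbf{x}\boldsymbol{\gamma} + \beta_1 w_1 + w_1(\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta}_1 + \cdots + \beta_M w_M + w_M(\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta}_M + w_1 v_1 + \cdots + w_M v_M + c + e \tag{40}$$
where $E(c \mid \mathbf{x}, \mathbf{z}) = E(e \mid \mathbf{x}, \mathbf{z}) = 0$. In order to handle the terms $w_j v_j$ I modify Assumption 3:
Assumption 6. For $j = 1, \ldots, M$,
$$w_j = f_j(\mathbf{x}, \mathbf{z}, u_j; \boldsymbol{\alpha}_j) \tag{41}$$
$$u_j \mid \mathbf{x}, \mathbf{z} \sim G_j(\cdot\,; \gamma_j) \tag{42}$$
$$E(v_j \mid u_j, \mathbf{x}, \mathbf{z}) = E(v_j \mid u_j) = \rho_j u_j \qquad \square \tag{43}$$
As in Section 3, Assumption 6 allows one to obtain $E(w_j v_j \mid \mathbf{x}, \mathbf{z})$ for each $j$:
$$E(w_j v_j \mid \mathbf{x}, \mathbf{z}) = \rho_j E[f_j(\mathbf{x}, \mathbf{z}, u_j; \boldsymbol{\alpha}_j)\, u_j \mid \mathbf{x}, \mathbf{z}] \equiv \rho_j h_j(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}_j) \tag{44}$$
where $\boldsymbol{\theta}_j$ includes $\boldsymbol{\alpha}_j$ and $\gamma_j$. Given Eqs. (41) and (42), the parameters $\boldsymbol{\alpha}_j$ and $\gamma_j$ can generally be estimated by MLE. Importantly, this can be a univariate MLE for each $j$. In some cases, one might want to have each $w_j$ depend on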
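common unobservables, as would be the case if the $w_j$ are indicators following an ordered probit model.
Under Assumptions 4–6, the estimating equation is
$$y = \gamma_0 + \mathbf{x}\boldsymbol{\gamma} + \beta_1 w_1 + w_1(\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta}_1 + \cdots + \beta_M w_M + w_M(\mathbf{x} - \boldsymbol{\mu}_x)\boldsymbol{\delta}_M + \rho_1 h_1(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}_1) + \cdots + \rho_M h_M(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}_M) + q \tag{45}$$
where $E(q \mid \mathbf{x}, \mathbf{z}) = 0$. Assuming that one can obtain consistent, $\sqrt{N}$-asymptotically normal estimators of $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M$, a two-step procedure is clear. First, estimate each $\boldsymbol{\theta}_j$ by maximum likelihood. Then, estimate
$$y_i = \gamma_0 + \mathbf{x}_i\boldsymbol{\gamma} + \beta_1 w_{i1} + w_{i1}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_1 + \cdots + \beta_M w_{iM} + w_{iM}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_M + \rho_1 h_1(\mathbf{x}_i, \mathbf{z}_i; \hat{\boldsymbol{\theta}}_1) + \cdots + \rho_M h_M(\mathbf{x}_i, \mathbf{z}_i; \hat{\boldsymbol{\theta}}_M) + error_i \tag{46}$$
by IV. A joint test of the last $M$ terms helps to determine whether the slope coefficients do depend on unobserved heterogeneity.
Example 1. Suppose $\{w_j : j = 1, \ldots, M\}$ is a set of binary indicators where each follows a probit reduced form. Provided one has enough instruments (at least $M$) with enough variation, $M$ terms of the form
$$\hat{\phi}_{ij} = \phi(\mathbf{r}_i\hat{\boldsymbol{\theta}}_j), \qquad \mathbf{r}_i \equiv (1, \mathbf{x}_i, \mathbf{z}_i) \tag{47}$$
can be added to the original equation. So, the estimating equation becomes
$$y_i = \gamma_0 + \mathbf{x}_i\boldsymbol{\gamma} + \beta_1 w_{i1} + w_{i1}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_1 + \cdots + \beta_M w_{iM} + w_{iM}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_M + \rho_1\hat{\phi}_{i1} + \cdots + \rho_M\hat{\phi}_{iM} + error_i \tag{48}$$
and the instruments are
$$[1, \mathbf{x}_i, \hat{\Phi}_{i1}, \hat{\Phi}_{i1}(\mathbf{x}_i - \bar{\mathbf{x}}), \ldots, \hat{\Phi}_{iM}, \hat{\Phi}_{iM}(\mathbf{x}_i - \bar{\mathbf{x}}), \hat{\phi}_{i1}, \ldots, \hat{\phi}_{iM}] \tag{49}$$
Again, one can test the $M$ terms $\hat{\phi}_{i1}, \ldots, \hat{\phi}_{iM}$ for joint significance using a standard Wald test in the context of IV estimation, probably made robust to heteroskedasticity.
One can modify the procedure to apply when the treatment variable is an ordered outcome, that is, when the $w_j$ represent mutually exclusive, increasing intensity levels. For example, the $w_j$ can be the indicators from an ordered probit model (with $w_0 = 1$ denoting the base or control group).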
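Then, letting $\mathbf{r} \equiv (\mathbf{x}, \mathbf{z})$ (and no constant),
$$\begin{aligned}
w_0 &= 1[\mathbf{r}\boldsymbol{\xi} + u \le \alpha_1] \\
w_1 &= 1[\alpha_1 < \mathbf{r}\boldsymbol{\xi} + u \le \alpha_2] = 1[u \le \alpha_2 - \mathbf{r}\boldsymbol{\xi}]\, 1[-u < \mathbf{r}\boldsymbol{\xi} - \alpha_1] \\
w_2 &= 1[\alpha_2 < \mathbf{r}\boldsymbol{\xi} + u \le \alpha_3] = 1[u \le \alpha_3 - \mathbf{r}\boldsymbol{\xi}]\, 1[-u < \mathbf{r}\boldsymbol{\xi} - \alpha_2] \\
&\ \ \vdots \\
w_M &= 1[\mathbf{r}\boldsymbol{\xi} + u > \alpha_M] = 1[-u < \mathbf{r}\boldsymbol{\xi} - \alpha_M]
\end{aligned} \tag{50}$$
where $u$ has a standard normal distribution and the $\alpha_j$ are the cut points. If one assumes Eq. (43) (with $u_j = u$ for all $j$), then essentially the same calculation as in Section 4.1 gives
$$E(w_j v_j \mid \mathbf{x}, \mathbf{z}) = \rho_j[\phi(\mathbf{r}\boldsymbol{\xi} - \alpha_j) - \phi(\alpha_{j+1} - \mathbf{r}\boldsymbol{\xi})], \qquad j = 1, \ldots, M - 1$$
$$E(w_M v_M \mid \mathbf{x}, \mathbf{z}) = \rho_M \phi(\mathbf{r}\boldsymbol{\xi} - \alpha_M)$$
Because $\phi(\cdot)$ is symmetric about zero, it suffices to define
$$\hat{\phi}_{ij} = \phi(\mathbf{r}_i\hat{\boldsymbol{\xi}} - \hat{\alpha}_j), \qquad j = 1, \ldots, M$$
and use these in Eqs. (48) and (49), where $\hat{\boldsymbol{\xi}}$ and $\hat{\alpha}_1, \ldots, \hat{\alpha}_M$ are the ordered probit MLEs. (That is, one may use a nonsingular linear transformation of the original functions, $\hat{E}(w_j v_j \mid \mathbf{x}, \mathbf{z})$.)
A more general approach with an ordered outcome is to define a new set of indicators,
$$s_j \equiv w_0 + w_1 + \cdots + w_j, \quad j = 1, \ldots, M - 1; \qquad s_M \equiv w_M$$
If the ordered probit model is true (and, more generally, even when the "parallel" assumption imposed by ordered probit fails), each $s_j$ follows a probit model. So, one can estimate a probit for $s_j$, $j = 1, \ldots, M$, with coefficients, say, $\hat{\boldsymbol{\psi}}_j$ (which now includes an intercept). Then, just use the $\hat{\phi}_{ij}$ computed from the probits for $s_{ij}$ in Eqs. (48) and (49). Because the $s_j$ are just linear functions of the $w_j$, this is the same as including the estimated $E(w_j v_j \mid \mathbf{x}, \mathbf{z})$. However, note that one is probably most interested in the coefficients on the $w_j$, which measure the ATE of the $j$th treatment level relative to the control group, and this argues for leaving the $w_j$ in Eq. (48).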
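The ordered probit correction terms can be checked numerically. Under the definitions above, with $u$ standard normal and $w_j$ equal to one when $\mathbf{r}\boldsymbol{\xi} + u$ falls between cut points $\alpha_j$ and $\alpha_{j+1}$, the quantity $E(w_j u \mid \mathbf{x}, \mathbf{z})$ is the integral of $u\phi(u)$ over that interval, which equals $\phi(\alpha_j - \mathbf{r}\boldsymbol{\xi}) - \phi(\alpha_{j+1} - \mathbf{r}\boldsymbol{\xi})$ (with the second term absent for the top category). The sketch below compares this with a midpoint-rule integral; the function names, cut points, and index value are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .pdf is phi


def e_wj_u_closed_form(index, cuts, j):
    """E[w_j * u] for u ~ N(0, 1); cuts = [alpha_1, ..., alpha_M], j in 1..M."""
    lower = cuts[j - 1] - index
    if j == len(cuts):                 # top category: u > alpha_M - index
        return STD_NORMAL.pdf(lower)
    upper = cuts[j] - index
    return STD_NORMAL.pdf(lower) - STD_NORMAL.pdf(upper)


def e_wj_u_numeric(index, cuts, j, steps=40000, tail=10.0):
    """Midpoint-rule integral of u * phi(u) over the interval where w_j = 1."""
    lower = cuts[j - 1] - index
    if j < len(cuts):
        upper = cuts[j] - index
    else:
        upper = max(lower, 0.0) + tail  # truncate the infinite upper limit
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        u = lower + (i + 0.5) * h
        total += u * STD_NORMAL.pdf(u) * h
    return total


cuts = [-0.5, 0.4, 1.2]   # hypothetical cut points alpha_1 < alpha_2 < alpha_3
index = 0.3               # hypothetical value of the index r*xi
for j in (1, 2, 3):
    print(j, e_wj_u_closed_form(index, cuts, j), e_wj_u_numeric(index, cuts, j))
```

Each closed form is a linear combination of the functions $\phi(\mathbf{r}\boldsymbol{\xi} - \alpha_j)$, which is why adding those $M$ functions to the estimating equation spans the same space as the original correction terms.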
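If the mutually exclusive treatments are from a multinomial probit model, then there are no simple reduced forms for each treatment. Assuming each follows a probit model, and ignoring the adding-up restriction, might be an acceptable approximation.
Example 2. Suppose that $M = 2$, $w_1$ follows a reduced form probit, and $w_2$ follows a reduced form Tobit. The procedure is to first estimate a probit model of $w_1$ on $(1, \mathbf{x}, \mathbf{z})$ and obtain $\hat{\Phi}_{i1} \equiv \Phi(\mathbf{r}_i\hat{\boldsymbol{\theta}}_1)$ and $\hat{\phi}_{i1} \equiv \phi(\mathbf{r}_i\hat{\boldsymbol{\theta}}_1)$. Then estimate a Tobit of $w_2$ on $(1, \mathbf{x}, \mathbf{z})$, obtain $\hat{\boldsymbol{\alpha}}_2$ and $\hat{\sigma}_2$, and form $\hat{\Phi}_{i2} = \Phi(\mathbf{r}_i\hat{\boldsymbol{\alpha}}_2/\hat{\sigma}_2)$, $\hat{\phi}_{i2} \equiv \phi(\mathbf{r}_i\hat{\boldsymbol{\alpha}}_2/\hat{\sigma}_2)$, and $\hat{w}_{i2} \equiv \hat{\Phi}_{i2}(\mathbf{r}_i\hat{\boldsymbol{\alpha}}_2) + \hat{\sigma}_2\hat{\phi}_{i2}$, $i = 1, 2, \ldots, N$. Finally, estimate the equation
$$y_i = \gamma_0 + \mathbf{x}_i\boldsymbol{\gamma} + \beta_1 w_{i1} + \beta_2 w_{i2} + w_{i1}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_1 + w_{i2}(\mathbf{x}_i - \bar{\mathbf{x}})\boldsymbol{\delta}_2 + \rho_1\hat{\phi}_{i1} + \rho_2\hat{\Phi}_{i2} + error_i$$
by IV, using instruments
$$[1, \mathbf{x}_i, \hat{\Phi}_{i1}, \hat{w}_{i2}, \hat{\Phi}_{i1}(\mathbf{x}_i - \bar{\mathbf{x}}), \hat{w}_{i2}(\mathbf{x}_i - \bar{\mathbf{x}}), \hat{\phi}_{i1}, \hat{\Phi}_{i2}]$$
If the elements of $\mathbf{z}_i$ have enough variation, the two ATEs are identified. As usual, the joint significance of the last two terms can be used to decide whether the slopes depend on individual-specific heterogeneity.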
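The bookkeeping in Example 2 can be sketched as a small helper that builds, for one observation, the regressor row and the instrument row from first-stage estimates. This is a hedged illustration only: the function name `example2_rows`, its argument layout, and the numerical inputs are hypothetical, and the first-stage probit and Tobit estimates are taken as given instead of being computed here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .cdf is Phi, .pdf is phi


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def example2_rows(x_i, z_i, x_bar, w_i1, w_i2, theta1_hat, alpha2_hat, sigma2_hat):
    """Build (regressors, instruments) for one observation in Example 2.

    theta1_hat: probit coefficients on r_i = (1, x_i, z_i) for w_1 (assumed given).
    alpha2_hat, sigma2_hat: Tobit coefficients and scale for w_2 (assumed given).
    """
    r_i = [1.0] + list(x_i) + list(z_i)
    probit_index = dot(r_i, theta1_hat)
    tobit_index = dot(r_i, alpha2_hat)
    Phi1 = STD_NORMAL.cdf(probit_index)
    phi1 = STD_NORMAL.pdf(probit_index)
    Phi2 = STD_NORMAL.cdf(tobit_index / sigma2_hat)
    phi2 = STD_NORMAL.pdf(tobit_index / sigma2_hat)
    w2_hat = Phi2 * tobit_index + sigma2_hat * phi2   # estimated E(w_2 | x, z)
    x_dev = [x - m for x, m in zip(x_i, x_bar)]       # x_i minus sample mean
    regressors = ([1.0] + list(x_i) + [w_i1, w_i2]
                  + [w_i1 * d for d in x_dev] + [w_i2 * d for d in x_dev]
                  + [phi1, Phi2])
    instruments = ([1.0] + list(x_i) + [Phi1, w2_hat]
                   + [Phi1 * d for d in x_dev] + [w2_hat * d for d in x_dev]
                   + [phi1, Phi2])
    return regressors, instruments


regs, ivs = example2_rows(
    x_i=[2.0], z_i=[0.0, 0.0], x_bar=[1.0], w_i1=1.0, w_i2=3.0,
    theta1_hat=[0.0, 0.0, 1.0, 1.0], alpha2_hat=[0.0, 0.0, 2.0, 3.0],
    sigma2_hat=2.0,
)
print(len(regs), len(ivs))
```

The two rows have the same length, so the system is just identified; the correction terms $\hat{\phi}_{i1}$ and $\hat{\Phi}_{i2}$ appear in both rows because they act as their own instruments.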
6. CONCLUSIONS AND DIRECTIONS FOR FURTHER RESEARCH
I have developed a framework for estimating the ATE in the correlated random effects model. The assumptions put the focus on consistently estimating the ATEs (or the ATEs conditional on the covariates, $\mathbf{x}$), and so are in the spirit of the recent causal effect literature. Rather than just using IV to estimate the local ATE, or putting potentially wide bounds on the ATE, the approach here is to impose assumptions that identify the average effect for general kinds of treatments. The key difference between the "correction function" approach here and the more typical CF approach is that I find the expected value of the interaction between the treatment and the unobserved heterogeneity conditional on exogenous variables only, and not conditional on the treatment, too. One advantage of the new approach is that a simple variable addition test, in the context of IV, allows one to determine whether
unobserved heterogeneity interacting with an endogenous treatment seems to be important in a particular application. The difference in the IV estimates of the ATE, with and without the correction function, can be valuable information. The correction function approach should be viewed as complementary to the usual CF approach, as the former requires fewer assumptions when the instruments have sufficient variation, yet loses identification in special cases where the CF approach does not.
For two important cases, a treatment that follows a reduced form probit and one that follows a reduced form Tobit, I derive the correction functions and show that they have remarkably simple forms. It is possible to make the distributions and conditional expectations flexible, although this increases the computational requirements and could affect identification. Allowing more flexible functional forms generally requires more variation in the IVs. I have also shown how simple it is to use the correction function approach for multiple treatments that can have very different characteristics. With multiple treatments it is typically much easier to derive the set of correction functions, which requires calculations using univariate distributions, than to find the CF, which depends on all of the endogenous treatments. Still, whether the correction function approach works well in practice remains to be seen. The example I worked through in Section 4.1 shows that it is promising.
Future work could focus on relaxing the distributional and functional form assumptions. For example, in the case of a binary treatment $w$, and no covariates, one can write Eq. (30) without any distributional assumptions as
$$y = \gamma_0 + \beta w + g(z) + q, \qquad E(q \mid z) = 0 \tag{51}$$
where $g(z) \equiv E[(b - \beta)w \mid z]$ is an unknown function. Even when $z$ takes on more than two values, Eq. (51) does not identify $\beta$ without further restrictions: if $E(w \mid z)$ is linear in $g(z)$, any attempt at IV would violate the rank condition, because if $k(z)$ is the proposed vector of instruments for $w$, the linear projection of $w$ on $[1, k(z), g(z)]$ would depend only on $g(z)$. The assumptions in Section 4.1 rule out this possibility because $E(w \mid z) = \Phi(\theta_0 + \theta_1 z)$ is strictly monotonic in $z$ (when $\theta_1 \neq 0$) while $g(z) = \rho\phi(\theta_0 + \theta_1 z)$ is symmetric about its maximum, $z = -\theta_0/\theta_1$. In many applications, it seems reasonable to expect monotonicity of $E(w \mid z)$, something one can nevertheless test because $E(w \mid z)$ is nonparametrically identified, and nonmonotonicity of $E[(b - \beta)w \mid z]$ (which cannot be tested). Intuitively, such shape restrictions should be enough to identify $\beta$ from
$E(y \mid z) = \gamma_0 + \beta G(z) + g(z)$, where, for the purposes of identification, $G(z) \equiv E(w \mid z)$ can be treated as known. The function $G(z)$, which one can estimate in a first stage, is monotonically increasing while $g(z)$ is not. How one would specifically tackle estimation of $\beta$ is an open question. Adding covariates, and allowing nonbinary treatments, would be even more challenging.
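The shape and collinearity arguments used in this chapter for the probit case can be illustrated with a few lines of arithmetic. The sketch below assumes an eight-point design $z \in \{-3.5, -2.5, \ldots, 3.5\}$ with equal probabilities (the equal spacing is an assumption made here for illustration) and hypothetical values of $\theta_0$ and $\theta_1$. It checks that $z$ and $\phi(\theta_1 z)$ are uncorrelated when $\theta_0 = 0$, that the covariance is nonzero when $\theta_0 \neq 0$, that $\phi(\theta_0 + \theta_1 z)$ is symmetric about $z = -\theta_0/\theta_1$, and that with binary $z$ the correction function is an exact linear function of $z$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # .pdf is phi, .cdf is Phi


def covariance(xs, ys):
    """Population covariance when the support points are equally likely."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


z_support = [-3.5 + k for k in range(8)]   # assumed equally spaced eight-point design
theta1 = 0.7                               # hypothetical slope

# theta0 = 0: z and phi(theta1 * z) are uncorrelated by symmetry.
cov_symmetric = covariance(z_support, [STD_NORMAL.pdf(theta1 * z) for z in z_support])

# theta0 = 0.5, theta1 = 1: the symmetry is broken and the covariance is nonzero.
cov_shifted = covariance(z_support, [STD_NORMAL.pdf(0.5 + z) for z in z_support])

# phi(theta0 + theta1 * z) is symmetric about z0 = -theta0 / theta1.
theta0 = 0.5
z0 = -theta0 / 1.0
g_right = STD_NORMAL.pdf(theta0 + (z0 + 1.3))
g_left = STD_NORMAL.pdf(theta0 + (z0 - 1.3))

# Binary z: any function of z is linear in z, so phi(.) is collinear with (1, z).
intercept = STD_NORMAL.pdf(theta0)
slope = STD_NORMAL.pdf(theta0 + theta1) - STD_NORMAL.pdf(theta0)
binary_fit_errors = [
    abs(intercept + slope * z - STD_NORMAL.pdf(theta0 + theta1 * z)) for z in (0, 1)
]

print(cov_symmetric, cov_shifted, g_right - g_left, binary_fit_errors)
```

The last check is the reason a binary instrument cannot separate $\beta$ from the correction term in the no-covariate equation: with two support points there is no variation left in $\phi(\theta_0 + \theta_1 z)$ after controlling for a constant and $z$.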
ACKNOWLEDGMENTS
I would like to thank David Card, Edward Vytlacil, two anonymous referees, and the participants in the Michigan State University and New York University econometrics workshops for providing very helpful comments and suggestions on earlier drafts.
REFERENCES
Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113, 231–263.
Angrist, J. D. (1991). Instrumental variables estimation of average treatment effects in econometrics and epidemiology. National Bureau of Economic Research Technical Working Paper no. 115.
Angrist, J. D., & Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, 106, 979–1014.
Barnow, B., Cain, G., & Goldberger, A. (1980). Issues in the analysis of selectivity bias. Evaluation Studies, 5, 42–59.
Blundell, R. W., & Powell, J. L. (2004). Endogeneity in semiparametric binary response models. Review of Economic Studies, 71, 655–679.
Card, D. (1995). Using geographic variation in college proximity to estimate the return to schooling. In: L. N. Christophides, E. K. Grant & R. Swidinsky (Eds), Aspects of labour market behavior: Essays in honour of John Vanderkamp (pp. 201–222). Toronto: University of Toronto Press.
Card, D. (2001). Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69, 1127–1160.
Chernozhukov, V., & Hansen, C. (2005). An IV model for quantile treatment effects. Econometrica, 73, 245–261.
Chernozhukov, V., Imbens, G. W., & Newey, W. K. (2004). Instrumental variable identification and estimation of nonseparable models via quantile conditions. Working Paper, MIT Department of Economics.
Chesher, A. (2003). Identification in nonseparable models. Econometrica, 71, 1405–1441.
Chesher, A. (2005). Nonparametric identification under discrete variation. Econometrica, 73, 1525–1550.
Garen, J. (1984). The returns to schooling: A selectivity bias approach with a continuous choice variable. Econometrica, 52, 1199–1218.
Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5, 475.
Heckman, J. J., & Navarro-Lozano, S. (2004). Using matching, instrumental variables, and control functions to estimate economic choice models. Review of Economics and Statistics, 86, 30–57.
Heckman, J. J., & Robb, R. (1986). Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes. In: H. Wainer (Ed.), Drawing inferences from self-selected samples (pp. 63–113). New York: Springer-Verlag.
Heckman, J. J., Tobias, J. L., & Vytlacil, E. (2003). Simple estimators for treatment parameters in a latent-variable framework. Review of Economics and Statistics, 85, 748–755.
Heckman, J. J., & Vytlacil, E. (1998). Instrumental variables methods for the correlated random coefficient model. Journal of Human Resources, 33, 974–987.
Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–476.
Imbens, G. W., & Newey, W. K. (2002). Identification and estimation of triangular simultaneous equations models without additivity. National Bureau of Economic Research Technical Working Paper no. 285.
Lee, L.-F. (1978). Unionism and wage rates: A simultaneous equation model with qualitative and limited dependent variables. International Economic Review, 19, 415–433.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In: R. F. Engle & D. McFadden (Eds), Handbook of econometrics (Vol. 4, pp. 2111–2245). Amsterdam: North Holland.
Rivers, D., & Vuong, Q. H. (1988). Limited Information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics, 39, 347–366.
Smith, R., & Blundell, R. (1986). An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica, 54, 679–685.
Vella, F. (1993). A simple estimator for simultaneous equations models with censored endogenous regressors. International Economic Review, 34, 441–457.
Vella, F., & Verbeek, M. (1999). Estimating and interpreting models with endogenous treatment effects. Journal of Business and Economic Statistics, 17, 473–478.
Wooldridge, J. M. (1997). On two stage least squares estimation of the average treatment effect in a random coefficient model. Economics Letters, 56, 129–133.
Wooldridge, J. M. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Wooldridge, J. M. (2003). Further results on instrumental variables estimation of average treatment effects in the correlated random coefficient model. Economics Letters, 79, 185–191.
Wooldridge, J. M. (2004). Estimating average partial effects under conditional moment independence assumptions. Working Paper, Michigan State University Department of Economics.
Wooldridge, J. M. (2005). Unobserved heterogeneity and estimation of average partial effects. In: D. W. K. Andrews & J. H. Stock (Eds), Identification and inference for econometric models: Essays in honor of Thomas Rothenberg (pp. 27–55). Cambridge: Cambridge University Press.
EVALUATING THE EFFECTS OF JOB TRAINING PROGRAMS ON WAGES THROUGH PRINCIPAL STRATIFICATION
Junni L. Zhang, Donald B. Rubin and Fabrizia Mealli
ABSTRACT
In an evaluation of a job training program, the causal effects of the program on wages are often of more interest to economists than the program's effects on employment or on income. The reason is that the effects on wages reflect the increase in human capital due to the training program, whereas the effects on total earnings or income may be simply reflecting the increased likelihood of employment without any effect on wage rates. Estimating the effects of training programs on wages is complicated by the fact that, even in a randomized experiment, wages are truncated by nonemployment, i.e., are only observed and well-defined for individuals who are employed. We present a principal stratification approach applied to a randomized social experiment that classifies participants into four latent groups according to whether they would be employed or not under treatment and control, and argue that the average treatment effect on wages is only clearly defined for those who would be employed whether they were trained or not. We summarize large sample
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 117–145
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00005-9
JUNNI L. ZHANG ET AL.
bounds for this average treatment effect, and propose and derive a Bayesian analysis and the associated Bayesian Markov Chain Monte Carlo computational algorithm. Moreover, we illustrate the application of new code checking tools to our Bayesian analysis to detect possible coding errors. Finally, we demonstrate our Bayesian analysis using simulated data.
1. INTRODUCTION
Consider the evaluation of a job training program through a social experiment with eligible applicants randomized into either a treatment group receiving training or a control group receiving no training. To evaluate the social benefit of the training, we wish to estimate the treatment's effect on wages. This estimation, however, is complicated by the fact that wages are truncated by nonemployment, that is, not well-defined for those who are not employed. This is a well-known problem in econometrics, often referred to as sample selection (Heckman, 1979).
An obvious, but generally incorrect, approach to adjust for such truncation is to compare the employed groups under the two treatment arms, either through direct comparison of means or through regression adjusted comparisons, using the treatment indicator as a regressor (O'Leary, 1995; Decker & Corson, 1995; Kodrzycki, 1997). This approach is generally invalid even in a well-conducted randomized experiment, because even though treated and control groups are comparable at the baseline, they may be different conditional on being employed in a period after treatment assignment. This is because the observed and unobserved characteristics of those who are employed in the treatment and control groups may differ between the groups. Specifically for this reason, most of the evaluation studies of job training focus either on employment or on total earnings, setting earnings to zero for those who are not working (Heckman, Lalonde, & Smith, 1999).
An alternative approach to account for the nonobservability of wages for those who are not employed is the use of selection models (Heckman, 1979). The rationale is that the decision to work or not to work was made by the individual, and thus those who are working comprise a self-selected sample.
More specifically, for individual i, let Zi denote the indicator for treatment assignment: Zi=1 for assignment to treatment, Zi=0 for assignment to control; let Si denote the observed employment indicator: Si=1 for employed and 0 otherwise; and let Yi denote the individual’s wage, which
Evaluating the Effects of Job Training Programs on Wages
is only observed as a real number if $S_i = 1$. The selection model assumes that there exists an underlying regression relationship
$$\log(Y_i) = \beta_0 + \beta_1 Z_i + \boldsymbol{\beta}^T X_{1i} + u_{1i}$$
where the dependent variable $Y_i$ is observed only if $S_i = 1$, and the selection equation is
$$S_i^* = \gamma_0 + \gamma_1 Z_i + \boldsymbol{\gamma}^T X_{2i} + u_{2i}, \qquad S_i = 1 \ \text{if } S_i^* \ge 0, \qquad S_i = 0 \ \text{if } S_i^* < 0 \tag{1}$$
Here $X_{1i}$ and $X_{2i}$ are two possibly overlapping sets of covariates, $u_{1i}$ and $u_{2i}$ are jointly normal variables with mean 0 and covariance matrix $\Sigma$, and $S_i^*$ is a continuous latent variable underlying the observed employment indicator. Implicitly, the selection model assumes that the wages the nonemployed people would have if they were exposed to the same treatment but, counterfactually, were employed, can be determined just as the wages for those who would be employed (note that this assumption is based on quantities that can never be observed). Under such an assumption, the selection model is used to estimate the average treatment effect on wages for all people. Extensions of this typical selection model include models with heterogeneous treatment effects, and in particular, models with two versions of the $S_i^*$ and $\log(Y_i)$ corresponding to the two wages associated with the two treatment levels $Z_i = 0$ and $Z_i = 1$.
If joint normality is assumed for $\log(Y_i)$ and $S_i^*$, identification is driven by this assumption, and even in cases where normality holds, identification is usually weak and estimates can be very sensitive to the inclusion/exclusion of covariates and their interactions (see, e.g., Little, 1985; Copas & Li, 1997). If normality is relaxed, identification of selection models usually relies on an instrument (or a set of instruments) that determines the employment status but does not affect wages after controlling for other $X$ variables (Heckman, 1990; Heckman & Smith, 1998; Heckman & Vytlacil, 1999). Finding such a variable, however, is not an easy task. In the absence of valid instruments, Lee (2005) developed sharp bounds on treatment effects on wages, which are a special case of those derived in Zhang and Rubin (2003) and discussed in Section 3.
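The data structure implied by the selection model in Eq. (1) can be sketched with a small simulation. This is an illustration only: it omits the covariates $X_{1i}$ and $X_{2i}$, and the function name, parameter values, and error correlation are hypothetical. The point is simply that the wage is generated only when the latent index is nonnegative, and that the two errors are correlated, which is what makes the employed a self-selected sample.

```python
import math
import random


def simulate_selection_model(n, seed=0, beta0=2.0, beta1=0.1,
                             gamma0=0.2, gamma1=0.5, corr=0.6):
    """Simulate (Z_i, S_i, Y_i) from a selection model without covariates.

    Y_i is None when S_i = 0, mirroring a wage that is not observed.
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        z = rng.randint(0, 1)                       # randomized assignment
        u2 = rng.gauss(0.0, 1.0)                    # selection-equation error
        u1 = corr * u2 + math.sqrt(1.0 - corr ** 2) * rng.gauss(0.0, 1.0)
        s_star = gamma0 + gamma1 * z + u2           # latent employment index
        s = 1 if s_star >= 0 else 0
        wage = math.exp(beta0 + beta1 * z + u1) if s == 1 else None
        records.append((z, s, wage))
    return records


data = simulate_selection_model(5000)
for arm in (0, 1):
    logs = [math.log(y) for z, s, y in data if z == arm and s == 1]
    print(arm, len(logs), sum(logs) / len(logs))
```

Comparing the two printed means of log wages among the employed generally does not recover `beta1`, because the conditional mean of `u1` given employment differs between the arms when `gamma1` is nonzero and the errors are correlated.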
Following Zhang and Rubin (2003) and Zhang, Rubin, and Mealli (2005), we use a principal stratification (Frangakis & Rubin, 2002) framework to address the issue of sample selection (i.e., truncation by nonemployment), where the primary target of estimation is the average treatment effect on wages for those who would be employed irrespective of the treatment
assignment. Principal stratification defines causal estimands about which experimental data can provide direct evidence, defines meaningful subgroups of people, and expresses assumptions that are explicit behavioral hypotheses. Another more standard example of the application of principal stratification is noncompliance with assigned treatment. Principal stratification (Frangakis & Rubin, 2002) has been developed within randomized studies in order to deal with posttreatment complications (such as noncompliance, missing outcomes, truncation by death, surrogate outcomes) but can, in principle, be extended to deal with such problems in observational studies (Grilli & Mealli, 2005), which of course would require additional assumptions on treatment assignment. The full Bayesian approach to the problem of noncompliance is proposed in Imbens and Rubin (1997) and is further developed in Hirano, Imbens, Rubin, and Zhou (2000). The full Bayesian approach to our problem followed in this chapter is more complex than is needed to deal with noncompliance. Also, it goes beyond that of finding bounds on treatment effects, because it can help remove assumptions and/or assess the sensitivity of results to deviations from such assumptions. Here, we also describe the Markov Chain Monte Carlo (MCMC) code for implementing the Bayesian analysis, and, moreover, illustrate the use of new software-checking tools to help ensure the correctness of the code. Using a small frequentist simulation, we illustrate that our Bayesian analysis and the code for it can correctly estimate the causal effect of interest.
The rest of the chapter is organized as follows. In Section 2, we present our causal inference framework. In Section 3, we display nonparametric large sample bounds for the average treatment effect of interest following Zhang and Rubin (2003), and describe two assumptions that can help tighten these bounds.
In Section 4, we present parametric models including prior distributions for the parameters. Computational aspects for the Bayesian analysis are addressed in Section 5. We then present an illustrative simulated example in Section 6. Section 7 concludes the chapter.
2. THE CAUSAL INFERENCE FRAMEWORK: POTENTIAL OUTCOMES, THE OBSERVED DATA, AND PRINCIPAL STRATIFICATION
We use the potential outcomes framework to define causal effects, and place a Bayesian model on the potential outcomes; this approach is known as the
Rubin Causal Model. It defines causal effects by the comparison of treated and control potential outcomes on a common set of units (Rubin, 1974), specifies an assignment mechanism as a stochastic function of covariates and possibly potential outcomes (Rubin, 1975), and derives causal inferences using the posterior predictive distribution of the causal effects (Rubin, 1978).¹
Let $N$ denote the total number of participants in the social experiment for the evaluation of a job training program. For participant $i$, let $X_i$ denote the pretreatment covariates; let $S_i(1)$ denote the potential employment indicator when participant $i$ is trained, and let $Y_i(1)$ denote the potential wage when participant $i$ is trained; let $S_i(0)$ and $Y_i(0)$, respectively, denote the potential employment indicator and the potential wage when participant $i$ is not trained. For $z = 1, 0$, $Y_i(z) \in \mathbb{R}^+$ (the set of positive real numbers) when $S_i(z) = 1$, and $Y_i(z) = \ast$ when $S_i(z) = 0$. The participants can be classified into four principal strata (Frangakis & Rubin, 2002) according to the joint values of the two potential employment indicators:
$EE = \{i \mid S_i(1) = S_i(0) = 1\}$, those who would be employed whether trained or not;
$EN = \{i \mid S_i(1) = 1, S_i(0) = 0\}$, those who would be employed when trained, but not employed when not trained;
$NE = \{i \mid S_i(1) = 0, S_i(0) = 1\}$, those who would be not employed when trained, but employed when not trained;
$NN = \{i \mid S_i(1) = S_i(0) = 0\}$, those who would be not employed whether trained or not.
Note that the principal strata may result from both observable and unobservable characteristics that can also affect wages; it is precisely because there can be unobservables influencing both posttreatment employment and wages that principal strata are defined.
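The definition of the four principal strata can be sketched in code. The table of units below is entirely hypothetical and lists both potential employment indicators and both potential wages for each unit, which is never possible with real data; it serves only to make the classification concrete and to show that a wage contrast between the two arms is computable only for units with $S_i(1) = S_i(0) = 1$. The names `principal_stratum` and `ate_ee` are illustrative.

```python
STRATUM_LABELS = {(1, 1): "EE", (1, 0): "EN", (0, 1): "NE", (0, 0): "NN"}


def principal_stratum(s1, s0):
    """Map the pair of potential employment indicators (S(1), S(0)) to a stratum."""
    return STRATUM_LABELS[(s1, s0)]


def ate_ee(units):
    """Average of Y(1) - Y(0) over units in the EE stratum."""
    diffs = [u["y1"] - u["y0"] for u in units
             if principal_stratum(u["s1"], u["s0"]) == "EE"]
    return sum(diffs) / len(diffs)


# Hypothetical full table of potential outcomes (None marks an undefined wage).
units = [
    {"s1": 1, "s0": 1, "y1": 12.0, "y0": 10.0},
    {"s1": 1, "s0": 1, "y1": 15.0, "y0": 14.0},
    {"s1": 1, "s0": 0, "y1": 8.0, "y0": None},
    {"s1": 0, "s0": 1, "y1": None, "y0": 9.0},
    {"s1": 0, "s0": 0, "y1": None, "y0": None},
]

print([principal_stratum(u["s1"], u["s0"]) for u in units])
print(ate_ee(units))
```

For the other three strata at least one of the two potential wages is undefined, so the unit-level difference cannot be formed without further assumptions.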
Because a unit-level causal effect is defined by a comparison of potential outcomes, the treatment effect on wages for participant $i$ is a comparison of $Y_i(1)$ and $Y_i(0)$, for example, $Y_i(1) - Y_i(0)$, which is well defined on $\mathbb{R}$ (the set of real numbers) only for the EE group. Consequently, we consider the treatment effect of primary interest to be $ATE_{EE} = \bar{Y}_{EE}(1) - \bar{Y}_{EE}(0)$, the average difference between $Y_i(1)$ and $Y_i(0)$ for the EE group. Some labor economists might argue that it could also be meaningful to talk about a treatment effect on wages for the NE group, the EN group or the NN group, because for them, wages for individuals who are not employed can be defined, and the reason why these wages are not observed is that they are
lower than the individuals' "reservation wages." Here we do not want to argue for this rationale, but want to emphasize that, at least in the experiment we examine, individuals in the EN group or in the NN group are never observed as employed when not trained (and so their wage is never observed when not trained), and individuals in the NE group or in the NN group are never observed as employed when trained (and so their wage is never observed when trained). Thus, only by adding assumptions on quantities that are never observed in this experiment is it possible to use the observed data to estimate an effect of training on wages for the EN, NE or NN groups. We prefer to avoid a priori defining the problem in terms of these assumptions and focus therefore on inferences about $ATE_{EE}$.

As in Section 1, for each participant $i$, let $Z_i$ denote the observed treatment assignment indicator. Because we can only observe the employment indicator and wage under the assigned treatment arm, the observed data are $\{(X_i, Z_i, S_i^{obs} = S_i(Z_i), Y_i^{obs} = Y_i(Z_i)),\ i = 1, \ldots, N\}$. Note that the notation $S_i$ in Section 1 is replaced by $S_i^{obs}$ to reinforce that employment is generally affected by $Z_i$: $S_i^{obs} = Z_i S_i(1) + (1 - Z_i) S_i(0)$. Classified by $Z_i$ and $S_i^{obs}$, the following four groups can be observed:

$O(1,1) = \{i \mid Z_i = 1, S_i^{obs} = 1\}$, those who were assigned to the training group and were employed;
$O(1,0) = \{i \mid Z_i = 1, S_i^{obs} = 0\}$, those who were assigned to the training group and were not employed;
$O(0,1) = \{i \mid Z_i = 0, S_i^{obs} = 1\}$, those who were assigned to the control group and were employed;
$O(0,0) = \{i \mid Z_i = 0, S_i^{obs} = 0\}$, those who were assigned to the control group and were not employed.
For each observed group, Table 1 shows the observed data pattern and corresponding principal strata. We can see that the principal stratum $G_i$ for participant $i$ usually cannot be directly observed; hence, the EE group is not an observable subgroup of participants although it is a meaningful subgroup.

Table 1. Observed Data Pattern and Principal Strata for the Observed Groups.

Observed Group $O(Z, S^{obs})$    $Z_i$    $S_i^{obs}$    $Y_i^{obs}$             Principal Stratum $G_i$
$O(1,1)$                          1        1              $\in \mathbb{R}^{+}$    EE or EN
$O(1,0)$                          1        0              $*$                     NE or NN
$O(0,1)$                          0        1              $\in \mathbb{R}^{+}$    EE or NE
$O(0,0)$                          0        0              $*$                     EN or NN
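The correspondence between principal strata and observed groups can be checked mechanically. The following is a minimal sketch in Python; the function and variable names are hypothetical and serve only to illustrate the relation $S_i^{obs} = Z_i S_i(1) + (1 - Z_i) S_i(0)$ and the mapping in Table 1.

```python
# Illustrative only: all names below are hypothetical, not from the chapter.

def principal_stratum(s1, s0):
    """Stratum label from the potential employment indicators (S(1), S(0))."""
    return {(1, 1): "EE", (1, 0): "EN", (0, 1): "NE", (0, 0): "NN"}[(s1, s0)]


def observed_group(z, s1, s0):
    """Observed group (Z, S_obs), using S_obs = Z*S(1) + (1 - Z)*S(0)."""
    return (z, z * s1 + (1 - z) * s0)


# Strata compatible with each observed group, as listed in Table 1.
COMPATIBLE_STRATA = {
    (1, 1): {"EE", "EN"},
    (1, 0): {"NE", "NN"},
    (0, 1): {"EE", "NE"},
    (0, 0): {"EN", "NN"},
}


def strata_by_observed_group():
    """Rebuild the Table 1 mapping by enumerating every (Z, S(1), S(0))."""
    table = {}
    for z in (0, 1):
        for s1 in (0, 1):
            for s0 in (0, 1):
                group = observed_group(z, s1, s0)
                table.setdefault(group, set()).add(principal_stratum(s1, s0))
    return table
```

Enumerating all eight combinations reproduces the four rows of Table 1, which makes explicit why each observed group is a mixture of exactly two strata.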
3. LARGE SAMPLE BOUNDS AND IDENTIFYING ASSUMPTIONS

Zhang and Rubin (2003) showed that Manski’s approach (Manski, 1989, 1990, 1995, 2003) can be applied to obtain nonparametric large sample bounds for $ATE_{EE}$ in the case when there are no covariates and the proportions of the principal strata are simply given by $(\pi_{EE}, \pi_{EN}, \pi_{NE}, \pi_{NN})$, where $\pi_g$ is the proportion of individuals belonging to principal stratum $g$. Because the treatment is randomly assigned, we have

\[
\begin{aligned}
\pi_{EE} + \pi_{EN} &= \Pr(S^{obs} = 1 \mid Z = 1) \\
\pi_{NE} + \pi_{NN} &= \Pr(S^{obs} = 0 \mid Z = 1) \\
\pi_{EE} + \pi_{NE} &= \Pr(S^{obs} = 1 \mid Z = 0) \\
\pi_{EN} + \pi_{NN} &= \Pr(S^{obs} = 0 \mid Z = 0)
\end{aligned}
\]

Also, $\pi_{EE} + \pi_{EN} + \pi_{NE} + \pi_{NN} = 1$. Therefore, for any value of $\pi_{NE}$ that satisfies

\[
\max[0,\ \Pr(S^{obs} = 1 \mid Z = 0) - \Pr(S^{obs} = 1 \mid Z = 1)] \le \pi_{NE} \le \min[\Pr(S^{obs} = 1 \mid Z = 0),\ \Pr(S^{obs} = 0 \mid Z = 1)] \tag{2}
\]

we have

\[
\begin{aligned}
\pi_{EE} &= \Pr(S^{obs} = 1 \mid Z = 0) - \pi_{NE} \\
\pi_{EN} &= \Pr(S^{obs} = 1 \mid Z = 1) - \Pr(S^{obs} = 1 \mid Z = 0) + \pi_{NE} \\
\pi_{NN} &= \Pr(S^{obs} = 0 \mid Z = 1) - \pi_{NE}
\end{aligned}
\]

Because the $O(1,1)$ group is a $\pi_{EE}/(\pi_{EE} + \pi_{EN})$ and $\pi_{EN}/(\pi_{EE} + \pi_{EN})$ mixture of the EE and EN principal strata, the upper bound for $\bar{Y}_{EE}(1) = E[Y^{obs} \mid G = EE, Z = 1]$ is the average value of $Y^{obs}$ in the

\[
p_1(\pi_{NE}) = \frac{\pi_{EE}}{\pi_{EE} + \pi_{EN}} = \frac{\Pr(S^{obs} = 1 \mid Z = 0) - \pi_{NE}}{\Pr(S^{obs} = 1 \mid Z = 1)} \tag{3}
\]

fraction of the $O(1,1)$ group with the largest values of $Y^{obs}$, which is denoted by $\bar{Y}_{O(1,1)}[\max \mid p_1(\pi_{NE})]$. Analogously, the lower bound for $\bar{Y}_{EE}(1)$ is the average value of $Y^{obs}$ in the $p_1$ fraction of the $O(1,1)$ group with the smallest values of $Y^{obs}$, which is denoted by $\bar{Y}_{O(1,1)}[\min \mid p_1(\pi_{NE})]$. Analogously, because the $O(0,1)$ group is a $\pi_{EE}/(\pi_{EE} + \pi_{NE})$ and $\pi_{NE}/(\pi_{EE} + \pi_{NE})$ mixture of the EE and NE principal strata, the upper bound and lower bound for $\bar{Y}_{EE}(0)$ are given by $\bar{Y}_{O(0,1)}[\max \mid p_0(\pi_{NE})]$ and $\bar{Y}_{O(0,1)}[\min \mid p_0(\pi_{NE})]$, respectively, where

\[
p_0(\pi_{NE}) = \frac{\pi_{EE}}{\pi_{EE} + \pi_{NE}} = 1 - \frac{\pi_{NE}}{\Pr(S^{obs} = 1 \mid Z = 0)} \tag{4}
\]

Hence, the lower bound for $ATE_{EE}$ is

\[
\min_{\pi_{NE}} \left\{ \bar{Y}_{O(1,1)}[\min \mid p_1(\pi_{NE})] - \bar{Y}_{O(0,1)}[\max \mid p_0(\pi_{NE})] \right\} \tag{5}
\]

and the upper bound for $ATE_{EE}$ is

\[
\max_{\pi_{NE}} \left\{ \bar{Y}_{O(1,1)}[\max \mid p_1(\pi_{NE})] - \bar{Y}_{O(0,1)}[\min \mid p_0(\pi_{NE})] \right\} \tag{6}
\]
where the range for $\pi_{NE}$ is given by Eq. (2), and $p_1$ and $p_0$ are given by Eqs. (3) and (4), respectively.

Zhang and Rubin (2003) also examined two explicit assumptions that may be considered plausible in specific applications and can help sharpen the bounds. The first, in the context of evaluation of job training programs, assumes that the training is not harming anyone in the following sense: no one would be employed when not trained, but not employed when trained.

Assumption 1. (Monotonicity): There is no NE group.

This assumption is less plausible in the job training context than the analogous monotonicity assumption typically made in instrumental variable (IV) analysis (e.g., with noncompliance to assigned treatment, see Angrist, Imbens, & Rubin, 1996), because here monotonicity rules out, a priori, the possibility of a negative treatment effect on employment, whereas in the IV context the assumption rules out defiers, those who would do the opposite of their assignment. It is plausible in practice that there are individuals who would be employed when not trained but not employed when trained. This result is especially possible for evaluations that look only at short-run outcomes, because training programs produce what is known as lock-in effects, in which control group members find work while treatment group members are taking training. Also, some people may take any job that is available without training, but, after being trained, choose to wait for a better job, and thus choose to be not employed (i.e., the training increases their reservation wages). Note that when $\gamma_1 > 0$ in the selection Eq. (1), the monotonicity assumption is implicitly made, and this assumption is also used by Lee (2005).

Under the monotonicity assumption, $\pi_{NE} = 0$, and hence the lower bound and upper bound for $ATE_{EE}$, respectively, are obtained by using $\pi_{NE} = 0$ in Eqs. (5) and (6). The resulting bounds are

\[
\bar{Y}_{O(1,1)}\left[\min \,\middle|\, \frac{\Pr(S^{obs} = 1 \mid Z = 0)}{\Pr(S^{obs} = 1 \mid Z = 1)}\right] - \bar{Y}_{O(0,1)}
\]

and

\[
\bar{Y}_{O(1,1)}\left[\max \,\middle|\, \frac{\Pr(S^{obs} = 1 \mid Z = 0)}{\Pr(S^{obs} = 1 \mid Z = 1)}\right] - \bar{Y}_{O(0,1)}
\]

where $\bar{Y}_{O(0,1)}$ is equal to the average value of $Y^{obs}$ in the $O(0,1)$ group. These bounds are equal to those in Lee (2005), and hence our bounds in Eqs. (5) and (6) are more general than those in Lee (2005) and can be applied when the monotonicity assumption is not made.

A second assumption that can help sharpen the bounds is the stochastic dominance condition, which is related to, but different from, both the stochastic dominance assumption in Manski (1995), which assumes that the wage under training stochastically dominates that under no training, and the perfect positive rank correlation in Heckman, Smith, and Clements (1997). For any principal stratum $g \in \{EE, EN, NE, NN\}$ and $z = 0, 1$, let $Y_g(z)$ denote the wage under treatment assignment $z$ for a random member of $g$.

Assumption 2. (Stochastic dominance): For any real number $t$, $\Pr(Y_{EE}(1) \le t) \le \Pr(Y_{EN}(1) \le t)$ and $\Pr(Y_{EE}(0) \le t) \le \Pr(Y_{NE}(0) \le t)$.

This assumption formalizes the notion that the EE group is more capable, and so tends to have higher wages, than either the EN group or the NE group; the EE group would be employed under either treatment arm, whereas the EN group or the NE group would be not employed under one treatment arm. Because ability tends to be positively correlated with wages, Assumption 2 may often be plausible in practice.
The stochastic dominance assumption remains invariant with respect to monotonically increasing transformations of Y, such as the logarithm transformation. In other words, assuming stochastic dominance for the logarithm of wages is equivalent to assuming stochastic dominance for raw wages.
Under the stochastic dominance assumption, $\bar{Y}_{EE}(1)$ achieves its minimum when it equals $\bar{Y}_{EN}(1)$ and hence equals $\bar{Y}_{O(1,1)}$. Analogously, the lower bound for $\bar{Y}_{EE}(0)$ is $\bar{Y}_{O(0,1)}$. So the lower bound and upper bound for $ATE_{EE}$ without the monotonicity assumption, respectively, are

\[
\bar{Y}_{O(1,1)} - \max_{\pi_{NE}} \bar{Y}_{O(0,1)}[\max \mid p_0(\pi_{NE})]
\]

and

\[
\max_{\pi_{NE}} \bar{Y}_{O(1,1)}[\max \mid p_1(\pi_{NE})] - \bar{Y}_{O(0,1)}
\]

where the range for $\pi_{NE}$ is again given by Eq. (2). When the monotonicity assumption is also made, these bounds become

\[
\bar{Y}_{O(1,1)} - \bar{Y}_{O(0,1)}
\quad \text{and} \quad
\bar{Y}_{O(1,1)}\left[\max \,\middle|\, \frac{\Pr(S^{obs} = 1 \mid Z = 0)}{\Pr(S^{obs} = 1 \mid Z = 1)}\right] - \bar{Y}_{O(0,1)}
\]
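The bounds in Eqs. (5) and (6) can be approximated from a sample by replacing the probabilities with observed employment rates and searching over a grid of values for the NE proportion within the range in Eq. (2). The sketch below is one possible plug-in implementation, not the authors' code; the function names, the grid search, and the rounding used to pick the trimmed fraction are all assumptions made for illustration.

```python
import numpy as np


def trimmed_mean(y, frac, largest):
    """Mean of the `frac` fraction of y with the largest (or smallest) values."""
    y = np.sort(np.asarray(y, dtype=float))
    k = max(1, int(round(frac * y.size)))  # number of units kept (assumed rule)
    return y[-k:].mean() if largest else y[:k].mean()


def ate_ee_bounds(z, s, y, monotone=False, n_grid=201):
    """Plug-in analogue of Eqs. (5) and (6); wages of non-employed may be NaN."""
    z, s = np.asarray(z), np.asarray(s)
    y = np.asarray(y, dtype=float)
    p11 = s[z == 1].mean()              # Pr(S_obs = 1 | Z = 1)
    p01 = s[z == 0].mean()              # Pr(S_obs = 1 | Z = 0)
    y11 = y[(z == 1) & (s == 1)]        # wages in O(1,1)
    y01 = y[(z == 0) & (s == 1)]        # wages in O(0,1)
    lo_pi = max(0.0, p01 - p11)         # range of pi_NE from Eq. (2)
    hi_pi = min(p01, 1.0 - p11)
    if monotone:
        if lo_pi > 0:
            raise ValueError("monotonicity is incompatible with these rates")
        grid = [0.0]
    else:
        grid = np.linspace(lo_pi, hi_pi, n_grid)
    lower, upper = np.inf, -np.inf
    for pi_ne in grid:
        p1 = (p01 - pi_ne) / p11        # Eq. (3)
        p0 = 1.0 - pi_ne / p01          # Eq. (4)
        if p1 <= 0 or p0 <= 0:          # no EE units: effect undefined
            continue
        lower = min(lower, trimmed_mean(y11, p1, False) - trimmed_mean(y01, p0, True))
        upper = max(upper, trimmed_mean(y11, p1, True) - trimmed_mean(y01, p0, False))
    return lower, upper
```

Setting `monotone=True` fixes the NE proportion at zero, which reproduces the special case discussed under Assumption 1. Because the bounds are large sample quantities, the values returned for a small sample should be read only as rough estimates.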
4. PARAMETRIC MODELS AND PRIOR DISTRIBUTIONS

4.1. Parametric Models

To sharpen inferences with real data and finite samples, we can specify a parametric model. Note that, even if parametric models are required to achieve identification, the nature of our parametric assumptions is different from the usual distributional assumptions employed in selection models: our assumptions are related to variables observed on meaningful subgroups of individuals, rather than to error terms, whose meaning strongly depends upon and varies with functional form details of the model specifications. In addition, covariates are included for two reasons. First, covariates generally improve the precision of parameter estimates because they improve the prediction of the missing potential outcomes, both wage and employment. Second, covariates generally make assumptions more plausible because they are conditional rather than marginal; for discussion, see Frangakis and Rubin (1999) on the relative plausibility of latent ignorability versus ignorability. Specific estimation and computational complications when
dealing with parametric mixture models, such as those used by us, are detailed in Section 5.

Given covariates $X$, we first model the potential outcomes

\[
f(S(1), S(0), Y(1), Y(0) \mid X) = f(S(1), S(0) \mid X)\, f(Y(1), Y(0) \mid S(1), S(0), X)
\]

where the first factor is the conditional distribution of the principal strata given the covariates, and the second factor is the conditional distribution of the potential wages $(Y(1), Y(0))$ given the principal stratum and the covariates. Note that we specify the conditional distribution of the potential wages given the principal strata, which are defined by the joint values of the potential employment indicators, both observed and missing. We then model the treatment assignment mechanism that determines how the participants are assigned to the training group or the control group, and thereby which potential outcomes are observed

\[
f(Z \mid S(1), S(0), Y(1), Y(0), X)
\]

In randomized studies with known covariates used to make decisions of treatment assignment, the treatment assignment mechanism is ignorable, i.e., it can be ignored when drawing Bayesian or likelihood-based inferences about the treatment effects (Rubin, 1976, 1978).

For illustration, we model the probabilities of the principal strata using a multinomial logit:

\[
\Pr(G_i = g \mid X_i, \theta) = \frac{\exp(\alpha_g + \beta_g^T X_i)}{\sum_{g'} \exp(\alpha_{g'} + \beta_{g'}^T X_i)} \quad \text{for } g \in \{EE, EN, NE, NN\}
\]

where the EE group is taken as the baseline group ($\alpha_{EE} = 0$, $\beta_{EE} = 0$), and $\theta$ represents all model parameters. Note that the usual weakness of the Independence of Irrelevant Alternatives (IIA) property is not a concern here because the "choices" are not real choices; instead, they are memberships in principal strata.$^{2}$ Other models such as the multinomial probit could also be used.

The log potential wages are modeled using standard normal linear regressions: for $(g, z) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}$,

\[
\log(Y_i(z)) \mid G_i = g, X_i, \theta \sim N(\mu_{g,z} + \gamma_{g,z}^T X_i,\ \sigma^2_{g,z})
\]

The family of t-based distributions could also be used, where the regressions are still linear (Liu & Rubin, 1998).
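The collection of all parameters is

\[
\theta = \{(\alpha_g, \beta_g),\ g \in \{EN, NE, NN\}\} \cup \{(\mu_{g,z}, \gamma_{g,z}, \sigma^2_{g,z}),\ (g, z) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}\} \tag{7}
\]

Let $\mathbf{X} = \{X_i, i = 1, \ldots, N\}$, $\mathbf{Z} = \{Z_i, i = 1, \ldots, N\}$, $\mathbf{S}^{obs} = \{S_i^{obs}, i = 1, \ldots, N\}$, and $\mathbf{Y}^{obs} = \{Y_i^{obs}, i = 1, \ldots, N\}$. Let $\pi_{g:i} = \Pr(G_i = g \mid X_i, \theta)$ for $g \in \{EE, EN, NE, NN\}$, and let $N_i(\mu, \tau^2)$ denote the probability density function of a normal distribution with mean $\mu$ and variance $\tau^2$ evaluated at $\log(Y_i^{obs})$. From Table 1, we can see that each of the four observed groups is generally a mixture of two principal strata; hence, the likelihood function is

\[
\begin{aligned}
L(\theta \mid \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) \propto{} & \prod_{i \in O(1,1)} \left[ \pi_{EE:i} N_i(\mu_{EE,1} + \gamma_{EE,1}^T X_i, \sigma^2_{EE,1}) + \pi_{EN:i} N_i(\mu_{EN,1} + \gamma_{EN,1}^T X_i, \sigma^2_{EN,1}) \right] \\
& \times \prod_{i \in O(1,0)} (\pi_{NE:i} + \pi_{NN:i}) \\
& \times \prod_{i \in O(0,1)} \left[ \pi_{EE:i} N_i(\mu_{EE,0} + \gamma_{EE,0}^T X_i, \sigma^2_{EE,0}) + \pi_{NE:i} N_i(\mu_{NE,0} + \gamma_{NE,0}^T X_i, \sigma^2_{NE,0}) \right] \\
& \times \prod_{i \in O(0,0)} (\pi_{EN:i} + \pi_{NN:i})
\end{aligned} \tag{8}
\]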
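To make the structure of the likelihood in Eq. (8) concrete, the following sketch evaluates its logarithm for given parameter values, with the multinomial logit strata probabilities and the normal densities for log wages. It is only an illustration under assumed data structures: the dictionaries keyed by stratum and by (stratum, arm), and all function names, are hypothetical choices rather than the authors' implementation.

```python
import numpy as np

STRATA = ("EE", "EN", "NE", "NN")


def normal_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def strata_probs(X, alpha, beta):
    """Multinomial logit probabilities; EE is the baseline (alpha = beta = 0)."""
    eta = np.column_stack([alpha[g] + X @ beta[g] for g in STRATA])
    eta -= eta.max(axis=1, keepdims=True)      # stabilise the exponentials
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)


def log_likelihood(X, z, s, y, alpha, beta, mu, gamma, sigma2):
    """Log of Eq. (8); wages of non-employed units are ignored (may be NaN)."""
    z, s = np.asarray(z), np.asarray(s)
    probs = strata_probs(X, alpha, beta)
    pi = {g: probs[:, k] for k, g in enumerate(STRATA)}
    # Placeholder wage of 1 for non-employed units; their density is never used.
    log_y = np.log(np.where(s == 1, np.asarray(y, dtype=float), 1.0))

    def dens(g, arm):
        mean = mu[g, arm] + X @ gamma[g, arm]
        return normal_pdf(log_y, mean, sigma2[g, arm])

    lik = np.select(
        [(z == 1) & (s == 1), (z == 1) & (s == 0), (z == 0) & (s == 1)],
        [
            pi["EE"] * dens("EE", 1) + pi["EN"] * dens("EN", 1),   # O(1,1)
            pi["NE"] + pi["NN"],                                   # O(1,0)
            pi["EE"] * dens("EE", 0) + pi["NE"] * dens("NE", 0),   # O(0,1)
        ],
        default=pi["EN"] + pi["NN"],                               # O(0,0)
    )
    return np.log(lik).sum()
```

Working with the density of $\log(Y_i^{obs})$ follows the definition of $N_i(\cdot, \cdot)$ above; a likelihood for raw wages would differ only by a Jacobian term that does not involve the parameters.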
4.2. Incorporating Assumptions into Parametric Models

When the monotonicity assumption is made, the probabilities of the principal strata and the potential wages can be modeled similarly, except that the models are restricted to three principal strata: EE, EN, and NN. In this case, members of the $O(1,0)$ group all belong to the NN stratum, whereas members of the $O(0,1)$ group all belong to the EE stratum, and therefore only the $O(1,1)$ group and the $O(0,0)$ group involve mixtures of principal strata.

The stochastic dominance assumption considering covariates is as follows:

Assumption 2′ (Stochastic dominance).

(2a) (Stochastic dominance under training): For any two individuals $i$ and $j$ with $G_i = EE$ and $G_j = EN$ and identical values of the covariates $X_i = X_j = X$,

\[
\Pr(Y_i(1) \le t \mid G_i = EE, X_i = X, \theta) \le \Pr(Y_j(1) \le t \mid G_j = EN, X_j = X, \theta)
\]

for any real number $t$.

(2b) (Stochastic dominance under no training): For any two individuals $i$ and $j$ with $G_i = EE$ and $G_j = NE$ and identical values of the covariates $X_i = X_j = X$,

\[
\Pr(Y_i(0) \le t \mid G_i = EE, X_i = X, \theta) \le \Pr(Y_j(0) \le t \mid G_j = NE, X_j = X, \theta)
\]

for any real number $t$.

Consider the most general case when covariates are unbounded and can have both positive and negative values. When the monotonicity assumption is not made, the stochastic dominance assumption under training implies that

\[
\gamma_{EE,1} = \gamma_{EN,1}\ (= \gamma_{\cdot,1}); \quad \sigma^2_{EE,1} = \sigma^2_{EN,1}\ (= \sigma^2_{\cdot,1}); \quad \mu_{EE,1} \ge \mu_{EN,1} \tag{9}
\]

and the stochastic dominance assumption under no training implies that

\[
\gamma_{EE,0} = \gamma_{NE,0}\ (= \gamma_{\cdot,0}); \quad \sigma^2_{EE,0} = \sigma^2_{NE,0}\ (= \sigma^2_{\cdot,0}); \quad \mu_{EE,0} \ge \mu_{NE,0} \tag{10}
\]

The readers are referred to the appendix for proof. If covariates are bounded, then the implications of the stochastic dominance assumption depend on the particular bounds of the covariates and could be very tedious to derive. Hence, we still restrict our model to the parallel regression case in Eqs. (9) and (10). Modeling under the stochastic dominance assumption in the presence of the monotonicity assumption can be done similarly.
4.3. Prior Distributions for the Parameters

Because job training datasets usually involve large sample sizes, the influence of prior distributions should be relatively small. We will use some relatively uninformative but convenient prior distributions for the parameters $\theta$. The simulation results in Section 6.1 show that these prior distributions have little effect on the resulting estimates. For more careful calibration and sensitivity analysis of the prior distribution, the readers are referred to Bernardo and Smith (2000).
For the coefficients in the multinomial logit model, we use the following independent prior distributions: for $g \in \{EN, NE, NN\}$,

\[
(\alpha_g, \beta_g) \sim N(0, K_0 I)
\]

where $N(\lambda, \Sigma)$ stands for the multivariate normal distribution with mean vector $\lambda$ and covariance matrix $\Sigma$, $K_0$ is a relatively large constant for the prior to be relatively uninformative (we use $K_0 = 100$), and $I$ denotes an identity matrix of appropriate dimension. For other parameters in the model for log potential wage, we use the following conjugate prior distributions (given the principal stratum $g$):$^{3}$ for $(g, z) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}$,

\[
\sigma^2_{g,z} \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2); \quad (\mu_{g,z}, \gamma_{g,z}^T)^T \mid \sigma^2_{g,z} \sim N(0, K_0 \sigma^2_{g,z} I)
\]

where $\text{Inv-}\chi^2(\nu_0, \sigma_0^2)$ stands for the scaled inverse chi-square distribution with $\nu_0$ degrees of freedom and scale parameter $\sigma_0^2$, and where $\nu_0$ and $\sigma_0^2$ are small constants for the prior to be relatively uninformative. We use $\nu_0 = 2$ and $\sigma_0^2 = 1$.

When the monotonicity assumption is made, we use similar prior distributions as above except without those for $(\alpha_{NE}, \beta_{NE})$ and $(\mu_{NE,0}, \gamma_{NE,0}, \sigma^2_{NE,0})$. When the stochastic dominance assumption is made without the monotonicity assumption, the prior distributions for $(\alpha_g, \beta_g)$ remain the same as above, but the prior distributions for the other parameters become:

\[
\begin{aligned}
\sigma^2_{\cdot,1} &\sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2); & (\mu_{EE,1}, \mu_{EN,1}, \gamma_{\cdot,1}^T)^T \mid \sigma^2_{\cdot,1} &\sim N(0, K_0 \sigma^2_{\cdot,1} I) \quad \text{for } \mu_{EE,1} \ge \mu_{EN,1} \\
\sigma^2_{\cdot,0} &\sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2); & (\mu_{EE,0}, \mu_{NE,0}, \gamma_{\cdot,0}^T)^T \mid \sigma^2_{\cdot,0} &\sim N(0, K_0 \sigma^2_{\cdot,0} I) \quad \text{for } \mu_{EE,0} \ge \mu_{NE,0}
\end{aligned}
\]
When the stochastic dominance assumption is made in the presence of monotonicity assumption, similar prior distributions can be used.
5. COMPUTATIONAL ASPECTS

Because the likelihood function involves mixture distributions, we may encounter multimodality of the posterior distribution of $\theta$. Because our primary objective here is to illustrate the usefulness of the straightforward Bayesian approach within the principal stratification framework, we focus our discussion on cases when the possible modes are well separated
and a single mode dominates the other modes in probability mass. Usual inferences about ATEEE, such as those using the posterior median and posterior intervals, can then be made in the region of the dominating mode.
5.1. Strategy

Our computations follow the general strategy of Gelman, Carlin, Stern, and Rubin (2003). We describe the computing strategy in the more difficult and general case, when neither the monotonicity assumption nor the stochastic dominance assumption is made. We first search for all posterior modes of $\theta$, and then estimate the probability mass around each mode as follows:

1. Find the posterior modes $\theta^{(1)}, \ldots, \theta^{(M)}$.

2. Approximate the posterior distribution around each mode using a multivariate normal distribution centered at that mode with approximate covariance matrix $\Sigma^{(m)}$.$^{4}$ The posterior density of $\theta$ can then be approximated by

\[
\sum_{m=1}^{M} f_m\, N(\theta \mid \theta^{(m)}, \Sigma^{(m)}) \tag{11}
\]

where $f_m$ denotes the probability mass around the mode $\theta^{(m)}$ and is estimated in step 3 below.

3. Let $q^{(1)}, \ldots, q^{(M)}$ denote the (unnormalized) posterior densities at the posterior modes. $f_1, \ldots, f_M$ can be estimated by equating the actual (unnormalized) posterior density $q^{(m)}$ at each mode $\theta^{(m)}$ to the approximate density (11) at that mode. When the modes are well separated, we obtain $f_m \propto q^{(m)} |\Sigma^{(m)}|^{1/2}$.

If the estimated probability mass of a single mode dominates that of the other modes, we then compute the posterior median and posterior intervals in the region around that mode as if it were the only mode. If there is more than one dominant mode, other more complicated strategies must be used, but such discussion is beyond the scope of this chapter.

If we were considering making the monotonicity assumption and/or the stochastic dominance assumption in an application, we could choose among the models by computing the Bayes factors, or by conducting posterior predictive checks (Rubin, 1984; Gelman, Meng, & Stern, 1996), or using other methods for model choice. Because model choice is a completely different issue than implementing Bayesian analysis in our principal stratification setting, further discussion of this topic is beyond the scope of this chapter.
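The estimate of the mode masses in step 3 of the strategy is simple to compute once the log posterior density at each mode and the corresponding covariance matrices are available. The sketch below is an illustration with hypothetical function and argument names; it works on the log scale to avoid overflow when the unnormalized densities differ by many orders of magnitude.

```python
import numpy as np


def mode_masses(log_q, covs):
    """Normalized masses f_m proportional to q^(m) * |Sigma^(m)|^(1/2).

    log_q: log unnormalized posterior density at each mode.
    covs:  approximate covariance matrix at each mode.
    """
    log_w = np.array([
        lq + 0.5 * np.linalg.slogdet(np.asarray(cov))[1]   # log q + 0.5 log|Sigma|
        for lq, cov in zip(log_q, covs)
    ])
    w = np.exp(log_w - log_w.max())    # subtract the maximum before exponentiating
    return w / w.sum()
```

For example, two modes with equal posterior height but covariance matrices $I_2$ and $4 I_2$ receive masses in the ratio $1 : 4$, because $|4 I_2|^{1/2} = 4$.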
5.2. Finding the Modes and Approximating their Masses

We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to search for the posterior modes $\theta^{(1)}, \ldots, \theta^{(M)}$ (with the variance parameters being logarithm-transformed). To estimate the covariance matrix $\Sigma^{(m)}$ around $\theta^{(m)}$ for the normal approximation, one could use quadratic approximation algorithms such as SEM (Meng & Rubin, 1991). However, because we will use MCMC methods, e.g., the Gibbs sampler and the Metropolis–Hastings algorithm, to sample from the posterior distribution, we also apply MCMC to estimate the probability mass. We start $C$ different chains from each mode $\theta^{(m)}$, use the R statistic of Gelman and Rubin (1992) to judge convergence of these $C$ chains, and then obtain $W$ draws of $\theta$, $\theta_1^{(m)}, \ldots, \theta_W^{(m)}$, after approximate convergence; $\Sigma^{(m)}$ can be estimated by the sample covariance matrix. After the probability mass for each mode is approximated as described in Section 5.1, and a single dominating mode is presumably found, draws from the dominating mode can then be used for inference about $ATE_{EE}$.
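As an illustration of the convergence check, the sketch below computes a potential scale reduction factor for one scalar parameter from $C$ chains of equal length. It uses the simple between/within variance form found in textbook treatments, which is an assumption here: the statistic in Gelman and Rubin (1992) includes further corrections, and the chapter does not state which variant was used. The function name is hypothetical.

```python
import numpy as np


def potential_scale_reduction(chains):
    """Simplified potential scale reduction factor for one scalar parameter.

    chains: array of shape (C, n), one row of n post-burn-in draws per chain.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()          # mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)       # between-chain variance
    var_plus = (n - 1) / n * within + between / n       # pooled variance estimate
    return np.sqrt(var_plus / within)
```

Values close to 1 suggest that the chains started from the same mode are sampling the same region; chains that remain around different modes produce values well above 1.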
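5.3. Markov Chain Monte Carlo Methods

MCMC methods were used because it is difficult to draw the parameters directly from their posterior distribution. However, if the principal stratum indicators $\mathbf{G}$ were known, the conditional posterior distribution of $\theta$, given $\mathbf{G}$, is simply

\[
f(\theta \mid \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) \propto p(\theta)\, L(\theta \mid \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) \tag{12}
\]

where $p(\theta)$ is the prior probability density of $\theta$, and $L(\theta \mid \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs})$ is the "complete-data" likelihood function. The conditional posterior distribution (12) has a much simpler structure for inference than the posterior distribution of $\theta$. This suggests computing the posterior distribution of $\theta$ using a data augmentation approach (Tanner & Wong, 1987), where we can treat $\mathbf{G}$ as missing data, and iteratively (a) impute $\mathbf{G}$ given $\theta$ and (b) draw $\theta$ given $\mathbf{G}$, in analogy with the EM algorithm.

In the imputation step, a random draw of $\mathbf{G}$ is taken from the conditional distribution of the principal strata, which is given as follows. In each expression, the conditioning set is $(\theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs})$.

For $i \in O(1,1)$:

\[
\begin{aligned}
f(G_i = EE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{EE:i} N_i(\mu_{EE,1} + \gamma_{EE,1}^T X_i, \sigma^2_{EE,1})}{\pi_{EE:i} N_i(\mu_{EE,1} + \gamma_{EE,1}^T X_i, \sigma^2_{EE,1}) + \pi_{EN:i} N_i(\mu_{EN,1} + \gamma_{EN,1}^T X_i, \sigma^2_{EN,1})} \\
f(G_i = EN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{EN:i} N_i(\mu_{EN,1} + \gamma_{EN,1}^T X_i, \sigma^2_{EN,1})}{\pi_{EE:i} N_i(\mu_{EE,1} + \gamma_{EE,1}^T X_i, \sigma^2_{EE,1}) + \pi_{EN:i} N_i(\mu_{EN,1} + \gamma_{EN,1}^T X_i, \sigma^2_{EN,1})} \\
f(G_i = NE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= f(G_i = NN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) = 0
\end{aligned}
\]

For $i \in O(1,0)$:

\[
\begin{aligned}
f(G_i = NE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{NE:i}}{\pi_{NE:i} + \pi_{NN:i}} \\
f(G_i = NN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{NN:i}}{\pi_{NE:i} + \pi_{NN:i}} \\
f(G_i = EE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= f(G_i = EN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) = 0
\end{aligned}
\]

For $i \in O(0,1)$:

\[
\begin{aligned}
f(G_i = EE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{EE:i} N_i(\mu_{EE,0} + \gamma_{EE,0}^T X_i, \sigma^2_{EE,0})}{\pi_{EE:i} N_i(\mu_{EE,0} + \gamma_{EE,0}^T X_i, \sigma^2_{EE,0}) + \pi_{NE:i} N_i(\mu_{NE,0} + \gamma_{NE,0}^T X_i, \sigma^2_{NE,0})} \\
f(G_i = NE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{NE:i} N_i(\mu_{NE,0} + \gamma_{NE,0}^T X_i, \sigma^2_{NE,0})}{\pi_{EE:i} N_i(\mu_{EE,0} + \gamma_{EE,0}^T X_i, \sigma^2_{EE,0}) + \pi_{NE:i} N_i(\mu_{NE,0} + \gamma_{NE,0}^T X_i, \sigma^2_{NE,0})} \\
f(G_i = EN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= f(G_i = NN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) = 0
\end{aligned}
\]

For $i \in O(0,0)$:

\[
\begin{aligned}
f(G_i = EN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{EN:i}}{\pi_{EN:i} + \pi_{NN:i}} \\
f(G_i = NN \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= \frac{\pi_{NN:i}}{\pi_{EN:i} + \pi_{NN:i}} \\
f(G_i = EE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) &= f(G_i = NE \mid \theta, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) = 0
\end{aligned}
\]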
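The imputation step for the $O(1,1)$ group can be sketched as follows; the other three groups follow the same pattern with the probabilities given above. This is a minimal illustration rather than the authors' code: the function names and the tuple layout for the regression parameters are assumptions.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def impute_strata_o11(log_y, X, pi_ee, pi_en, par_ee1, par_en1, rng):
    """Draw G_i in {EE, EN} for units in O(1,1).

    par_ee1 and par_en1 are (mu, gamma, sigma2) tuples for (EE, 1) and (EN, 1);
    pi_ee and pi_en are the unit-level strata probabilities.
    """
    w_ee = pi_ee * normal_pdf(log_y, par_ee1[0] + X @ par_ee1[1], par_ee1[2])
    w_en = pi_en * normal_pdf(log_y, par_en1[0] + X @ par_en1[1], par_en1[2])
    p_ee = w_ee / (w_ee + w_en)                 # conditional probability of EE
    labels = np.where(rng.random(log_y.size) < p_ee, "EE", "EN")
    return labels, p_ee
```

A unit whose log wage lies far closer to the EE regression line than to the EN line is imputed as EE with probability near 1, whereas identical component densities leave the draw governed by the strata probabilities alone.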
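Partition $\theta$ into $\theta_{+} = \{(\alpha_g, \beta_g),\ g \in \{EN, NE, NN\}\}$ and the rest, $\theta_{-}$. Given $\mathbf{G}$, we can draw $\theta$ as follows:

1. Use the Metropolis–Hastings algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970) to draw $\theta_{+}$ from its conditional posterior distribution

\[
\begin{aligned}
f(\theta_{+} \mid \theta_{-}, \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs}) \propto p(\theta_{+})
& \prod_{i \in O(1,1) \cap EE} \pi_{EE:i} \prod_{i \in O(1,1) \cap EN} \pi_{EN:i}
\prod_{i \in O(1,0) \cap NE} \pi_{NE:i} \prod_{i \in O(1,0) \cap NN} \pi_{NN:i} \\
\times & \prod_{i \in O(0,1) \cap EE} \pi_{EE:i} \prod_{i \in O(0,1) \cap NE} \pi_{NE:i}
\prod_{i \in O(0,0) \cap EN} \pi_{EN:i} \prod_{i \in O(0,0) \cap NN} \pi_{NN:i}
\end{aligned}
\]

where $p(\cdot)$ again denotes the prior probability density function. For example, use the modified Newton–Raphson method to estimate $\Lambda$, the conditional posterior mode of $\theta_{+}$, and $\Omega$, the conditional posterior covariance matrix at the mode. We then use the multivariate t distribution $t_d(\Lambda, \Omega)$ as the proposal distribution in the Metropolis–Hastings algorithm. To be more specific, suppose that $\theta_{+}^{cur}$ is the current parameter value; we then generate $\theta_{+}^{new}$ from $t_d(\Lambda, \Omega)$ and update $\theta_{+}^{cur}$ to be $\theta_{+}^{new}$ with probability

\[
\min\left(1,\ \frac{f(\theta_{+}^{new} \mid \theta_{-}, \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs})\; t_d(\theta_{+}^{cur} \mid \Lambda, \Omega)}{f(\theta_{+}^{cur} \mid \theta_{-}, \mathbf{G}, \mathbf{X}, \mathbf{Z}, \mathbf{S}^{obs}, \mathbf{Y}^{obs})\; t_d(\theta_{+}^{new} \mid \Lambda, \Omega)}\right)
\]

where $t_d(\theta_{+} \mid \Lambda, \Omega)$ is the probability density function of $t_d(\Lambda, \Omega)$ evaluated at $\theta_{+}$.

2. For $(g, z) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}$, draw $\sigma^2_{g,z}$ from its conditional posterior distribution, which is an $\text{Inv-}\chi^2$ distribution, then draw $(\mu_{g,z}, \gamma_{g,z}^T)^T$, given $\sigma^2_{g,z}$, from their joint conditional posterior distribution, which is a multivariate normal distribution.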
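Step 2 can be illustrated for a single $(g, z)$ pair, given the units currently imputed to stratum $g$ and observed under arm $z$. The chapter does not write out the conditional posterior, so the sketch below assumes the standard conjugate update for a normal linear regression with the priors of Section 4.3; the function name and argument defaults are hypothetical.

```python
import numpy as np


def draw_wage_params(log_y, X, rng, K0=100.0, nu0=2.0, s0_sq=1.0):
    """One conjugate draw of (mu, gamma, sigma2) for a single (g, z) pair.

    Assumes the prior (mu, gamma) | sigma2 ~ N(0, K0 * sigma2 * I) and
    sigma2 ~ Inv-chi2(nu0, s0_sq), and a normal linear regression for log_y.
    """
    n = len(log_y)
    D = np.column_stack([np.ones(n), X])             # intercept, then covariates
    prec = D.T @ D + np.eye(D.shape[1]) / K0         # posterior precision / sigma2
    V = np.linalg.inv(prec)
    b_n = V @ (D.T @ log_y)                          # posterior mean of (mu, gamma)
    nu_n = nu0 + n
    s_n_sq = (nu0 * s0_sq + log_y @ log_y - b_n @ prec @ b_n) / nu_n
    sigma2 = nu_n * s_n_sq / rng.chisquare(nu_n)     # scaled inverse chi-square draw
    coef = rng.multivariate_normal(b_n, sigma2 * V)  # (mu, gamma) given sigma2
    return coef[0], coef[1:], sigma2
```

With a large number of units and $K_0 = 100$, the draw concentrates around the least squares fit, which is consistent with the remark in Section 4.3 that the priors have little influence in large samples.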
5.4. Applying New Code Checking Tools

Although there has been a rich literature on algorithms fitting Bayesian models, there is only limited work on checking the correctness of the software code implementing these algorithms. Cook, Gelman, and Rubin (2006) proposed simulation methods that can partially fill this gap. We will apply these methods to our context.

The software validation methods in Cook et al. (2006) are based on the posterior quantiles of the true values of scalar parameters over replications of the simulation. In each replication of the simulation, the "true" parameter values $\theta^{true}$ are first generated from their prior distribution $p(\theta)$, and the "observed" data, $\mathbf{S}^{obs}$ and $\mathbf{Y}^{obs}$, are then generated from $p(\mathbf{S}^{obs}, \mathbf{Y}^{obs} \mid \theta^{true}, \mathbf{X}, \mathbf{Z})$, where the covariates $\mathbf{X}$ and the treatment assignment indicators $\mathbf{Z}$ are fixed; $W$ posterior draws $(\theta_1, \theta_2, \ldots, \theta_W)$ are then generated using the to-be-tested software implementing the algorithms in Sections 5.1–5.3. Let $\varphi$ denote a generic scalar component or function of $\theta$, and let

\[
\hat{b}(\varphi^{true}) = \frac{1}{W} \sum_{w=1}^{W} I(\varphi^{true} > \varphi_w)
\]

be the estimated quantile of $\varphi^{true}$. With correctly written software, the distribution of $\hat{b}(\varphi^{true})$ converges to Uniform(0,1) as $W \to \infty$. We can repeat the simulation of $\theta^{true}$, $\mathbf{S}^{obs}$, $\mathbf{Y}^{obs}$, and $(\theta_1, \theta_2, \ldots, \theta_W)$ $R_{rep}$ times, and calculate the estimated quantile $\hat{b}_r$ of $\varphi^{true}$ in each replication. The distribution of

\[
B_{\varphi}^2 = \sum_{r=1}^{R_{rep}} \left( \Phi^{-1}(\hat{b}_r) \right)^2
\]

should converge to $\chi^2_{R_{rep}}$ for correctly written software, where $\Phi$ is the CDF of the standard normal distribution. Deviation of the distribution of the posterior quantiles from the uniform distribution can then be quantified by $p_{\varphi} = F_{\chi^2_{R_{rep}}}(B_{\varphi}^2)$, where $F_{\chi^2_{R_{rep}}}$ is the CDF of the $\chi^2_{R_{rep}}$ distribution. Extreme values of $p_{\varphi}$ (too close to either 0 or 1) indicate errors in the software.

To analyze the simulation output more efficiently, the monitored scalar parameters (components of $\theta$ or scalar functions of $\theta$) are first divided into "batches," each consisting of related parameters. In our context, when neither the monotonicity assumption nor the stochastic dominance assumption is made, each set of $(\alpha_g, \beta_g)$ ($g = EN, NE, NN$) forms a batch, and each set of $(\mu_{g,z}, \gamma_{g,z}, \log(\sigma^2_{g,z}))$ ($(g, z) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}$) forms a batch; when only the monotonicity assumption is made, we can form similar batches without $(\alpha_{NE}, \beta_{NE})$ or $(\mu_{NE,0}, \gamma_{NE,0}, \log(\sigma^2_{NE,0}))$; when only the stochastic dominance assumption is made, $(\mu_{EE,1}, \mu_{EN,1}, \gamma_{\cdot,1}, \log(\sigma^2_{\cdot,1}))$ forms a batch and $(\mu_{EE,0}, \mu_{NE,0}, \gamma_{\cdot,0}, \log(\sigma^2_{\cdot,0}))$ forms another batch; similar batches can be formed when both assumptions are made.

Cook et al. (2006) recommended two complementary approaches to analyze the simulation output. In the exploratory approach, the $p_{\varphi}$-values are first transformed into statistics $z_{\varphi} = \Phi^{-1}(p_{\varphi})$, and the absolute values of these $z_{\varphi}$ statistics are then plotted in batches. Large values of $|z_{\varphi}|$ indicate errors in the software. In the confirmatory approach, for each batch of related parameters, a new scalar parameter is defined as the mean of all parameters in the batch, and the $p_{\varphi}$-values for these new parameters are calculated from the posterior samples. The (two-sided) p-value for each such $p_{\varphi}$-value is multiplied by the total number of batches to achieve a simple Bonferroni corrected p-value.

Because the monotonicity assumption is less plausible in practice than the stochastic dominance assumption, below we focus on the case when the monotonicity assumption is not made, and apply the new software checking tools to the Bayesian analysis, first without and then with the stochastic dominance assumption. Code checking in the presence of the monotonicity assumption can be done analogously.

5.4.1. Code Checking without the Stochastic Dominance Assumption

We first perform a simulation of 20 replications to check the software developed to fit the model without the stochastic dominance assumption, using the computing strategy and algorithms described in Sections 5.1–5.3 with $C = 3$ and $W = 5000$. We use one covariate $X \sim N(0, 4)$ and let each participant be randomized into the training group ($Z_i = 1$) with probability 0.5 and the control group ($Z_i = 0$) with probability 0.5. The sample size is $N = 300$. We use the specific prior distribution in Section 4.3 with $K_0 = 100$, $\nu_0 = 2$, and $\sigma_0^2 = 1.0$.$^{5}$

In each replication, there are multiple posterior modes; however, each MCMC chain stayed around its starting mode, indicating that the modes are well separated; moreover, there is a single major mode whose estimated probability mass is almost 1. Therefore, we monitor the parameter draws from the major mode. All absolute $z_{\varphi}$ statistics are plotted in Fig. 1, with each row representing a different batch of parameters. Almost all $z_{\varphi}$ statistics are less than 2, except for one such statistic, which is neither overly large nor unexpected, given the number of parameters. Moreover, the smallest Bonferroni-adjusted p-value is 0.36. So there is no indication of incorrectly written software.

5.4.2. Code Checking with the Stochastic Dominance Assumption

We perform a simulation of 20 replications to check the software developed to fit the model with the stochastic dominance assumption. The simulation setting is similar to that in Section 5.4.1. In each replication, there is only one mode. The absolute $z_{\varphi}$ statistics for the parameters are plotted in Fig. 2. Only one $z_{\varphi}$ statistic is moderately larger than 2. Moreover, the smallest Bonferroni-adjusted p-value is 0.17. So there is no indication of incorrectly written software.
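The posterior quantile check can be illustrated on a toy problem where the posterior is known exactly. The sketch below does not use the chapter's model: it assumes a normal prior and normal data with known variance, so that a correct "sampler" is available in closed form, and it evaluates the quantile statistic, the sum of squared normal scores, and the chi-square tail measure described above. All names are hypothetical, and the closed-form chi-square CDF used here is valid only for an even number of replications.

```python
import math
import random
from statistics import NormalDist


def chi2_cdf_even(x, k):
    """CDF of a chi-square distribution with an even number k of degrees of freedom."""
    half, term, total = x / 2.0, 1.0, 1.0
    for j in range(1, k // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return 1.0 - math.exp(-half) * total


def validate_sampler(n_rep=20, n_draws=1000, n_obs=10, seed=0):
    """Quantile-based check for a toy model: theta ~ N(0,1), y_i | theta ~ N(theta,1)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    quantiles = []
    for _ in range(n_rep):
        theta_true = rng.gauss(0.0, 1.0)                         # draw from the prior
        y = [rng.gauss(theta_true, 1.0) for _ in range(n_obs)]   # data given theta_true
        post_var = 1.0 / (n_obs + 1)                             # exact posterior variance
        post_mean = sum(y) * post_var                            # exact posterior mean
        draws = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(n_draws)]
        q = sum(theta_true > d for d in draws) / n_draws         # estimated quantile
        q = min(max(q, 0.5 / n_draws), 1.0 - 0.5 / n_draws)      # keep inv_cdf finite
        quantiles.append(q)
    b2 = sum(std_normal.inv_cdf(q) ** 2 for q in quantiles)
    return quantiles, b2, chi2_cdf_even(b2, n_rep)
```

Replacing the exact posterior draws with draws from a deliberately wrong distribution (for example, a posterior variance that is too small) pushes the quantiles toward 0 and 1 and the final value toward 1, which is the kind of deviation the method is designed to detect.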
[Figure 1: dot plot titled "Without Stochastic Dominance," one row per batch of parameters; horizontal axis: absolute z transformation of p values, from 0.0 to 2.5.]

Fig. 1. Scalar Validation $z_{\varphi}$ Statistics without Stochastic Dominance Assumption. Each Row Represents a Batch of Parameters; the Hollow Circles in Each Row Represent the $z_{\varphi}$ Statistics Associated with the Corresponding Batch of Parameters. Solid Circles Represent the $z_{\varphi}$ Statistics Associated with the Mean of the Corresponding Batch of Parameters. Batch 1 = $\{\alpha_{EN}, \beta_{EN}\}$; Batch 2 = $\{\alpha_{NE}, \beta_{NE}\}$; Batch 3 = $\{\alpha_{NN}, \beta_{NN}\}$; Batch 4 = $\{\mu_{EE,1}, \gamma_{EE,1}, \log(\sigma^2_{EE,1})\}$; Batch 5 = $\{\mu_{EE,0}, \gamma_{EE,0}, \log(\sigma^2_{EE,0})\}$; Batch 6 = $\{\mu_{EN,1}, \gamma_{EN,1}, \log(\sigma^2_{EN,1})\}$; Batch 7 = $\{\mu_{NE,0}, \gamma_{NE,0}, \log(\sigma^2_{NE,0})\}$.
6. A FREQUENTIST SIMULATION EXAMPLE

6.1. The Simulated Datasets with Fixed Parameters

We specify the "true" model to be the specific parametric model in Section 4.1, in which the monotonicity assumption does not hold but the stochastic dominance assumption does. We then apply the computing software developed according to Section 5 to datasets simulated with the fixed parameters given below, and illustrate that our Bayesian analysis can correctly estimate $ATE_{EE}$. Assume there is only one pretreatment covariate X, which follows an N(0,4) distribution. The "true" parameters for the multinomial logistic regression are set to be

$$\alpha_{EN} = 1.13, \quad \beta_{EN} = 1$$
$$\alpha_{NE} = 0.235, \quad \beta_{NE} = 2.5$$
$$\alpha_{NN} = 1.205, \quad \beta_{NN} = 4$$
JUNNI L. ZHANG ET AL.

[Figure 2: dot plot titled "With Stochastic Dominance"; one row per batch (Batch 1 to Batch 7); horizontal axis "Absolute z Transformation of p Values", from 0.0 to 2.5.]

Fig. 2. Scalar Validation $z_j$ Statistics with Stochastic Dominance Assumption. Each Row Represents a Batch of Parameters; the Hollow Circles in Each Row Represent the $z_j$ Statistics Associated with the Corresponding Batch of Parameters. Solid Circles Represent the $z_j$ Statistics Associated with the Mean of the Corresponding Batch of Parameters. Batch 1 = {$\alpha_{EN}$, $\beta_{EN}$}; Batch 2 = {$\alpha_{NE}$, $\beta_{NE}$}; Batch 3 = {$\alpha_{NN}$, $\beta_{NN}$}; Batch 4 = {$\mu_{EE,1}$, $\mu_{EN,1}$, $\eta_{\cdot,1}$, $\log(\sigma^2_{\cdot,1})$}; Batch 5 = {$\mu_{EE,0}$, $\mu_{NE,0}$, $\eta_{\cdot,0}$, $\log(\sigma^2_{\cdot,0})$}.
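which gives us the following marginal proportions of the principal strata: $\pi_{EE} = 30\%$, $\pi_{EN} = 30\%$, $\pi_{NE} = 10\%$, $\pi_{NN} = 30\%$. The "true" parameters in the model for log potential wages are set to be

$$\mu_{EE,1} = 8.5, \quad \mu_{EE,0} = 7, \quad \mu_{EN,1} = 2.5, \quad \mu_{NE,0} = 6.5$$
$$\eta_{\cdot,1} = \eta_{\cdot,0} = 1, \quad \sigma^2_{\cdot,1} = \sigma^2_{\cdot,0} = 1$$

The true value of $ATE_{EE}$ is

$$
\begin{aligned}
ATE_{EE} &= E[Y(1) \mid G = EE] - E[Y(0) \mid G = EE] \\
&= \frac{E\{I(G = EE)[Y(1) - Y(0)]\}}{\pi_{EE}} \\
&= \frac{E\{\Pr(G = EE \mid X, \theta)[E(Y(1) \mid G = EE, X, \theta) - E(Y(0) \mid G = EE, X, \theta)]\}}{\pi_{EE}} \\
&= \frac{1}{\pi_{EE}} E\left[\frac{\exp(\mu_{EE,1} + \eta_{\cdot,1} X + 0.5\,\sigma^2_{\cdot,1}) - \exp(\mu_{EE,0} + \eta_{\cdot,0} X + 0.5\,\sigma^2_{\cdot,0})}{\sum_g \exp(\alpha_g + \beta_g X)}\right]
\end{aligned}
\tag{13}
$$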
Table 2. Means and Standard Deviations (in Parentheses) of the Employment Rates and the Average Wage for the Employed Groups Within the Training Group and the Control Group for the 100 Simulated Datasets.

                                 Training            Control
Employment rate                  60.2% (4.2%)        39.3% (4.3%)
Average wage for the employed    1283.77 (440.24)    1310.83 (532.86)
which, as estimated by simulating a large sample of X from N(0,4), is equal to 2037.6. Therefore, training improves the employment rate by 30% − 10% = 20%, and the average wage for the EE group by 2037.6. We generated 100 sets of observed data with the above parameters held fixed; this is therefore a frequentist simulation. In each dataset, X values for 300 participants were drawn according to N(0,4); the principal stratum for each participant was then drawn according to $\Pr(G_i = g \mid X_i, \theta)$. These 300 participants were then randomized into the training group ($Z_i = 1$) with probability 0.5 and the control group ($Z_i = 0$) with probability 0.5; the observed employment indicator $S_i^{obs}$ was 1 if $G_i = EE$, or $Z_i = 1$ and $G_i = EN$, or $Z_i = 0$ and $G_i = NE$, and was 0 otherwise. The observed wage $Y_i^{obs}$ was drawn according to $f(Y_i(Z_i) \mid G_i, X_i, \theta)$ if $(G_i, Z_i) \in \{(EE, 1), (EE, 0), (EN, 1), (NE, 0)\}$ and was equal to * otherwise. Table 2 shows the means and standard deviations of the employment rates and the average wages for the employed groups within the training group and the control group for the 100 datasets.
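The data-generating steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: EE is taken as the reference stratum of the multinomial logit (both coefficients zero), wages are taken to be log-normal given the stratum and covariate, the parameter values are copied as printed in this section, and all names (`LOGIT`, `simulate_dataset`, and so on) are hypothetical. The sketch is not claimed to reproduce Table 2 or the value 2037.6.

```python
import math
import random

# Multinomial-logit coefficients (alpha_g, beta_g) as printed in the text.
# Assumption: EE is the reference stratum, so its coefficients are 0.
LOGIT = {"EE": (0.0, 0.0), "EN": (1.13, 1.0), "NE": (0.235, 2.5), "NN": (1.205, 4.0)}
# Log-wage intercepts for the (stratum, arm) cells in which a wage is observed.
MU = {("EE", 1): 8.5, ("EE", 0): 7.0, ("EN", 1): 2.5, ("NE", 0): 6.5}
ETA = {1: 1.0, 0: 1.0}      # slope on X under training (1) and control (0)
SIGMA2 = {1: 1.0, 0: 1.0}   # log-wage variance under training and control


def stratum_probs(x):
    """Pr(G = g | X = x) under the multinomial logit."""
    linear = {g: a + b * x for g, (a, b) in LOGIT.items()}
    top = max(linear.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(v - top) for g, v in linear.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def simulate_dataset(n=300, seed=None):
    """Draw one dataset of n participants following the steps in the text."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)  # N(0, 4) has standard deviation 2
        probs = stratum_probs(x)
        g = rng.choices(list(probs), weights=list(probs.values()))[0]
        z = 1 if rng.random() < 0.5 else 0  # randomization to training / control
        employed = g == "EE" or (z == 1 and g == "EN") or (z == 0 and g == "NE")
        if employed:
            log_wage = rng.gauss(MU[(g, z)] + ETA[z] * x, math.sqrt(SIGMA2[z]))
            y_obs = math.exp(log_wage)
        else:
            y_obs = None  # wage undefined for the unemployed ("*" in the text)
        rows.append({"x": x, "g": g, "z": z, "s_obs": int(employed), "y_obs": y_obs})
    return rows


datasets = [simulate_dataset(300, seed=k) for k in range(100)]
```

Fixing the seed per dataset keeps the 100 replications reproducible, which is convenient when the same datasets are analyzed under more than one model.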
6.2. Bayesian Analysis with Stochastic Dominance

Because the stochastic dominance assumption holds in the example of Section 6.1, we first discuss the Bayesian analysis of the simulated data under the stochastic dominance assumption. We use a prior distribution similar to that used in Section 5.4.1. For each of the 100 simulated datasets, and for each posterior draw of $\theta$ (given the observed dataset), $ATE_{EE}$ in Eq. (13) is estimated using the empirical distribution of X by

$$
\frac{\sum_{i=1}^{N} \left[\exp(\mu_{EE,1} + \eta_{\cdot,1} X_i + 0.5\,\sigma^2_{\cdot,1}) - \exp(\mu_{EE,0} + \eta_{\cdot,0} X_i + 0.5\,\sigma^2_{\cdot,0})\right] \Big/ \sum_g \exp(\alpha_g + \beta_g X_i)}{\sum_{i=1}^{N} 1 \Big/ \sum_g \exp(\alpha_g + \beta_g X_i)}
$$
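As a rough sketch of this plug-in estimator, the function below evaluates the ratio for one parameter draw and a list of observed covariate values. The dictionary layout of `theta` and the function name `ate_ee` are assumptions made for illustration; EE is again treated as the reference stratum, so its term in the sum over strata equals exp(0) = 1.

```python
import math


def ate_ee(theta, xs):
    """Plug-in ATE_EE for one parameter draw, averaged over observed covariates xs."""
    numerator = 0.0
    denominator = 0.0
    for x in xs:
        # Pr(G = EE | x) = 1 / sum_g exp(alpha_g + beta_g * x), EE being the reference.
        p_ee = 1.0 / sum(math.exp(a + b * x) for a, b in theta["logit"].values())
        mean_treated = math.exp(theta["mu_ee1"] + theta["eta1"] * x + 0.5 * theta["sigma2_1"])
        mean_control = math.exp(theta["mu_ee0"] + theta["eta0"] * x + 0.5 * theta["sigma2_0"])
        numerator += p_ee * (mean_treated - mean_control)
        denominator += p_ee
    return numerator / denominator


# Hypothetical draw: the values printed in Section 6.1, used here only as an example.
example_theta = {
    "logit": {"EE": (0.0, 0.0), "EN": (1.13, 1.0), "NE": (0.235, 2.5), "NN": (1.205, 4.0)},
    "mu_ee1": 8.5, "eta1": 1.0, "sigma2_1": 1.0,
    "mu_ee0": 7.0, "eta0": 1.0, "sigma2_0": 1.0,
}
example_value = ate_ee(example_theta, [-1.5, 0.0, 0.4, 2.2])
```

Applying `ate_ee` to every posterior draw yields a posterior sample of $ATE_{EE}$, from which a posterior median and interval can be read off.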
Table 3. Frequentist Properties of Bayesian Analyses for $ATE_{EE} = 2037.6$ for the 100 Simulated Datasets (SD: Stochastic Dominance Assumption).

Estimator                        Mean Bias    Root Mean Squared Error    95% Interval Coverage Rate    95% Interval Mean (s.d.) of Width
Posterior median (with SD)       56.24        695.43                     95%                           3105.49 (1001.19)
Posterior median (without SD)    217.07       854.11                     99%                           3850.83 (1461.19)
For the 100 simulated datasets, the frequentist properties of the Bayesian analysis for $ATE_{EE}$ are shown in Table 3. For all 100 simulated datasets, the 95% posterior intervals correctly suggest that the treatment effect is positive for the EE group. The mean and the standard deviation of the width of these intervals are 3105.49 and 1001.19, respectively, and the coverage rate of these intervals for the true value of $ATE_{EE}$ is at the nominal coverage rate of 95%. Also, the mean bias and the root mean squared error for the posterior median are 56.24 and 695.43, respectively. Thus, we can see that the Bayesian analysis has good frequentist properties for $ATE_{EE}$ in this example.

6.3. Checking Stochastic Dominance

Suppose that, after analyzing the observed data under the stochastic dominance assumption, we want to check whether this assumption is reasonable for the observed data. We propose to use a posterior predictive check with the discrepancy variable

$$T(D, \theta^{s}) = \log L(\theta^{s} \mid X, Z, S^{obs}, Y^{obs}) - \log L(\hat{\theta}^{ns}(D) \mid X, Z, S^{obs}, Y^{obs})$$

where D is a dataset, $\theta^{s}$ is a parameter vector satisfying the stochastic dominance assumption, and $\hat{\theta}^{ns}(D)$ is the major posterior mode for $\theta$ given the dataset D when the stochastic dominance assumption is not made. Specifically, let $D^{obs}$ denote the observed dataset; given the V posterior draws $\{\theta^{s}_1, \ldots, \theta^{s}_V\}$ under the stochastic dominance assumption, we generate a new dataset $D^{rep(v)}$ using each posterior draw $\theta^{s}_v$, with Z and X being fixed as in $D^{obs}$. The posterior predictive p-value can be calculated as

$$\frac{1}{V} \sum_{v=1}^{V} I\left[T(D^{rep(v)}, \theta^{s}_v) \ge T(D^{obs}, \theta^{s}_v)\right]$$
where I is a general indicator function. The posterior predictive p-value is a counterpart of the classical p-value for assessing goodness-of-fit for statistics T (Rubin, 1984). For each of the 100 simulated datasets, the posterior predictive p-value is between 0.4 and 0.6, indicating that the stochastic dominance assumption is reasonable.
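The final averaging step of the posterior predictive check can be sketched as below. Computing the discrepancy T itself requires the model likelihood and a mode finder, which are not reproduced here; the function therefore takes precomputed discrepancy values, one pair per posterior draw. The function name and the use of "greater than or equal to" in the comparison are assumptions for this illustration.

```python
def posterior_predictive_pvalue(t_rep, t_obs):
    """Share of posterior draws v with T(D_rep(v), theta_v) >= T(D_obs, theta_v).

    t_rep[v] and t_obs[v] are the discrepancies evaluated at the same draw theta_v,
    for the replicated and the observed dataset respectively.
    """
    if len(t_rep) != len(t_obs) or not t_rep:
        raise ValueError("need one replicated and one observed discrepancy per draw")
    hits = sum(1 for rep, obs in zip(t_rep, t_obs) if rep >= obs)
    return hits / len(t_rep)


# Hypothetical discrepancy values for four posterior draws (illustrative only).
example_p = posterior_predictive_pvalue([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0])
```

Values of the p-value close to 0 or 1 would signal a conflict between the assumption and the data; values near 0.5, as reported in the text, do not.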
6.4. Bayesian Analysis without Stochastic Dominance

Suppose we analyze the 100 simulated datasets using the general model with neither the stochastic dominance assumption nor the monotonicity assumption. For each of the 100 simulated datasets, we make inferences about $\theta$ and $ATE_{EE}$ based on the major posterior mode. For each posterior draw of $\theta$, $ATE_{EE}$ in Eq. (13) is estimated using the empirical distribution of X by

$$
\frac{\sum_{i=1}^{N} \left[\exp(\mu_{EE,1} + \eta_{EE,1} X_i + 0.5\,\sigma^2_{EE,1}) - \exp(\mu_{EE,0} + \eta_{EE,0} X_i + 0.5\,\sigma^2_{EE,0})\right] \Big/ \sum_g \exp(\alpha_g + \beta_g X_i)}{\sum_{i=1}^{N} 1 \Big/ \sum_g \exp(\alpha_g + \beta_g X_i)}
$$
As we can see from Table 3, inferences under the stochastic dominance assumption are sharper than those without the stochastic dominance assumption, in the sense that the bias and the root mean squared error of the posterior median are smaller, and the 95% posterior intervals are narrower. Moreover, the stochastic dominance assumption eliminates the multimodality problem, because there is only one posterior mode with this assumption, but multiple posterior modes without it.
7. DISCUSSION

In this chapter, we have shown how to address a common problem in the evaluation of job training programs through randomized experiments. Our approach is new, and hopefully innovative and helpful, and differs from the more traditional econometric analysis. Critically, our approach, rather than trying to estimate the effects of the training on wages for the entire population, instead focuses on the subset of individuals who would be employed, that is, those who would have positive wages, whether or not they were job trained. This subgroup is not directly observed, and therefore
latent. But it is a well-defined subgroup, and one of potentially much applied interest, especially because its definition can be conveyed to people in program evaluation without technical training in econometrics. Moreover, because our general approach, principal stratification (Frangakis & Rubin, 2002), obtains estimates on the training’s effect in distinct subgroups of people (i.e., the NE, EN, NN, and EE groups) on both employment and wages (only the EE group), its resulting answers should generalize better to environments outside the current experiment, for example, to different job markets where the chance of getting employed may be different but the effect of training on wages may be the same. That is, because we estimate the effects of training on wages conditional on the type of person, as defined by that person’s covariates and principal stratum, our estimates may generalize to other settings where the proportions of types of people differ but the effects of training on wages for the EE subgroup remain the same as in the current experiment. This is the same sort of reasoning that is used whenever we obtain more conditional estimates (e.g., separately for men and women).
NOTES

1. A brief discussion of how this approach differs from previous approaches, e.g., in economics (Roy, 1951), appears in Imbens and Rubin (2006). The formal notation for potential outcomes first appeared in Neyman (1923) in the context of a randomized experiment.
2. In addition, all of the covariates are individual characteristics (as opposed to choice-specific characteristics), and each of them has coefficients that are choice-specific (i.e., each of them enters the underlying stochastic utilities with a different coefficient): in this case, cross elasticities are not constant, and including a new alternative in the choice set would not leave the odds of the other alternatives unchanged, because a different model should be specified with an additional vector of coefficients.
3. Although zero might not be a good number for the means of the prior distribution of intercepts of equations for log potential wages, because $K_0$ is large and the mean of log potential wage would not be an overly big number, the effect of this prior distribution will be quite small given a large sample size. Hence, we do not explore a careful calibration of the mean of the prior distribution.
4. To make the normal approximation more reasonable, we take the logarithm of the variance parameters (i.e., $\sigma^2_{EE,1}$, $\sigma^2_{EE,0}$, $\sigma^2_{EN,1}$, and $\sigma^2_{NE,0}$) while keeping the other components of $\theta$ as in Eq. (7).
5. In light of the advice of Cook et al. (2006) to choose un-nice numbers, we recognize that the choice of these values could have been better.
ACKNOWLEDGMENT

Junni L. Zhang's research is sponsored in part by the Chinese NSF grant 10401003.
REFERENCES

Angrist, J., Imbens, G., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444–455.
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. Wiley.
Cook, S., Gelman, A., & Rubin, D. B. (2006). Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15, 675–692.
Copas, J. B., & Li, H. G. (1997). Inference for non-random samples (with discussion). Journal of the Royal Statistical Society, Series B, 59, 55–96.
Decker, P. T., & Corson, W. (1995). International trade and worker displacement: Evaluation of the trade adjustment assistance program. Industrial and Labor Relations Review, 48, 758–774.
Dempster, A. P., Laird, N., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data using the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
Frangakis, C. E., & Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86, 365–379.
Frangakis, C. E., & Rubin, D. B. (2002). The defining role of 'principal stratification and effects' for comparing treatments adjusted for posttreatment variables: From treatment noncompliance to surrogate endpoints. Biometrics, 58, 191–199.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). Chapman and Hall.
Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–807.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–511.
Grilli, L., & Mealli, F. (2005). Causal inference through principal stratification to assess the effect of university studies on job opportunities. Submitted.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–162.
Heckman, J. J. (1990). Varieties of selection bias. American Economic Review, 80, 313–318.
Heckman, J. J., Lalonde, R. J., & Smith, J. A. (1999). The economics and econometrics of active labour market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labour economics (Vol. 3A). Amsterdam: North Holland.
Heckman, J. J., & Smith, J. (1998). Evaluating the welfare state. In: S. Strom (Ed.), Econometrics and economic theory in the 20th century: The Ragnar Frisch Centennial Econometric Society monograph series. Cambridge University Press.
Heckman, J. J., Smith, J., & Clements, N. (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies, 487–535.
Heckman, J. J., & Vytlacil, E. J. (1999). Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences, USA, 96, 4730–4734.
Hirano, K., Imbens, G. W., Rubin, D. B., & Zhou, X. H. (2000). Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1, 69–88.
Imbens, G. W., & Rubin, D. B. (1997). Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics, 25, 305–327.
Imbens, G. W., & Rubin, D. B. (2006). Rubin causal model. In: The new Palgrave dictionary of economics. Palgrave: Macmillan.
Kodrzycki, Y. K. (1997). Training programs for displaced workers: What do they accomplish? New England Economic Review, 3, 39–59.
Lee, D. S. (2005). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. NBER Working paper.
Little, R. J. A. (1985). A note about models for selectivity bias. Econometrica, 53, 1469–1474.
Liu, C., & Rubin, D. B. (1998). Ellipsoidally symmetric extensions of the general location model for mixed categorical and continuous data. Biometrika, 85, 673–688.
Manski, C. F. (1989). Anatomy of the selection problem. Journal of Human Resources, 24, 343–360.
Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review Papers and Proceedings, 80, 319–323.
Manski, C. F. (1995). Identification problems in the social sciences. Harvard University Press.
Manski, C. F. (2003). Partial identification of probability distributions. New York: Springer.
Meng, X. L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles. Translated in Statistical Science, 5, 465–480.
O'Leary, C. L. (1995). An impact analysis of employment programs in Hungary. Upjohn Institute Staff Working Paper.
Roy, A. (1951). Some thoughts on the distribution of earnings. Oxford Economic Papers, 3, 135–146.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1975). Bayesian inference for causality: The importance of randomization. Proceedings of the Social Statistics Section of the American Statistical Association, American Statistical Association (pp. 233–239).
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1978). Bayesian inference for causal effects. Annals of Statistics, 6, 34–58.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550.
Zhang, J. L., & Rubin, D. B. (2003). Estimation of causal effects via principal stratification when some outcomes are truncated by "death". Journal of Educational and Behavioral Statistics, 28, 353–368.
Zhang, J. L., Rubin, D. B., & Mealli, F. (2005). Using the EM algorithm to estimate the effects of job training programs on wages. Proceedings of the 55th Session of the International Statistical Institute. International Statistical Institute, Sydney, Australia.
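APPENDIX. PROOF FOR IMPLICATIONS OF STOCHASTIC DOMINANCE IN SECTION 4.2

We only prove the implications of stochastic dominance under training when covariates can take unbounded values; the proof for the implications of stochastic dominance under no training is analogous. For any real number t and for any real values of the covariates X,

$$\Pr(Y_i(1) \le t \mid G_i = EE, X_i = X, \theta) \le \Pr(Y_j(1) \le t \mid G_j = EN, X_j = X, \theta).$$

Equivalently,

$$\Phi\left(\frac{t - \mu_{EE,1} - \gamma_{EE,1}^{T} X}{\sigma_{EE,1}}\right) \le \Phi\left(\frac{t - \mu_{EN,1} - \gamma_{EN,1}^{T} X}{\sigma_{EN,1}}\right)$$

where $\Phi$ is the CDF of the standard normal distribution. This holds if and only if, for any real number t and for any real values of the covariates X,

$$\frac{t - \mu_{EE,1} - \gamma_{EE,1}^{T} X}{\sigma_{EE,1}} \le \frac{t - \mu_{EN,1} - \gamma_{EN,1}^{T} X}{\sigma_{EN,1}}$$

or

$$(\sigma_{EE,1} - \sigma_{EN,1})\, t + (\sigma_{EN,1}\gamma_{EE,1} - \sigma_{EE,1}\gamma_{EN,1})^{T} X \ge \mu_{EN,1}\sigma_{EE,1} - \mu_{EE,1}\sigma_{EN,1}.$$

Because t and X can take arbitrarily large positive and negative values, this is equivalent to $\sigma_{EE,1} - \sigma_{EN,1} = 0$, $\sigma_{EN,1}\gamma_{EE,1} - \sigma_{EE,1}\gamma_{EN,1} = 0$, and hence $0 \ge \mu_{EN,1}\sigma_{EE,1} - \mu_{EE,1}\sigma_{EN,1}$. In other words, $\sigma^2_{EE,1} = \sigma^2_{EN,1}$ ($= \sigma^2_{\cdot,1}$), $\gamma_{EE,1} = \gamma_{EN,1}$ ($= \gamma_{\cdot,1}$) and $\mu_{EE,1} \ge \mu_{EN,1}$.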
GRAPHICAL DIAGNOSTICS OF ENDOGENEITY

Xavier de Luna and Per Johansson

ABSTRACT

We show that in sorting cross-sectional data, the endogeneity of a variable may be successfully detected by graphically examining the cumulative sum of the recursive residuals. Moreover, the sign of the bias implied by the endogeneity may be deducible through such graphs. In general, instrumental variables are needed to implement the graphical test. However, when a continuous or ordered (e.g. years of schooling) variable is suspected to be endogenous, a graphical test for misspecification due to endogeneity (e.g. self-selection) can be obtained without instrumental variables.
1. INTRODUCTION

In regression models for continuous responses, all sorts of model misspecifications may be diagnosed by an analysis of ordinary residuals, i.e. those from an ordinary least squares (OLS) estimator. Endogeneity of variables is a notable exception, however; with linear models, this is because the OLS estimated parameter is consistent for the reduced form parameter (rather than the structural parameter).
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 147–166
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00006-0
The purpose of this chapter is to show how recursive residuals associated with a specific ordering of the data can be analyzed graphically in order to diagnose the endogeneity of a variable (see Note 1). In particular, we show that a graphical display of the cumulative sum (CUSUM) of the recursive residuals obtained by assuming exogeneity is helpful in diagnosing the presence of endogeneity. Moreover, the sign of the bias implied by this endogeneity (e.g. the direction of a self-selection bias) is also deducible through such graphs. The use of recursive residuals was earlier advocated by Harvey and Collier (1977) to test for functional misspecification in regression analysis (see Note 2). In particular, they proposed a t-statistic to test that the residuals have expectation zero. This test is directly applicable to test against endogeneity when the data are sorted adequately. Instruments may be needed to obtain such a sorting. However, an interesting case arises with a continuous or ordered (e.g. years of schooling) variable whose endogeneity is due to selectivity. Sorting the data with respect to this variable and looking at recursive residuals obtained with a model where exogeneity is assumed allows us to diagnose the misspecification. In this case, endogeneity can thus be diagnosed without specifying a model for the alternative, in contrast with the Hausman test (cf. Hausman, 1978), for which certain instruments are needed.

In Section 2, we start by presenting a theoretical framework allowing us to introduce special orderings of the data that are useful for diagnosing endogeneity. Section 3 presents the methodology based on the calculation of recursive residuals associated with relevant orderings of the data. Graphical displays of these residuals as well as the Harvey–Collier (HC) test statistic are proposed to diagnose the endogeneity of a variable. Sections 4 and 5 present different areas of application. Section 4 looks at a textbook example of endogeneity due to simultaneity of two variables. Section 5 discusses the case of endogenous treatments. First, a homogeneous treatment model is discussed; thereafter this model is extended to a heterogeneous treatment setting; and finally the heterogeneous treatment model is extended to an ordered treatment (e.g. years of education). In Section 6, the graphical test is used to test for endogeneity of education when estimating the return to education. The chapter is concluded with a discussion in Section 7.
2. SORTING SCORES FOR ENDOGENEITY

We consider an observational study where independent observations are available for a response y together with a set of exogenous variables x and a
possibly endogenous variable z (denoted the treatment in the sequel). The following linear statistical model is considered:

$$y = x'\beta + \gamma z + \varepsilon$$

where $\varepsilon$ is a zero mean error term. The exogeneity of x implies that its marginal density, $p(x; \delta)$, $\delta \in \Delta \subset \mathbb{R}^d$, can be ignored without loss of information about $\beta$ and $\gamma$ (see, e.g. Gouriéroux & Montfort, 1995, Chapter 1.5). Similarly, if treatment z is exogenous, its effect can be studied by the sole specification of $p(y \mid z, x; \beta, \gamma)$, i.e. the density of the error term. However, in a typical observational study, the exogeneity of treatment z must be assessed. We propose graphical diagnostics of endogeneity by sorting the data with respect to a sorting score. Only under the exogeneity of the variable of interest should the sorted data be indistinguishable from any other random ordering. In general, there is no unique sorting score for a given problem, but certain sorting scores will be more useful than others. In order to present results, we must give a minimal description of the alternative hypothesis of endogeneity. Hence, let us consider an unobserved variable u such that

$$E(\varepsilon \mid x, z, u) = u \tag{1}$$

Let $m(x, z; \theta) = x'\beta + \gamma z$ where $\theta = (\beta', \gamma)'$. Under H0: "z is exogenous", implying within this framework that $E(u \mid x, z) = 0$, we have that $m(x, z; \theta) = E(y \mid z, x)$, the unbiased and optimal (with minimum mean squared error of prediction) predictor of y, given x and z. Let us define $\theta_c$ (possibly a function of c) as

$$\theta_c = \arg\min_{\theta} E\left[\{y - m(x, z; \theta)\}^2 \mid z, x, s < c\right]$$

where s is a random variable. We call this variable a sorting score when it is such that $\theta_c = \theta$, a constant, only under H0. Such sorting scores will allow us to identify endogeneity by fitting $m(x, z; \theta_c)$ recursively to data sorted with respect to s and looking for evidence of a varying parameter $\theta_c$. We now define a class of particularly useful sorting scores.

Definition 1. A monotone sorting score s for $E(y \mid x, z)$ is such that $\forall c \in \Omega_s$, $m(x, z; \theta_c) \lessgtr E(y \mid z, x)$ for all x, z such that $s > c$, when H0 does not hold.
We start with a preliminary result.

Lemma 1. If $u = \lambda z$, and hence sorting with respect to u or z is identical, then for s=u, $\forall c \in \Omega_s$, $m(x, z; \theta_c) = E(y \mid z, x)$ for all x, z.

Proof. We have that $E(y \mid z, x) = x'\beta + (\gamma + \lambda) z$. Moreover, $\theta_c$ is the solution of $E(y - m(x, z; \theta_c) \mid z, x, s < c) = 0$ $\forall c \in \Omega_s$. But here, $E(y - m(x, z; \theta_c) \mid z, x, s < c) = E(y - m(x, z; \theta_c) \mid z, x)$ and $\theta_c = (\beta', \gamma + \lambda)'$. The lemma is then proved. ∎

Thus, in the situation of the lemma, s=u is not monotone and will not help us identify endogeneity. In fact, $\lambda$ and $\gamma$ are not identifiable in such a situation. On the other hand, let u and z be not proportional but linearly dependent as

$$u = \xi_1 z + \xi_2 \tag{2}$$

where $\xi_i$, i = 1, 2, are random variables with $E(\xi_i \mid x, z) = \alpha_i$, $\alpha_1 \ne 0$ and $V(\xi_i \mid x, z) \ge 0$, for any z. The latter inequality is assumed to be strict for at least one i, i=1,2. Furthermore, when $V(\xi_1 \mid x, z) > 0$, z is assumed to be either always non-negative or always non-positive. Then, it can be shown that u is a monotone sorting score:

Proposition 1. Let u in Eq. (1) be such that under endogeneity of z, u and z are linearly dependent as described by Eq. (2). Then, u is a monotone sorting score.

Proof. We assume endogeneity of z. We get

$$E(y \mid x, z, u) = \beta' x + \gamma z + u$$

and

$$E(y \mid x, z) = \beta' x + \gamma z + E(u \mid x, z)$$

where $E(u \mid x, z) \ne 0$. Furthermore, for a constant $c \in \Omega_u$ (the sample space of u),

$$E(y \mid x, z, u < c) = \beta' x + \gamma z + E(u \mid x, z, u < c)$$

By the linearity assumption (2), we get $E(u \mid x, z, u < c) = \alpha_2^c + \alpha_1^c z$, where $\alpha_i^c = E(\xi_i \mid x, z, u < c)$, i = 1, 2, are constants although functions of c.
Hence,

$$E(y \mid x, z, u < c) = \beta' x + \gamma z + \alpha_2^c + \alpha_1^c z$$

which is linear and equivalent to $m(x, z; \theta_c)$. Moreover, because $\alpha_i^c < \alpha_i$ when $V(\xi_i \mid x, z) > 0$ (this is true for at least one i), $E(u \mid x, z, u < c) < E(u \mid x, z)$ for any positive z and $E(u \mid x, z, u < c) > E(u \mid x, z)$ for any negative z. Thereby the monotonicity of u as sorting score is implied. ∎

Even if u is unobserved, this result is of practical use because the ordering of u can be retrieved by studying z and its relation to certain instruments; see the example sections below. A most convenient case arises when z is a monotone sorting score in itself, since no instrumental variables are then required. This situation can arise when $E(u \mid z)$ is non-linear in z, which is, for instance, the case with random coefficient models (see Section 5.3).
3. GRAPHICAL DIAGNOSTICS

Graphical diagnostics are informal tools for analysis but, at the same time, a very powerful medium for conveying information. A graph may tell more than the value of a test statistic, although the two are obviously complementary. Since ordinary residuals are not really appropriate to identify the endogeneity misspecification, we base our analysis on recursive residuals.
3.1. Recursive Residuals

Let a set of independent observations $(y_i, x_i)$, $i = 1, \ldots, n$, be generated by a model with corresponding density $p(y \mid x; \theta)$. For each $k = q, \ldots, n - 1$, a consistent estimate $\hat{\theta}_k$ of $\theta$, based on $(y_i, x_i)$, $i = 1, \ldots, k$, is assumed to be available. Recursive residuals are then obtained by predicting $y_j$ with $E(y_j \mid x_j; \hat{\theta}_{j-1})$, $j = q + 1, \ldots, n$. This prediction is an estimate, based on observations $(y_i, x_i)$, $i = 1, \ldots, j - 1$, of the optimal (in the mean squared error sense) predictor $E(y_j \mid x_j; \theta)$. The recursive residuals are then standardized prediction errors:

$$w_j = \frac{y_j - E(y_j \mid x_j; \hat{\theta}_{j-1})}{\left[\mathrm{Var}\left(y_j - E(y_j \mid x_j; \hat{\theta}_{j-1}) \mid x_j\right)\right]^{1/2}}, \qquad j = q + 1, \ldots, n$$
Assuming that the involved moments exist and that the model is well specified, these recursive residuals are, at least asymptotically, independent and identically distributed (i.i.d.) with mean zero and variance 1, i.e.

$$\{w_j\} \text{ is i.i.d.}, \quad E(w_j) = 0 \text{ and } \mathrm{Var}(w_j) = 1, \quad j = q + 1, \ldots, n \tag{3}$$

These properties hold exactly when $\theta$ is known.

Example 1. The linear Gaussian model, $y_i = x_i'\theta + \varepsilon_i$, with $\varepsilon_i$ independently and normally distributed with mean zero and variance $\sigma^2$, is an important particular case for which recursive residuals were originally studied, e.g. by Brown, Durbin, and Evans (1975). For this model, we have, for $j = q + 1, \ldots, n$,

$$w_j = \frac{y_j - x_j'\hat{\theta}_{j-1}}{\sigma\left(1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right)^{1/2}}$$

where $X_{j-1} = (x_1', \ldots, x_{j-1}')'$. Assuming that the matrices $X_{j-1}'X_{j-1}$ are invertible, the $w_j$ are homoscedastic, independent, and standard normally distributed (Brown et al., 1975). No asymptotic argument is needed here. Kianifard and Swallow (1996) review the application of recursive residuals (see also Dawid, 1984; de Luna & Johansson, 2000).
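To make the formula of Example 1 concrete, the following sketch computes recursive residuals for the special case of a regression with an intercept and a single regressor, so that $X'X$ is a 2×2 matrix that can be inverted in closed form. The function names and the toy data are hypothetical, $\sigma$ is treated as known, and the code is an illustration rather than the procedure used in the chapter.

```python
import math


def ols_2x2(xs, ys):
    """OLS of y on (1, x); returns intercept, slope and the inverse of X'X."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular; need at least two distinct x values")
    inv = ((sxx / det, -sx / det), (-sx / det, n / det))
    intercept = inv[0][0] * sy + inv[0][1] * sxy
    slope = inv[1][0] * sy + inv[1][1] * sxy
    return intercept, slope, inv


def recursive_residuals(xs, ys, q=2, sigma=1.0):
    """w_j for j = q+1, ..., n, in the order in which the data are supplied."""
    ws = []
    for j in range(q, len(xs)):  # j is the 0-based index of the predicted observation
        intercept, slope, inv = ols_2x2(xs[:j], ys[:j])
        xj = xs[j]
        # x_j'(X'X)^{-1} x_j with x_j = (1, xj)
        leverage = inv[0][0] + 2.0 * inv[0][1] * xj + inv[1][1] * xj * xj
        prediction_error = ys[j] - intercept - slope * xj
        ws.append(prediction_error / (sigma * math.sqrt(1.0 + leverage)))
    return ws


# Toy data (illustrative only), already in the order used for the recursion.
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]
toy_w = recursive_residuals(toy_x, toy_y, q=2, sigma=1.0)
```

With `sigma=1.0` and `q` equal to the number of regression coefficients, the squared recursive residuals sum to the residual sum of squares of the full-sample OLS fit, which gives a convenient numerical check of an implementation.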
3.2. Cumulative Sum and the Harvey–Collier Test

A graphical diagnostic tool is obtained by displaying the recursive residuals graphically. Their CUSUM is most useful. Asymptotically, the recursive residuals have mean zero for a well-specified model. When misspecification arises, in our case when exogeneity of the treatment does not hold, the recursive residuals will typically have non-zero mean. If we then sort the data with respect to s, a monotone sorting score, the residuals will have positive (or negative) mean throughout the recursion. The results of Section 2 can be used to obtain such sortings. In the time series context, the aim when inspecting CUSUMs is to detect a change of the parameter values (see, e.g. Brown et al., 1975). Most often this is believed to be an abrupt structural change at an unknown point in time. The endogeneity misspecification instead translates into small but systematic biases in predictions. Thus, a monotone sorting score should be used for these biases to accumulate instead of cancelling each other out, thereby
guaranteeing the best visual effect when plotting the CUSUM of the recursive residuals. Examples illustrate these issues in the next section. The constancy of the bias sign is also relevant for the test presented below to have power. Harvey and Collier (1977) proposed a simple test based on the sum of the recursive residuals to identify functional misspecification in a regression model. In our context, we use

$$\bar{w} = \frac{1}{n - q} \sum_{i=q+1}^{n} w_i$$

which is the average of the recursive residuals. Then, under H0, asymptotically, $\bar{w}$ is (cf. Eq. (3)) normally distributed with mean zero and variance $1/(n - q)$, and constitutes a test statistic for H0. This test is a necessary complement to the CUSUM plot. Note that a simulation study conducted by de Luna and Johansson (2000) showed the good properties of the HC test in comparison with, e.g. a classic Hausman test.
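A minimal sketch of the CUSUM and of the test based on $\bar{w}$ is given below. It assumes that standardized recursive residuals are already available and uses the normal approximation stated above (mean zero, variance $1/(n - q)$), not a t distribution; the function names and the example residuals are hypothetical.

```python
import math
from itertools import accumulate
from statistics import NormalDist


def cusum(ws):
    """Cumulative sums of the recursive residuals, in recursion order."""
    return list(accumulate(ws))


def hc_statistic(ws):
    """sqrt(n - q) times the mean recursive residual; roughly N(0, 1) under H0."""
    m = len(ws)  # m = n - q residuals
    return math.sqrt(m) * (sum(ws) / m)


def hc_pvalue(ws):
    """Two-sided p-value from the standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(hc_statistic(ws))))


# Hypothetical standardized recursive residuals with a systematic positive bias.
example_w = [1.0, 1.0, 1.0, 1.0]
example_cusum = cusum(example_w)
example_hc = hc_statistic(example_w)
example_p = hc_pvalue(example_w)
```

A CUSUM that drifts steadily in one direction, together with a large absolute value of the statistic, is the pattern that the text associates with endogeneity when the data are sorted by a monotone sorting score.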
4. CONSUMPTION AND INCOME APPLICATION

Consider the model for y and z:

$$y = \beta x_1 + \gamma z + \varepsilon$$

where variable z is endogenous, such that

$$z = \alpha x_2 + \nu$$

with $\varepsilon$ and $\nu$ correlated (see Note 3). Denote $x = (x_1, x_2)$. Assuming $E(\varepsilon \mid \nu)$ to be linear in $\nu$ (e.g. bivariate normality), we have

$$E(y \mid x, z) = \beta x_1 + \gamma z + \lambda (z - \alpha x_2) \tag{4}$$

where $\lambda = 0$ if and only if $\varepsilon$ and $\nu$ are uncorrelated. Here, $s = (z - \alpha x_2)$ is chosen as a sorting score. Note that $\lambda s(x_2, z)$ plays the role of the omitted variable u in Eq. (1). The assumptions of Proposition 1 are met when, for instance, $x_2$ and z are jointly normal (see Note 4), so that $E(x_2 \mid z)$ is linear in z, in which case we can say that s is a monotone sorting score. It can be approximated by $(z - \hat{\alpha} x_2)$, where $\hat{\alpha}$ is a consistent estimate. Note that $x_2 = x_1$ would lead to the non-identifiability of $\gamma$, and the non-applicability of Proposition 1.
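The construction of the estimated sorting score can be sketched as follows. The sketch assumes a scalar instrument $x_2$ and, as in the model above, a first-stage regression without intercept, so that $\hat{\alpha}$ is the OLS slope through the origin; the function name and the toy numbers are hypothetical.

```python
def sorting_order(z, x2):
    """Estimate s = z - alpha_hat * x2 and return the order that sorts the data by s."""
    # OLS through the origin for the first stage z = alpha * x2 + nu.
    alpha_hat = sum(xi * zi for xi, zi in zip(x2, z)) / sum(xi * xi for xi in x2)
    scores = [zi - alpha_hat * xi for zi, xi in zip(z, x2)]
    order = sorted(range(len(z)), key=lambda i: scores[i])
    return alpha_hat, scores, order


# Toy first-stage data (illustrative only).
toy_x2 = [1.0, 2.0, 3.0]
toy_z = [2.5, 3.5, 6.5]
alpha_hat, scores, order = sorting_order(toy_z, toy_x2)
# Observations are then re-ordered by `order` before computing recursive residuals.
```

Once the observations of $(y, x_1, z)$ are re-ordered in this way, the recursive residuals and their CUSUM are computed from the model that assumes exogeneity of z.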
Example 2. We use data on U.S. consumption expenditures ($c_t$), disposable income ($y_t$) and government expenditure ($g_t$), in billions of 1982 dollars, for year t between 1955 and 1986 (see Note 5). However, in order to make the presentation clear, we disregard the observations after the oil crisis that started in 1976. We assume:

$$c_t = \gamma_0 + \gamma_1 y_t + \varepsilon_t \tag{5}$$

where $y_t$ is endogenous such that

$$y_t = \alpha_0 + \alpha_1 g_t + v_t \tag{6}$$
Fig. 1 shows how the endogeneity of yt is revealed by using the residuals from Eq. (6) as a sorting score. The hold-out sample in estimating the recursive residuals is set to 5, thus q=5. The graphical diagnostic is, hence, presented for the 1960–1975 period. In the southeast panel, we can see the usefulness of the CUSUM to diagnose misspecification. The pattern of a monotonous increasing CUSUM indicates that the effects of income CUSUM of rec. residuals
[Figure 1 (four panels). Upper panels: recursive residuals and CUSUM of recursive residuals plotted against year, 1960–1974. Lower panels: recursive residuals and CUSUM of recursive residuals plotted against the sorting score ordering, 1–16.]
Fig. 1. Recursive Residuals and their CUSUM when Regressing the Consumption Expenditures on Disposable Income (yt), using Time Ordering (Upper Panels) HC=1.09 and the Ordering using the Sorting Score: Residuals of Regressing yt on the Government Expenditure Variable (Lower Panels) HC=5.67.
Graphical Diagnostics of Endogeneity
on consumption is under-predicted as a result of the endogeneity. This provides evidence of an unobserved variable that is positively correlated with income, and evidence that the OLS estimate of $\gamma_1$ would be upward biased.
5. TREATMENT-EFFECT MODELS

Consider for the time being a dichotomous treatment: z=1 if an individual is treated and z=0 if not treated. Let y(0) be the potential outcome (e.g. log income) if not treated, and let y(1) be the potential outcome if treated. For a given individual, only one of these potential outcomes is observed, depending on whether she/he is actually treated or not. Let

$$y(0) = \alpha + \varepsilon(0) \quad \text{and} \quad y(1) = \alpha + \gamma + \varepsilon(1) \qquad (7)$$

where $\varepsilon(0)$ and $\varepsilon(1)$ are stochastic variables with $E(\varepsilon(0)) = E(\varepsilon(1)) = 0$, and α and γ are fixed constants. The causal effect from treatment for a given individual is equal to $\Gamma = y(1) - y(0) = \gamma + \varepsilon(1) - \varepsilon(0)$, while α and α+γ are the mean outcomes from no treatment and treatment, respectively. Using the switching regression framework, the observed outcome can be written as

$$y = z\,y(1) + (1 - z)\,y(0) = \alpha + \Gamma z + \varepsilon(0) = \alpha + \gamma z + [\varepsilon(1) - \varepsilon(0)]z + \varepsilon(0) \qquad (8)$$

The econometric problem of identifying γ consists of the following problems: (i) z may be correlated with the unobserved $\varepsilon(0)$; (ii) Γ may be correlated with z; and (iii) Γ may be correlated with $\varepsilon(0)$. The first problem is the standard sample selection problem (Heckman, 1978) or the standard errors-in-variables problem (i.e. that z is measured with an error). The second problem may occur when individuals select on anticipation of $\varepsilon(1) - \varepsilon(0)$ and is denoted in the literature as selection on relative advantages or sorting gains. The last problem occurs when, for instance, the best nontreated (i.e. those with high $\varepsilon(0)$) are those with low gains from treatment (i.e. low $\varepsilon(1)$).
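The switching regression in Eq. (8) and the second problem (selection on gains) can be illustrated with a short simulation. This is only a sketch: the selection rule, the parameter values and the variable names are hypothetical choices made for the illustration, not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a, g = 1.0, 0.5                  # alpha and gamma in Eq. (7), illustrative

e0 = rng.normal(size=n)          # epsilon(0)
e1 = rng.normal(size=n)          # epsilon(1)
y0 = a + e0                      # potential outcome without treatment
y1 = a + g + e1                  # potential outcome with treatment

# Hypothetical selection rule: take the treatment when the own gain is positive.
z = (y1 - y0 > 0).astype(float)

# Observed outcome, first equality in Eq. (8).
y = z * y1 + (1 - z) * y0

# Naive contrast between treated and non-treated observed means.
naive = y[z == 1].mean() - y[z == 0].mean()
print(round(float(naive), 3), g)
```

Because treatment status depends on $\varepsilon(1) - \varepsilon(0)$, the naive contrast does not recover γ in this design, even though both error terms have mean zero in the population.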
5.1. Homogeneous Treatment Effects

In the standard endogenous treatment-effect model (cf. Heckman, 1978), it is assumed that $\varepsilon(0) = \varepsilon(1) = \varepsilon$ and that the outcome equation is described as

$$y = x'\beta + \gamma z + \varepsilon \qquad (9)$$

The choice equation can be described as

$$z^* = x^{*\prime}\alpha + \eta, \qquad z = I(z^* > 0) \qquad (10)$$

where $z^*$ is an unobserved latent variable and $x^*$ includes the covariates in x and at least one variable not included in x. Endogeneity implies that η and ε are correlated. If η and ε are bivariate normal and correlated, we have that $E(\varepsilon \mid x^*, z) = \rho\sigma_\varepsilon(\tilde{\lambda} z - \lambda(1 - z))$, where $\lambda = \phi(x^{*\prime}\alpha)/(1 - \Phi(x^{*\prime}\alpha))$ and $\tilde{\lambda} = \phi(x^{*\prime}\alpha)/\Phi(x^{*\prime}\alpha)$. Assuming joint normality of η and ε, and denoting by $\sigma_\eta^2 = 1$, $\sigma_\varepsilon^2$ and ρ their respective variances and correlation,

$$E(y \mid x^*, z) = x'\beta + \gamma z + \rho\sigma_\varepsilon[\tilde{\lambda} z - \lambda(1 - z)] \qquad (11)$$
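The selection term multiplying $\rho\sigma_\varepsilon$ in Eq. (11) is easy to evaluate once the probit index is available. The sketch below assumes the standard Heckman assignment, in which treated observations receive $\phi/\Phi$ and non-treated observations receive $-\phi/(1 - \Phi)$; the function name `selection_term` is hypothetical, and in practice the index would be $x^{*\prime}\hat{\alpha}$ from a first-stage probit.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: pdf is phi, cdf is Phi

def selection_term(index, z):
    """Return lambda_tilde * z - lambda * (1 - z) at the probit index."""
    phi = _STD_NORMAL.pdf(index)
    Phi = _STD_NORMAL.cdf(index)
    lam = phi / (1.0 - Phi)      # applies to the non-treated (z = 0)
    lam_tilde = phi / Phi        # applies to the treated (z = 1)
    return lam_tilde * z - lam * (1 - z)

# The term is positive for treated and negative for non-treated observations.
print(selection_term(0.0, 1) > 0, selection_term(0.0, 0) < 0)
```

Sorting the sample by this quantity places all non-treated observations before all treated ones, which is the separation of the two subsamples that the text describes.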
The last term in this equation corresponds to the unobserved variable u in Eq. (1). In this case, the hypotheses of Proposition 1 are fulfilled⁶ and u is therefore a monotone sorting score. Here, the missing variable is not observed but can be estimated by using a consistent estimator $\hat{\alpha}$ of α. Note that sorting with respect to $\tilde{\lambda} z - \lambda(1 - z)$ is equivalent to first sorting the subsample for which z=0 with respect to $\Phi(x^{*\prime}\alpha) - 1$, or equivalently, with respect to $\Pr(z = 1 \mid x^*) = \Phi(x^{*\prime}\alpha)$, followed by the subsample with z=1, which is also sorted with respect to $\Phi(x^{*\prime}\alpha)$. $\Pr(z = 1 \mid x^*)$ is often called the propensity score.⁷ More generally, i.e. without the need to specify the relation between Eqs. (10) and (9), theory often predicts that the propensity score is a monotone sorting score (in that individuals maximize the expected return from the treatment; see, e.g. Heckman & Robb, 1986), allowing us to test this prediction.

When there are more than two possible exclusive treatments, say m, the outcome can be written as

$$y = x'\beta + \sum_{k=1}^{m} \gamma_k z_k + \varepsilon$$

where $z_{ki} = 1$ if individual i takes treatment k, k = 1, …, m, and zero otherwise. Then,

$$s(x^*, z_k) = \sum_{k=1}^{m} z_k \Pr(z_k = 1 \mid x^*)$$

is a sorting score under the stringent model assumptions of Lee (1983). More generally, treatments can be compared in pairs, e.g. against the nontreatment class, by using the data concerning only two such treatments and then proceeding as in the above binary choice situation. Finally, when there is a natural order and meaningful numbers can be assigned to treatments, then the situation is similar to a continuous treatment z and, for instance, the Garen (1984) model may be used, as described in Section 5.3. Other models are reviewed in Vella (1998), where a control function is always provided, often in the form of generalized residuals. This control function corresponds to the unobserved variable u and will often provide a useful sorting score.

5.2. Heterogeneous Treatment Effects

Here, we let $r = \varepsilon(1) - \varepsilon(0)$ and $\varepsilon(0) = \varepsilon$. We assume that the outcome equation can be described as

$$y = x'\beta + \gamma z + zr + \varepsilon$$

and that conditional on x the variables $\varepsilon(1)$ and $\varepsilon(0)$ are independent of the treatment z. The selection model is again given by Eq. (10). The reduced form is here equal to

$$E(y \mid x^*) = x'\beta + \gamma \Pr(x^*) + E(r \mid x^*, z = 1)\, p(x^*) = x'\beta + \gamma \Pr(x^*) + E(r \mid p(x^*))\, p(x^*)$$

where, by assumption, $E(\varepsilon \mid x^*) = 0$ and $p(x^*) = \Pr(z = 1 \mid x^*)$ is the propensity score. The second equality follows from Theorem 3 in Rosenbaum and Rubin (1983), assuming that $0 < p(x^*) < 1$. If there is no selection on relative advantages, then $E(r \mid p(x^*)) = 0$. In this case, $E(y \mid x^*)$ is a linear function of the propensity score, which can be seen as the treatment. The last term in the equation corresponds to the unobserved variable u in Eq. (1). Here, the missing variable is once again not observed but can be estimated. Non-linearity in $p(x^*)$ means that there is
heterogeneity in the returns to training and that individuals select themselves into training based on their own known returns.

Remark 1. Note that Proposition 1 does not apply here. However, the theory postulates that individuals maximize their present value of future returns. Thus, stating further that individuals know their own coefficient r, and that they select themselves into training based on their own known return, then not accounting for this in the estimation would lead to an under-prediction of individual returns to training for increasing $p(x^*)$. In other words, the economic theory predicts that $p(x^*)$ is a monotone sorting score, leading to recursive residuals with a positive mean.

If $E(r \mid p(x^*)) = 0$, a conventional two-stage least squares estimator would estimate the average (for the population with covariates x) treatment effect. It is, thus, of great interest to test whether individuals select themselves into training or education based on their expected return. Carneiro, Heckman, and Vytlacil (2003) suggest a Wald test based on the inclusion of a third-order polynomial of $p(x^*)$, with standard errors estimated by bootstrapping.
5.3. The Correlated Random Coefficient Model

Now we turn to the situation when the treatment is ordered or continuous. The correlated random coefficient model, proposed by Garen (1984, 1988), can be written as⁸,⁹

$$y = x'\beta + \gamma z + zr + \varepsilon, \qquad z = f(x^*) + \nu \qquad (12)$$

For simplicity, we here assume no problem with sample selection, thus $E(\varepsilon \mid x, z) = 0$. Here, $x^*$ contains all the variables in x and possibly others. For this model,

$$E(y \mid x^*, z) = x'\beta + \gamma z + zE(r \mid \nu) \qquad (13)$$

Assuming $E(r \mid \nu)$ to be linear in ν (e.g. bivariate normality), we have $E(r \mid \nu) = \lambda(z - f(x^*))$. Exogeneity of z corresponds to λ=0, i.e. uncorrelated r and ν variables. Here, heteroscedasticity is present even if z is exogenous:

$$V(y \mid x^*, z) = z^2 V(r \mid \nu) + \sigma_\varepsilon^2 \qquad (14)$$
where $\sigma_\varepsilon^2$ is the variance of ε. In this case, neglecting the endogeneity of z leads to a misspecification of the conditional expectation, by assuming it to be linear while $zE(r \mid \nu)$ is non-linear in x and z. In this example, not accounting for the endogeneity of z corresponds to omitting the variable¹⁰ $z(z - f(x^*))$ in Eq. (13) that corresponds to u in Eq. (1). The non-linearity of $E(y \mid x, z)$ may often be hidden by the heteroscedastic noise when examining conventional residuals. On the other hand, recursive residuals can often identify the systematic bias in predictions obtained with the sorting score $s(x^*, z) = z(z - f(x^*))$,¹¹ as illustrated in Example 3. Because f is unknown, it must be estimated, which yields the approximate sorting score $z(z - \hat{f}(x^*))$. Note that within this framework, it is possible to proceed without specifying a parametric form for f, using a non-parametric estimate instead.

Remark 2. An important property of the diagnostics (CUSUM plots and HC test) is that they have the correct size as soon as λ=0, even if the random coefficient r exists. This is to be contrasted with a classical Hausman-type test (typically a conventional test on the residuals of the regression of z against instruments, when introduced in the outcome equation) where the null hypothesis is $r \equiv 0$ and which therefore has power even against the mere existence of r.

Remark 3. The Garen model is an econometric translation of a theoretical proposition saying that individuals maximize their present value of future returns. Thus, stating further that individuals know their own coefficient r, we should observe a positive correlation between r and z. If not taken into account, this leads to an under-prediction of individual returns to training (or schooling) for increasing values of z.¹² In other words, the theory that has inspired the Garen model also predicts that z is a monotone sorting score, leading to recursive residuals with a positive mean.

Example 3. The Garen model is considered, and we simulate 100 observations with the following specifications: for i = 1, …, 100,

$$y_i = 1 + 2x_{1i} + \gamma_i z_i + \varepsilon_i, \qquad \gamma_i = 1 + r_i, \qquad z_i = x_{1i} - x_{2i} + \nu_i$$
with $x_{1i} \sim U(0, 1)$, $x_{2i} \sim U(0, 1)$, $\varepsilon_i \sim N(0, 1)$, and $r_i$ and $\nu_i$ bivariate normal with expectations zero, variances 0.36 and 1, respectively, and correlation 0.5. Assuming exogeneity, $E(y_i \mid x_{1i}, z_i) = x_{1i}\beta + z_i\gamma$ is estimated with OLS. Here, $z_i\nu_i$ corresponds to $u_i$ in Eq. (1), and thus from Proposition 1 we know that $u_i = z_i\nu_i$ is a monotone sorting score. Since $E(u_i \mid x_{1i}, z_i) = E(u_i \mid \nu_i) = 0.5\nu_i$, neglecting $u_i$ in the recursive estimation (using $u_i$ as a sorting score) would lead to an over-prediction of the conditional mean and thus a negative recursive residual. This implies that the CUSUM of the recursive residuals is decreasing with u. Several types of residual analyses are presented in Figs. 2 and 3. From the residual plots of Fig. 2, there seems to be no severe heteroscedasticity (the conditional variance is given in Eq. (14)). Identifying the misspecification of the conditional mean is not straightforward with these residual plots, although a trained eye may see some structure in the OLS residuals when sorted with respect to the omitted variable, graph (b), and in
[Figure 2: four panels, (a)–(d), showing OLS residuals (upper panels) and recursive residuals (lower panels) for observations 0–100 under the orderings listed in the caption.]
Fig. 2. Residuals from the Garen Model of Example 3: (a) OLS Residuals Sorted w.r.t. the Variable z; (b) OLS Residuals Obtained with the Ordering of the Omitted Variable (Optimal Sorting Score); (c) Recursive Residuals Obtained with a Random Ordering; and (d) Recursive Residuals Obtained with the Optimal Sorting Score.
[Figure 3: six panels, (a)–(f), showing the CUSUM of OLS residuals (panel a) and the CUSUM of recursive residuals (panels b–f) for observations 0–100 under the orderings listed in the caption.]
Fig. 3. CUSUM Plots of Various Residuals from the Garen Model of Example 3: (a) OLS Residuals Sorted w.r.t. the Sorting Score $z_i(z_i - f(x_i))$, HC($r_i \equiv 0$) = 0.00; (b) Recursive Residuals Obtained with a Random Ordering, HC($r_i \equiv 0$) = 0.40; (c) Recursive Residuals Obtained with the Sorting Score $z_i(z_i - f(x_i))$, HC($r_i \equiv 0$) = 2.53; (d) Ditto but with an Estimate of the Previous Sorting Score, HC($r_i \equiv 0$) = 2.83; (e) Ditto but with the Sorting Score $s(x, z) = z$, HC($r_i \equiv 0$) = 3.75; and (f) Ditto but with the OLS Residuals from the Regression of $z_i$ on $x_i$ as the Sorting Score, HC($r_i \equiv 0$) = 4.74.
the recursive residuals obtained with this same sorting score, graph (d). The CUSUM plots in Fig. 3 are more interesting. We note that the recursive residuals obtained with well-chosen sorting scores all provide a clear sign of the misspecification of the conditional mean of the model (endogenous treatment) by displaying a systematic departure from zero for the CUSUM trajectory. This neat visual effect is due to the monotonic property. Note that, as in Section 4, the residuals of the selection equation as well as the endogenous variable itself also seem to provide monotone sorting scores (see the bottom panel graphs (e) and (f) of Fig. 3). The values of the HC test (for $H_0: r_i \equiv 0$), given in the caption of the figure, confirm the visual impression.
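The design of Example 3 can be reproduced along the following lines. This is a sketch rather than the authors' code: the helper `recursive_residuals`, the seed, and the minus sign between $x_{1i}$ and $x_{2i}$ in the equation for $z_i$ (the operator is not legible in the source) are assumptions. The residuals are one-step-ahead prediction errors scaled by the usual prediction-variance factor, not by an estimate of the error standard deviation.

```python
import numpy as np

def recursive_residuals(X, y, q):
    """One-step-ahead prediction errors, using the first q rows to start."""
    out = []
    for t in range(q, len(y)):
        Xt, yt = X[:t], y[:t]
        XtX_inv = np.linalg.inv(Xt.T @ Xt)
        b = XtX_inv @ Xt.T @ yt                 # OLS on the first t rows
        scale = np.sqrt(1.0 + X[t] @ XtX_inv @ X[t])
        out.append((y[t] - X[t] @ b) / scale)   # scaled prediction error
    return np.array(out)

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
eps = rng.normal(size=n)
cov_r_nu = 0.5 * 0.6 * 1.0                      # correlation 0.5, sds 0.6 and 1
r, nu = rng.multivariate_normal(
    [0.0, 0.0], [[0.36, cov_r_nu], [cov_r_nu, 1.0]], size=n
).T
z = x1 - x2 + nu                                # assumed sign, see lead-in
y = 1 + 2 * x1 + (1 + r) * z + eps

# Model fitted under exogeneity: intercept, x1 and z.
X = np.column_stack([np.ones(n), x1, z])
order = np.argsort(z * nu)                      # sort by the score z * nu
w = recursive_residuals(X[order], y[order], q=5)
cusum = np.cumsum(w)
```

Plotting `cusum` against its index gives the kind of display shown in Fig. 3; replacing `order` with a random permutation or with `np.argsort(z)` gives the other orderings discussed in the text.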
6. RETURN TO SCHOOLING

In this section, we illustrate the kind of insights a graphical display of CUSUM recursive residuals can yield. We use a data set¹³ previously analyzed in Angrist and Krueger (1991), and the aim is to study the effects of compulsory school attendance on wage (see also Angrist, Imbens, & Krueger, 1999). The linear model of interest in Eq. (12) tries to explain the log weekly wage by the number of schooling years (z), while controlling for an age effect (assumed to be exogenous). Schooling systems differ between states (see Angrist & Krueger, 1991, Appendix 2), and for that reason, we perform state-specific analyses. As argued in the previous section, the explanatory variable describing school attendance can be used as a sorting score to check its endogeneity, which is predicted by the theory. We discard individuals with 0–11 years of education, in order to avoid effects due to compulsory schooling laws.¹⁴ In particular, the compulsory schooling period should not suffer from a selection bias. Recursive residuals are computed starting from 13 years of education, 1–12-year cases serving as starting values, together with one individual with 13 years to allow for estimability.¹⁵ In a sense, the question of interest is whether the return to education remains constant after 12 years of education, which most often corresponds to the completion of a high school degree. Recursive residuals are not only useful for diagnosing whether years of schooling are endogenous (selection bias) but also indicate the sign of the selection bias when present (see Remark 3). This is illustrated by the comments below. Fig. 4 displays CUSUM plots of recursive residuals obtained for California, Kansas, New York and Louisiana. We use the Californian case
[Figure 4: four panels (California, Kansas, New York, Louisiana) showing the CUSUM of recursive residuals for log weekly wage, with individuals ordered by increasing number of years of education.]
Fig. 4. Vertical Bars Indicate Years of Education; for instance, Residuals before the First Bar Correspond to 13 Years, Residuals between the First and the Second Bar are for Individuals with 14 Years of Education and so on, up to 20 Years. HC Values are: 2.13 for California, 0.73 for Kansas, 5.23 for New York and 1.34 for Louisiana.
for our main comments: The CUSUM plot indicates that there is actually no selection bias up to year 15 of education (in agreement with the empirical findings of Angrist and Krueger (1991), who compared OLS and two-stage least squares estimates). At year 16 (most often the completion of a university degree), however, there seems to be a positive selection bias. Indeed, although the HC value (1.02) is not significant at this stage, the clear upward trend observed for students with 16 years of education is convincing enough (such a trend is clear in 31 states out of 50; examples include Kansas and New York in Fig. 4, while Louisiana is a counterexample). The nonsignificance of the test is most likely due to the fact that many of the recursive residuals are consistent with the zero mean hypothesis (those corresponding to 13–15 years of education). Finally, years 17–20 (most often postgraduate studies) do not seem to be rewarded at the same rate as
previous years in terms of log wages since there are strong negative trends (over-predictions are observed); here, HC is significant (this downward pattern for postgraduate studies is observed in 39 states out of 50). The final HC value is seldom significant unless, as for California and New York, a large number of individuals are available. This is due to the use of a nonmonotone sorting score. The empirical evidence against its monotonicity is actually interesting per se, because economic theory predicts monotonicity (see Remark 3), more precisely positively biased recursive residuals. If we thus take the theory seriously, then the non-monotonicity suggests a model misspecification (e.g. that return to schooling is increasing at a decreasing rate). Alternatively, we need to reconsider the theory that states that individuals are maximizing their present value of future returns based on their knowledge of their relative advantages for schooling (see Remark 3).
7. DISCUSSION

In this chapter, a graphical analysis of the recursive residuals associated with a sorting of the data has been advocated as a tool for diagnosing endogeneity. We expect practitioners to find this type of analysis a useful complement to the existing tests for exogeneity. A major advantage of the graphical analysis is that, in case of endogeneity, the direction of the bias implied by the endogenous variable is directly available from the CUSUM plot of the recursive residuals, as illustrated with the U.S. data on returns to schooling. When instruments are available, our approach is complementary to Hausman-type tests by providing an appealing graphical diagnostic tool. The proposed HC test has, moreover, the advantage of having no power against the presence of a random coefficient in front of an exogenous variable (see Remark 2).

Monotone sorting scores have been emphasized because they ensure the best power when looking at CUSUM plots of recursive residuals. However, as soon as endogeneity implies non-linearity of the conditional expectation, e.g. random coefficient models or non-normality of the error term in the regression equation for the endogenous variable, then any sorting, even a random sorting, may allow the analyst to diagnose endogeneity (by identifying the non-linearity). In this case, even OLS residuals may be sufficient. This is, however, far from certain because this non-linearity is often weak and the residuals are heteroscedastic. In this chapter, we have shown how recursive residuals associated with a monotone sorting score may overcome this difficulty.
NOTES

1. de Luna and Johansson (2005) also took advantage of such a specific ordering of the data to introduce a test of exogeneity.
2. Endogeneity of a variable is often equivalent to a functional misspecification. For instance, if a random coefficient is associated with a continuous endogenous variable (e.g. Garen's (1984) model), the outcome equation is implicitly non-linear in that variable.
3. Classical examples include: (i) y is a quantity of goods and z its price, and (ii) y is a consumption measure and z is the disposable income.
4. This assumption might seem restrictive, but is only needed to ensure that Proposition 1 can be applied. It should be observed, however, that the linearity of E(s|z) is not a necessary condition for monotonicity.
5. These data are described in Hill et al. (1997) and were obtained from http://www.wiley.com
6. That is, u can be rewritten as Eq. (2).
7. Note that sorting the whole sample with respect to the propensity score does not yield exactly the same sorting. In the latter case, the two subsamples defined by z=0 and 1 will generally not be fully separated by the sorting, since a non-treated individual may in fact have a similar, or even higher, propensity to be treated than the one who is actually treated.
8. See also Wooldridge (1997, 2003).
9. Garen also considered sample selection, i.e. $y = x'\beta + z\delta + zr + \eta + \varepsilon$, with $E(y \mid x^*, z) = x'\beta + z\delta + zE(r \mid \nu) + E(\eta \mid \nu)$ and $E(\eta \mid \nu) = \rho\nu$. Here, we omit η for clarity.
10. The omitted variable is a sorting score since λ=0 under exogeneity.
11. Note that Proposition 1 does not apply here since E(s|z) is not linear in z. However, E(s|z) is quadratic in z, and therefore, fitting a linear model in z leads to a systematic under-prediction of y; hence, s is a monotone sorting score.
12. Using the full sample OLS estimator on Eq. (12) would, of course, lead to a positively biased estimate of the mean return to schooling. Here, we rather discuss the individual's return to schooling, when sorting with respect to schooling, recursively estimating the model with OLS, and thereafter performing out-of-sample predictions using this previous OLS estimate.
13. The data set is available at: http://qed.econ.queensu.ca/jae/
14. Compulsory schooling laws may, in some instances, push students to complete a high school degree (see Angrist & Krueger, 1991, pp. 1004–1005).
15. In this application, we have multiple observations for a given number of years of education. These are left in their original ordering.
ACKNOWLEDGMENTS

We thank Daniel Millimet and an unknown referee for helpful comments. The first author would like to acknowledge the Wikström Foundation for its financial support.
REFERENCES

Angrist, J. D., Imbens, G. W., & Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14, 57–67.
Angrist, J. D., & Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, CVI, 979–1014.
Brown, R. L., Durbin, J., & Evans, J. M. (1975). Techniques for testing the constancy of regression relationships over time (with discussion). Journal of the Royal Statistical Society Series B, 37, 149–192.
Carneiro, P., Heckman, J., & Vytlacil, E. (2003). Understanding what instrumental variables estimate: Estimating marginal and average returns to education. Mimeo.
Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of the Royal Statistical Society Series A, 147, 278–292.
de Luna, X., & Johansson, P. (2000). Testing exogeneity in cross-section regression by sorting data. IFAU Working Paper no. 2000:2.
de Luna, X., & Johansson, P. (2005). Exogeneity in structural equation models. To appear in Journal of Econometrics, 132, 527–543.
Garen, J. (1984). The returns to schooling: A selectivity bias approach with a continuous choice variable. Econometrica, 52, 1199–1218.
Garen, J. (1988). Compensating wage differentials and endogeneity of job riskiness. Review of Economics and Statistics, 70, 9–16.
Gouriéroux, C., & Monfort, A. (1995). Statistics and econometric models. Cambridge: Cambridge University Press.
Harvey, A., & Collier, G. (1977). Testing for functional misspecification in regression analysis. Journal of Econometrics, 6, 103–119.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46, 1251–1271.
Heckman, J. J. (1978). Dummy endogenous variables in a simultaneous equation system. Econometrica, 46, 931–959.
Heckman, J. J., & Robb, R. (1986). Alternative identifying assumptions in econometric models of selection bias. In: H. Wainer (Ed.), Drawing inference from self selected samples (pp. 63–107). Berlin: Springer-Verlag.
Hill, R. C., Griffiths, W. E., & Judge, G. G. (1997). Undergraduate econometrics. New York: Wiley.
Kianifard, F., & Swallow, W. H. (1996). A review of the development and application of recursive residuals in linear models. Journal of the American Statistical Association, 91, 391–400.
Lee, L.-F. (1983). Generalized econometric models with selectivity. Econometrica, 51, 507–512.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
Vella, F. (1998). Estimating models with sample selection bias: A survey. Journal of Human Resources, 38, 127–169.
Wooldridge, J. M. (1997). On two stage least squares estimation of the average treatment effect in a random coefficient model. Economics Letters, 56, 129–133.
Wooldridge, J. M. (2003). Instrumental variables estimation of the average treatment effect in the correlated random coefficient model. Economics Letters, 79, 185–191.
FERTILITY AND THE HEALTH OF CHILDREN: A NONPARAMETRIC INVESTIGATION

Daniel J. Henderson, Daniel L. Millimet, Christopher F. Parmeter and Le Wang

Modelling and Evaluating Treatment Effects in Econometrics. Advances in Econometrics, Volume 21, 167–195. Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved. ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00007-2

ABSTRACT

Although the theoretical trade-off between the quantity and quality of children is well established, empirical evidence supporting such a causal relationship is limited. This chapter applies a recently developed nonparametric estimator of the conditional local average treatment effect to assess the sensitivity of the quantity–quality trade-off to functional form and parametric assumptions. Using data from the Indonesia Family Life Survey and controlling for the potential endogeneity of fertility, we find mixed evidence supporting the trade-off.

1. INTRODUCTION

In the usual treatment effects setup, one is interested in identifying the average impact of a binary variable (the ‘treatment’) on a particular outcome. When the treatment is assigned randomly, the average treatment effect (ATE) is estimated by the difference in means. However, when the treatment is not
assigned randomly, and is correlated with unobservables that also impact the outcome of interest, an exogenous variable (the ‘instrument’) is required such that the instrument is correlated with the treatment, but uncorrelated with the outcome conditional on the treatment. With such an instrument and assuming a constant treatment effect, the usual Wald estimator – given by the ratio of the effect of the instrument on the outcome to the effect of the instrument on the probability of treatment – identifies the ATE. However, the conditions required in this relatively simple framework may break down for two reasons. First, the treatment effect may not be constant across observations. Second, finding an exogenous instrument (i.e., one that is uncorrelated with the outcome conditional only on treatment status) may be unrealistic. In the case of non-constant treatment effects, Imbens and Angrist (1994) introduced the concept of the local average treatment effect (LATE). Under certain conditions (discussed below), Angrist, Imbens, and Rubin (1996) show that the Wald estimator may be interpreted as the ATE for the subpopulation of observations whose treatment assignment is determined by the instrument, referred to as compliers. In the second case, one solution is to incorporate additional covariates into the model such that an instrument is available that is uncorrelated with the outcome conditional on treatment status and the additional covariates. While adding covariates improves the likelihood that an instrument will be exogenous, the instrument must maintain its correlation with the treatment variable conditional on the covariates as well. Assuming one has an instrument satisfying the necessary conditions, the final issue confronted is how one estimates the conditional means required by the Wald estimator. Until recently, parametric or semiparametric methods have been exclusively relied on (see, e.g., Angrist, Graddy, & Imbens, 2000; Yau & Little, 2001; Abadie, 2003). 
However, Frölich (2007) proves the feasibility of nonparametric estimation of the relevant conditional means. In this chapter, we review the general method proposed in Frölich (2007), as well as the actual nonparametric techniques based on Generalized Kernel Estimation, developed in Li and Racine (2004) and Racine and Li (2004), used to operationalize Frölich’s estimator of the conditional LATE (c-LATE). We then apply this estimator to test the theoretical quantity–quality trade-off: an increase in the quantity of children in a household has a negative causal effect on the ‘quality’ of children. Here, the treatment is defined as a household having more than two children (relative to the control of having only two children), and the instrument is an indicator of whether the first two children are of the same gender. The outcome of interest is a measure of child health, weight-for-age, and the covariates included in the model include several individual and family attributes.
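The unconditional Wald estimator described above, the ratio of the instrument's effect on the outcome to its effect on the probability of treatment, can be sketched as follows. The function name `wald_estimator` and the simulated design (a constant treatment effect of 2, a binary instrument, and an unobservable that drives both treatment and outcome) are assumptions made for illustration; they are unrelated to the IFLS data used in the chapter.

```python
import numpy as np

def wald_estimator(y, d, z):
    """Effect of z on the mean of y divided by its effect on P(d = 1)."""
    y, d, z = (np.asarray(v, dtype=float) for v in (y, d, z))
    num = y[z == 1].mean() - y[z == 0].mean()
    den = d[z == 1].mean() - d[z == 0].mean()
    return num / den

rng = np.random.default_rng(4)
n = 50_000
z = rng.integers(0, 2, size=n)              # binary instrument
u = rng.normal(size=n)                      # unobservable driving selection
d = (0.8 * z + u > 0.4).astype(float)       # treatment depends on z and u
y = 1.0 + 2.0 * d + u + rng.normal(size=n)  # constant treatment effect of 2

wald = wald_estimator(y, d, z)
naive = y[d == 1].mean() - y[d == 0].mean() # difference in means, ignores u
print(round(float(wald), 2), round(float(naive), 2))
```

In this design the difference in means is contaminated by the unobservable, whereas the Wald ratio should be close to the true effect. The conditional version studied in the chapter replaces the four sample means with nonparametric estimates of conditional means.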
The model is estimated using data on roughly 5,000 children aged 10 and under from the 2000 wave of the Indonesian Family Life Survey (IFLS). For comparison, we also utilize Two-Stage Least Squares (TSLS) as well as the usual Wald estimator. The results indicate two main conclusions. First, the nonparametric c-LATE estimator reveals a negative c-LATE of residing in a larger household when analyzing the full sample. Moreover, the effect is statistically and economically significant; in the subpopulation of compliers, going from a household with only two children to one with more than two children leads to a decrease in weight-for-age by an average of more than one standard deviation. However, when we analyze the subsample of first and second born children, or only first born children, the estimate is statistically insignificant. Second, TSLS estimation, based on a linear functional form, yields a smaller, and only mildly statistically significant, point estimate using the full sample. Moreover, the TSLS estimates vary across subpopulations of compliers, unlike with the nonparametric c-LATE estimator. Thus, the flexibility afforded by the nonparametric approach matters in this example. Finally, as with the nonparametric estimator, TSLS estimates are not statistically significant in either the subsample of first and second born children, or only first born children. Thus, conclusions regarding the validity of the quantity–quality trade-off – in this application – depend crucially on the sample of children analyzed and the flexibility of the estimator used. The remainder of the chapter is organized as follows. Section 2 discusses the parameter to be estimated as well as the various estimation details. Section 3 discusses the applications and Section 4 concludes.
2. NONPARAMETRIC c-LATE ESTIMATION

2.1. The c-LATE Estimator

We are interested in identifying the causal effect of a binary treatment on an outcome of interest. The fundamental problem with such identification is one of incomplete information (Rosenbaum & Rubin, 1983). While one observes whether the treatment occurs and the outcome conditional on the treatment, the counterfactual is unobserved. Let $Y_i^1$ denote the outcome of observation $i$ if treatment occurs (given by $D_i = 1$) and let $Y_i^0$ denote the outcome if treatment does not occur ($D_i = 0$). Thus, the treatment effect for observation $i$, $\gamma_i$, is given by the difference in the ‘potential outcomes’, $Y_i^1 - Y_i^0$. Moreover, if both ‘potential outcomes’ were observable, the ATE, $\gamma_{ATE} = E[Y_i^1 - Y_i^0]$, as well as any other summary statistic of the
DANIEL J. HENDERSON ET AL.
distribution of the treatment effect, would be trivial to estimate. However, given that only $Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0$ is observed for each observation, estimation of the ATE is no longer straightforward unless treatment assignment is random. Absent data from a randomized experiment, estimators of the ATE (or other mean treatment effect parameters) can be classified into two categories: selection on observables and selection on unobservables. In the former, it is assumed that the econometrician observes a vector of attributes for each observation, $X_i$, such that the distribution of the potential outcomes is independent of treatment assignment conditional on $X$. Formally, this unconfoundedness or conditional independence assumption is given by

$$Y^1, Y^0 \perp D \mid X \tag{1}$$
If (1) is unlikely to hold, one falls into the selection on unobservables case. It is this case we consider.1 When (1) does not hold, estimation of mean treatment effect parameters typically requires an instrumental variable (IV), $Z$. When $Z$ is a binary variable and the following conditions hold:

(A1) FIRST-STAGE: $E[D \mid Z]$ is a nontrivial function of $Z$
(A2) MEAN INDEPENDENCE: $E[Y^j_{i,Z_i} \mid Z_i = 0] = E[Y^j_{i,Z_i} \mid Z_i = 1]$, $j = 0, 1$
(A3) CONSTANT TREATMENT EFFECT: $\gamma_i = \gamma \;\; \forall i$

the ATE is given by

$$\gamma_{ATE} = E[Y_i^1 - Y_i^0] = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]} \tag{2}$$
where $Y^j_{i,Z_i}$ denotes the outcome of observation $i$ under treatment $j$, $j = 0, 1$, and with $Z = Z_i$. Replacing the expectations in Eq. (2) with their sample counterparts results in the usual Wald estimator. In practice, however, the independence assumption (A2) may not hold unconditionally, and the assumption of a constant treatment effect (A3) may be unrealistic (Heckman, 1997). When (A3) does not hold, one must distinguish between different subpopulations in terms of how their treatment assignment responds to variation in the instrument. Following Angrist et al. (1996), every observation must belong to one of four types $\tau \in \{n, c, d, a\}$ given the binary nature of $D$ and $Z$:

NEVER-TAKER: $\tau_i = n \iff D_i^0 = 0$ and $D_i^1 = 0$
COMPLIER: $\tau_i = c \iff D_i^0 = 0$ and $D_i^1 = 1$
DEFIER: $\tau_i = d \iff D_i^0 = 1$ and $D_i^1 = 0$
ALWAYS-TAKER: $\tau_i = a \iff D_i^0 = 1$ and $D_i^1 = 1$

where $D_i^j$ is the value of $D$ for observation $i$ given $Z_i = j$, $j = 0, 1$ (i.e., it is the potential treatment corresponding to setting the instrument to $j$). Never- (always-) takers never (always) receive the treatment regardless of the value of the instrument. The treatment assignment of compliers and defiers, on the other hand, is manipulated by the instrument. Thus, when the treatment effect is not constant, one cannot learn anything about the treatment effect for never- and always-takers. However, under the following conditions:

(B1) MONOTONICITY (no defiers): $\Pr(D_i^1 \ge D_i^0) = 1$
(B2) FIRST-STAGE (population of compliers has positive probability): $\Pr(D_i^1 = 1) > \Pr(D_i^0 = 1)$
(B3) UNCONFOUNDED TYPE: $\Pr(\tau_i = t \mid Z_i = 0) = \Pr(\tau_i = t \mid Z_i = 1)$, $t \in \{n, c, a\}$
(B4) MEAN INDEPENDENCE WITHIN SUBPOPULATIONS: $E[Y^0_{i,Z_i} \mid Z_i = 0, \tau_i = t] = E[Y^0_{i,Z_i} \mid Z_i = 1, \tau_i = t]$, $t \in \{n, c\}$; $E[Y^1_{i,Z_i} \mid Z_i = 0, \tau_i = t] = E[Y^1_{i,Z_i} \mid Z_i = 1, \tau_i = t]$, $t \in \{a, c\}$

the ATE for the subpopulation of compliers, also known as the LATE, is given by

$$\gamma = E[Y_i^1 - Y_i^0 \mid \tau_i = c] = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]} \tag{3}$$
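The ratio on the right-hand side of Eqs. (2) and (3) can be computed by replacing each expectation with a subsample mean. The following is a minimal sketch in Python, not code from the chapter; the function name `wald_estimate` and the toy data are assumptions made purely for illustration.

```python
import numpy as np


def wald_estimate(y, d, z):
    """Sample analogue of the ratio in Eqs. (2)/(3): reduced form over first stage."""
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    on, off = z == 1, z == 0
    reduced_form = y[on].mean() - y[off].mean()   # E[Y|Z=1] - E[Y|Z=0]
    first_stage = d[on].mean() - d[off].mean()    # E[D|Z=1] - E[D|Z=0]
    if first_stage == 0:
        raise ValueError("first stage is zero; the ratio is undefined")
    return reduced_form / first_stage


# Toy data, invented purely for illustration (not from the IFLS).
y = np.array([1.0, 2.0, 4.0, 5.0, 0.0, 1.0, 1.0, 2.0])
d = np.array([0, 1, 1, 1, 0, 0, 0, 1])
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
wald = wald_estimate(y, d, z)
print(wald)
```

The same number is read as an ATE under (A1)–(A3) and as a LATE under (B1)–(B4); only the interpretation changes, not the computation.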
Replacing expectations with their sample counterparts yields the same estimator as in Eq. (2); the only difference is in the interpretation (see Lee, 2005, for further details). When neither (A2) nor (A3) holds, one may identify the c-LATE, conditional on a vector of observables, $X$, under certain conditions (Angrist & Imbens, 1995). The c-LATE evaluated at a particular $x$, $\gamma(x)$, is defined as

$$\gamma(x) = E[Y_i^1 - Y_i^0 \mid X_i = x, \tau_i = c] = \frac{E[Y \mid X = x, Z = 1] - E[Y \mid X = x, Z = 0]}{E[D \mid X = x, Z = 1] - E[D \mid X = x, Z = 0]} \tag{4}$$
The overall c-LATE, $\gamma$, is a weighted average of $\gamma(x)$, where the weights represent the distribution of $x$ in the subpopulation of compliers; this is given by

$$\gamma = \int \gamma(x)\, dF_{x \mid \tau = c} \tag{5}$$

where $F_{x \mid \tau = c}$ denotes the distribution of $x$ in the subpopulation of compliers. If the following conditions hold:

(C1) MONOTONICITY (no defiers): $\Pr(D_i^1 \ge D_i^0 \mid X) = 1$
(C2) FIRST-STAGE (population of compliers has positive probability): $\Pr(D_i^1 = 1 \mid X) > \Pr(D_i^0 = 1 \mid X)$
(C3) UNCONFOUNDED TYPE: $\Pr(\tau_i = t \mid X_i = x, Z_i = 0) = \Pr(\tau_i = t \mid X_i = x, Z_i = 1)$, $t \in \{n, c, a\}$, $\forall x \in \mathrm{Supp}(X)$
(C4) MEAN INDEPENDENCE WITHIN SUBPOPULATIONS: $E[Y^0_{i,Z_i} \mid X_i = x, Z_i = 0, \tau_i = t] = E[Y^0_{i,Z_i} \mid X_i = x, Z_i = 1, \tau_i = t]$, $t \in \{n, c\}$, $\forall x \in \mathrm{Supp}(X)$; $E[Y^1_{i,Z_i} \mid X_i = x, Z_i = 0, \tau_i = t] = E[Y^1_{i,Z_i} \mid X_i = x, Z_i = 1, \tau_i = t]$, $t \in \{a, c\}$, $\forall x \in \mathrm{Supp}(X)$
(C5) COMMON SUPPORT: $\mathrm{Supp}(X \mid Z = 0) = \mathrm{Supp}(X \mid Z = 1)$

then Eqs. (4) and (5) may be estimated either parametrically, semiparametrically, or nonparametrically. Given that estimates of $\gamma(x)$ and $\gamma$ may be sensitive to the choice of functional form assumed or the additive separability imposed in semiparametric models, we utilize a fully nonparametric approach recently developed in Frölich (2007). To proceed in the estimation of $\gamma$, we require specific knowledge of the distribution of the covariates for the subpopulation of compliers. However, since the subpopulation of compliers is unknown, $dF_{x \mid \tau = c}$ is unknown. Relying on Bayes’ theorem simplifies matters; we can replace $dF_{x \mid \tau = c}$ with

$$\frac{\Pr(\tau = c \mid X = x)\, dF_x}{\Pr(\tau = c)} \tag{6}$$

in Eq. (5). Noting that $\Pr(\tau = c \mid X = x) = E[D \mid X = x, Z = 1] - E[D \mid X = x, Z = 0]$ and substituting Eq. (4) for $\gamma(x)$, Eq. (5) may be rewritten as

$$\gamma = \frac{\int \left( E[Y \mid X = x, Z = 1] - E[Y \mid X = x, Z = 0] \right) f(x)\, dx}{\Pr(\tau = c)} \tag{7}$$

This representation of $\gamma$ still depends on an unknown, $\Pr(\tau = c)$. However, replacing $\Pr(\tau = c) = \int \Pr(\tau = c \mid X = x) f(x)\, dx = \int \left( E[D \mid X = x, Z = 1] - E[D \mid X = x, Z = 0] \right) f(x)\, dx$ in Eq. (7) yields

$$\gamma = \frac{\int \left( E[Y \mid X = x, Z = 1] - E[Y \mid X = x, Z = 0] \right) f(x)\, dx}{\int \left( E[D \mid X = x, Z = 1] - E[D \mid X = x, Z = 0] \right) f(x)\, dx} \tag{8}$$

which is nonparametrically identified.
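The step from Eq. (5) to Eq. (8) can be checked numerically when $X$ is discrete, so that the integrals become sums. The sketch below is only an illustration: the three-point distribution of $X$, the first-stage differences, and the reduced-form differences are hypothetical numbers, not estimates from the chapter.

```python
import numpy as np

# Hypothetical population quantities for a covariate X with three values.
f = np.array([0.5, 0.3, 0.2])                    # Pr(X = x)
first_stage = np.array([0.10, 0.20, 0.40])       # E[D|x,Z=1] - E[D|x,Z=0] = Pr(tau = c | x)
reduced_form = np.array([-0.05, -0.30, -0.20])   # E[Y|x,Z=1] - E[Y|x,Z=0]

gamma_x = reduced_form / first_stage             # Eq. (4), one value per x
pr_complier = np.sum(first_stage * f)            # Pr(tau = c)
weights = first_stage * f / pr_complier          # Eq. (6): distribution of x among compliers
gamma_eq5 = np.sum(gamma_x * weights)            # Eq. (5), complier-weighted average
gamma_eq8 = np.sum(reduced_form * f) / np.sum(first_stage * f)  # Eq. (8)

print(gamma_eq5, gamma_eq8)
```

The two numbers agree because the first-stage term in the weights cancels the denominator of $\gamma(x)$, which is why Eq. (8) requires only conditional means and the marginal distribution of $X$.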
Define the following conditional mean functions:

$$m_1(x) = E[Y \mid X = x, Z = 1], \qquad m_0(x) = E[Y \mid X = x, Z = 0]$$
$$\mu_1(x) = E[D \mid X = x, Z = 1], \qquad \mu_0(x) = E[D \mid X = x, Z = 0]$$

and let $\hat{m}_1(x)$, $\hat{m}_0(x)$, $\hat{\mu}_1(x)$, and $\hat{\mu}_0(x)$ be their corresponding nonparametric conditional mean estimators. Then, a suitable estimator for Eq. (8) is

$$\hat{\gamma} = \frac{\sum_{i=1}^{n} \left[ \hat{m}_1(X_i) - \hat{m}_0(X_i) \right]}{\sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]} \tag{9}$$

where $n$ is the sample size. This estimator uses $X_i$ to construct $\hat{m}_1(x)$, $\hat{m}_0(x)$, $\hat{\mu}_1(x)$, and $\hat{\mu}_0(x)$. However, following Frölich (2007), it is reasonable to use the observed values $Y_i$ and $D_i$ as estimates of $E[Y_i \mid X = X_i, Z = z]$ and $E[D_i \mid X = X_i, Z = z]$, respectively, whenever $Z_i = z$. The nonparametric c-LATE estimate, $\hat{\gamma}$, now becomes

$$\hat{\gamma} = \frac{\sum_{i: Z_i = 1} \left[ Y_i - \hat{m}_0(X_i) \right] - \sum_{i: Z_i = 0} \left[ Y_i - \hat{m}_1(X_i) \right]}{\sum_{i: Z_i = 1} \left[ D_i - \hat{\mu}_0(X_i) \right] - \sum_{i: Z_i = 0} \left[ D_i - \hat{\mu}_1(X_i) \right]} \tag{10}$$
which is equivalent to the ratio of two matching estimators (see Heckman, LaLonde, & Smith, 1999; Smith & Todd, 2005 for an introduction to matching estimators). Eq. (10) may be estimated using any nonparametric estimator, provided that the estimator satisfies the assumptions given in Frölich (2007), under which the estimator $\hat{\gamma}$ is $\sqrt{n}$-consistent and asymptotically normal. While power series, polynomial series, splines, wavelets, neural networks, and sieves are all viable nonparametric estimation methods, we utilize recently developed Generalized Kernel Estimation. The conditions for $\sqrt{n}$-consistency and asymptotic normality, laid out in Frölich (2007), are satisfied with Generalized Kernel Estimation techniques, and this estimator is only a slight variant of the popular and widely used kernel methods, which provides the impetus for our selection of it for this application.
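To make the bookkeeping in Eqs. (9) and (10) concrete, the sketch below computes both on simulated data. It is an illustration under stated assumptions rather than the chapter's implementation: the conditional means are fitted by simple cell averages over one discrete covariate (a stand-in for the kernel estimator), and the data generating process, function names, and parameter values are all hypothetical.

```python
import numpy as np


def cell_means(values, x, z, z_value):
    """Stand-in for a nonparametric fit: mean of `values` among observations
    with Z == z_value and the same discrete x, evaluated at every X_i.
    Assumes every (x, z) cell is non-empty."""
    fitted = np.empty(len(x))
    for level in np.unique(x):
        fitted[x == level] = values[(x == level) & (z == z_value)].mean()
    return fitted


def c_late_eq9(m1, m0, mu1, mu0):
    """Eq. (9): fitted values used for every observation."""
    return np.sum(m1 - m0) / np.sum(mu1 - mu0)


def c_late_eq10(y, d, z, m1, m0, mu1, mu0):
    """Eq. (10): observed Y_i and D_i replace the fit in the observed arm."""
    on, off = z == 1, z == 0
    num = np.sum(y[on] - m0[on]) - np.sum(y[off] - m1[off])
    den = np.sum(d[on] - mu0[on]) - np.sum(d[off] - mu1[off])
    return num / den


# Simulated data (hypothetical): the treatment effect is -1 for everyone.
rng = np.random.default_rng(0)
n = 4000
x = rng.integers(0, 3, size=n)
z = rng.integers(0, 2, size=n)
u = rng.uniform(size=n)                      # latent type; u < 0.3 are never-takers
complier = (u >= 0.3) & (u < 0.6 + 0.1 * x)
always = u >= 0.6 + 0.1 * x
d = (always | (complier & (z == 1))).astype(float)
y = 0.5 * x - 1.0 * d + 0.8 * always + rng.normal(scale=0.5, size=n)

m1, m0 = cell_means(y, x, z, 1), cell_means(y, x, z, 0)
mu1, mu0 = cell_means(d, x, z, 1), cell_means(d, x, z, 0)
gamma9 = c_late_eq9(m1, m0, mu1, mu0)
gamma10 = c_late_eq10(y, d, z, m1, m0, mu1, mu0)
print(gamma9, gamma10)
```

With cell averages the two formulas coincide exactly, because the fitted values sum to the observed outcomes within each cell; with a kernel smoother they generally differ in finite samples.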
2.2. Generalized Kernel Estimation

Given the type of data typically encountered in applications of program evaluation or general treatments, we utilize Generalized Kernel Estimation – developed in Li and Racine (2004) and Racine and Li (2004) – to estimate Eq. (10). The benefit of this approach over typical kernel methods is that it allows for smoothing of both continuous and categorical (ordered and unordered) variables. Our objective is to estimate $m_z(x) = E[Y \mid X = x, Z = z]$ and $\mu_z(x) = E[D \mid X = x, Z = z]$ nonparametrically for all $z$. Given that $Z$ is binary, we need to estimate four models: $m_z(x)$ and $\mu_z(x)$, $z = 0, 1$. Each model is estimated using only the appropriate subsample. In other words, $\hat{m}_0(x)$ and $\hat{\mu}_0(x)$ are estimated using the subsample where $Z = 0$; similarly for $\hat{m}_1(x)$ and $\hat{\mu}_1(x)$. Once the four models are estimated, we are able to predict the missing counterfactual for each observation. Specifically, we utilize $\hat{m}_0(x)$ and $\hat{\mu}_0(x)$ to estimate the missing counterfactuals, $E[Y_i \mid X = x_i, z_i = 0]$ and $E[D_i \mid X = x_i, z_i = 0]$, respectively, for the subsample where $Z = 1$; for the subsample where $Z = 0$, we utilize $\hat{m}_1(x)$ and $\hat{\mu}_1(x)$. This yields an estimate of the expected change from moving from $Z = 1$ to $Z = 0$, and vice versa. To obtain each of these estimates, we first use a Nadaraya–Watson (Local-Constant Least Squares) type estimator, which estimates the conditional expectation of a variable, and then predict the missing counterfactuals. Formally,

$$\hat{m}_0(x_i) = \hat{E}[Y \mid X = x_i, Z = 0] = \frac{\sum_{j: Z_j = 0} Y_j\, K((x_j - x_i)/\lambda)}{\sum_{j: Z_j = 0} K((x_j - x_i)/\lambda)} \tag{11}$$

where $K(\cdot)$ is the kernel function and $\lambda$ is the bandwidth (discussed below). We then evaluate $\hat{m}_0(x_i)$ for each observation $i$, $i = 1, 2, \ldots, n_1$, where $n_1$ is the number of observations with $Z = 1$. Similarly,

$$\hat{m}_1(x_i) = \hat{E}[Y \mid X = x_i, Z = 1] = \frac{\sum_{j: Z_j = 1} Y_j\, K((x_j - x_i)/\lambda)}{\sum_{j: Z_j = 1} K((x_j - x_i)/\lambda)} \tag{12}$$

and we then evaluate $\hat{m}_1(x_i)$ for each observation $i$, $i = 1, 2, \ldots, n_0$, where $n_0$ is the number of observations with $Z = 0$. The extensions for $\mu_0(x)$ and $\mu_1(x)$ are trivial and are given by

$$\hat{\mu}_0(x_i) = \hat{E}[D \mid X = x_i, Z = 0] = \frac{\sum_{j: Z_j = 0} D_j\, K((x_j - x_i)/\lambda)}{\sum_{j: Z_j = 0} K((x_j - x_i)/\lambda)} \tag{13}$$

and

$$\hat{\mu}_1(x_i) = \hat{E}[D \mid X = x_i, Z = 1] = \frac{\sum_{j: Z_j = 1} D_j\, K((x_j - x_i)/\lambda)}{\sum_{j: Z_j = 1} K((x_j - x_i)/\lambda)} \tag{14}$$

respectively.

We purposely skipped the description of the kernel function, $K(\cdot)$, as this is what separates the Generalized Kernel Estimation procedure from standard kernel methods. In the standard model, the kernel function is designed for continuous variables and satisfies several properties. First, for large values of $\lambda^{-1}(x_j - x_i)$, $K((x_j - x_i)/\lambda)$ should be small; small weights are assigned to data points that are far away from $x_i$. Second, since the bandwidth $\lambda$ should go to zero as the sample size grows toward infinity, $K((x_j - x_i)/\lambda)$ should go to zero for all $x_j \neq x_i$. Finally, the kernel function must integrate to unity. Given these requirements, kernels are frequently chosen to be well-known density functions, with the standard normal being one of the most popular. When the data are discrete, however, a kernel function that is designed specifically for discrete data must be used in order to smooth. Different kernels are used for unordered and ordered discrete variables, where unordered kernels treat deviations from $x_i$ equally, whereas ordered kernels weight the values differently depending on the distance between $x_j$ and $x_i$. For instance, a time trend is an example of an ordered discrete variable, whereas a variable identifying province of residence is an example of an unordered discrete variable. Even though the kernels used to smooth discrete data are different from those used for continuous variables, one benefit of these kernels is that, by identifying a variable as discrete, the rate of convergence of the nonparametric regression estimator is not affected by their inclusion. This is one of the novelties of Generalized Kernel Estimation. Data sets that, on the surface, do not appear well-suited to nonparametric estimation, due to inclusion of many covariates, can be accommodated with an appropriate recognition of the nature of the data type.
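For a single continuous covariate, the local-constant estimator of Eq. (11) and the role of the bandwidth can be sketched as follows. This is a minimal illustration with a standard normal kernel and a hand-picked bandwidth; the function names and the toy data are assumptions, and the bandwidth here is not chosen by any data-driven rule.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, which integrates to one."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def nw_predict(x_train, y_train, x_eval, bandwidth):
    """Local-constant estimate of E[Y | X = x] at each point in x_eval."""
    u = (x_train[None, :] - x_eval[:, None]) / bandwidth
    weights = gaussian_kernel(u)
    return (weights @ y_train) / weights.sum(axis=1)


# Toy data, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 0.5, 1.5])
y = np.array([1.0, 3.0, 5.0, 2.0, 4.0])
z = np.array([0, 0, 0, 1, 1])

# Eq. (11): fit on the Z = 0 subsample, evaluate at the Z = 1 observations.
m0_at_z1 = nw_predict(x[z == 0], y[z == 0], x[z == 1], bandwidth=0.5)

# A very large bandwidth smooths everything toward the subsample mean;
# a very small one reproduces the outcome of the nearest training point.
flat = nw_predict(x[z == 0], y[z == 0], x[z == 0], bandwidth=1e6)
sharp = nw_predict(x[z == 0], y[z == 0], x[z == 0], bandwidth=0.05)
print(m0_at_z1, flat, sharp)
```

The two extreme bandwidths show the weighting behaviour described above: distant points receive negligible weight when the bandwidth is small, and all points receive nearly equal weight when it is large.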
For a data set containing continuous, unordered, and ordered discrete variables, $K(\cdot)$ can be constructed by generalizing the standard product kernel (Pagan & Ullah, 1999, p. 59). Before proceeding, however, let us clarify our terminology. For a data set with $q$ covariates, we arrange the covariates so that the first $q_u$ are unordered, the next $q_o$ are ordered, and the last $q_c$ are continuous, and $q = q_u + q_o + q_c$. Similarly, our bandwidth vector will be arranged so that $\lambda = [\lambda^u_1, \ldots, \lambda^u_{q_u}, \lambda^o_1, \ldots, \lambda^o_{q_o}, \lambda^c_1, \ldots, \lambda^c_{q_c}]$. Following this notation, our generalized product kernel is:

$$K\!\left(\frac{x_j - x_i}{\lambda}\right) = \prod_{s=1}^{q_u} l^u(x^u_{sj}, x^u_{si}, \lambda^u_s) \prod_{s=q_u+1}^{q_u+q_o} l^o(x^o_{sj}, x^o_{si}, \lambda^o_s) \prod_{s=q_u+q_o+1}^{q} \frac{1}{\lambda^c_s}\, l^c\!\left(\frac{x^c_{sj} - x^c_{si}}{\lambda^c_s}\right) \tag{15}$$

where $l^c$ is the standard normal kernel function with bandwidth $\lambda^c_s$ associated with the $s$th component of $x^c$, $l^u$ is a variation of Aitchison and Aitken’s (1976) kernel function with bandwidth $\lambda^u_s$ associated with the $s$th component of $x^u$, and $l^o$ is the Wang and Van Ryzin (1981) kernel function with window width $\lambda^o_s$ associated with the $s$th component of $x^o$. Formally, the kernel function for continuous variables is given by

$$l^c\!\left(\frac{x^c_{sj} - x^c_{si}}{\lambda^c_s}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x^c_{sj} - x^c_{si}}{\lambda^c_s} \right)^{2} \right) \tag{16}$$

the kernel function for unordered categorical variables is given by

$$l^u(x^u_{si}, x^u_{sj}, \lambda^u_s) = \begin{cases} 1 - \lambda^u_s & \text{if } x^u_{si} = x^u_{sj} \\ \dfrac{\lambda^u_s}{d_s - 1} & \text{otherwise} \end{cases} \tag{17}$$

where $d_s$ is the number of unique values $x^u_s$ can take (e.g., if $x^u_s$ is binary, $d_s = 2$); the kernel function for ordered categorical variables is given by

$$l^o(x^o_{si}, x^o_{sj}, \lambda^o_s) = \begin{cases} 1 - \lambda^o_s & \text{if } x^o_{si} = x^o_{sj} \\ \dfrac{1 - \lambda^o_s}{2} (\lambda^o_s)^{|x^o_{si} - x^o_{sj}|} & \text{otherwise} \end{cases} \tag{18}$$

After choosing the kernel function, the final issue to be resolved is the choice of bandwidths. Because it is believed that the choice of the continuous kernel function matters little in the estimation of the conditional mean (see Härdle, 1990), selection of the bandwidths is considered to be the most salient factor when performing nonparametric estimation. As indicated above, the bandwidth controls the amount by which the data are smoothed. Large values of $\lambda$ will lead to large amounts of smoothing, resulting in low variance, but high bias. Small values of $\lambda$, on the other hand, will lead to less smoothing, resulting in high variance, but low bias. This trade-off is well known in applied nonparametric econometrics, and the ‘solution’ is most often to resort to automated determination procedures to estimate the bandwidths. The use of such procedures to select the bandwidths allows the researcher to draw upon past Monte Carlo investigations assessing the relative performance of various methods and opt for a method that is in line with the researcher’s view on balancing bias and efficiency (see, e.g., Jones, Marron, & Sheather, 1996). Although there exist many selection methods, we employ the Hurvich, Simonoff, and Tsai (1998) Expected Kullback–Leibler (AICc) criterion. This method chooses smoothing parameters using an improved version of a criterion based on the Akaike Information Criterion (AICc). AICc has been shown to perform well in small samples and avoids the tendency to undersmooth as often happens with other approaches such as Least-Squares Cross-Validation.2 The basic idea behind the procedure is that we want to choose the bandwidths in order to minimize a particular objective function. Specifically, we wish to minimize

$$\mathrm{AIC}_c(\lambda^u, \lambda^o, \lambda^c) = \log(\hat{\sigma}^2) + \frac{1 + \mathrm{tr}(H)/n_z}{1 - [\mathrm{tr}(H) + 2]/n_z}$$

where

$$\hat{\sigma}^2 = \frac{1}{n_z} \sum_{j: Z_j = z} \left( Y_j - \hat{m}_z(x_j) \right)^2 = \frac{1}{n_z}\, Y'(I - H)'(I - H)Y$$

or

$$\hat{\sigma}^2 = \frac{1}{n_z} \sum_{j: Z_j = z} \left( D_j - \hat{\mu}_z(x_j) \right)^2 = \frac{1}{n_z}\, D'(I - H)'(I - H)D \tag{19}$$

and $I$ is the identity matrix. Here $H$ is the linear smoother matrix composed of the kernel weights,

$$H = \begin{bmatrix}
\dfrac{K((x_1 - x_1)/\lambda)}{\sum_{i=1}^{n} K((x_1 - x_i)/\lambda)} & \dfrac{K((x_1 - x_2)/\lambda)}{\sum_{i=1}^{n} K((x_1 - x_i)/\lambda)} & \cdots & \dfrac{K((x_1 - x_n)/\lambda)}{\sum_{i=1}^{n} K((x_1 - x_i)/\lambda)} \\
\dfrac{K((x_2 - x_1)/\lambda)}{\sum_{i=1}^{n} K((x_2 - x_i)/\lambda)} & \dfrac{K((x_2 - x_2)/\lambda)}{\sum_{i=1}^{n} K((x_2 - x_i)/\lambda)} & \cdots & \dfrac{K((x_2 - x_n)/\lambda)}{\sum_{i=1}^{n} K((x_2 - x_i)/\lambda)} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{K((x_n - x_1)/\lambda)}{\sum_{i=1}^{n} K((x_n - x_i)/\lambda)} & \dfrac{K((x_n - x_2)/\lambda)}{\sum_{i=1}^{n} K((x_n - x_i)/\lambda)} & \cdots & \dfrac{K((x_n - x_n)/\lambda)}{\sum_{i=1}^{n} K((x_n - x_i)/\lambda)}
\end{bmatrix}$$
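The generalized product kernel of Eqs. (15)–(18) and the smoother matrix $H$ built from it can be sketched as below for the simplest mixed case of one unordered, one ordered, and one continuous covariate. The function names, bandwidth values, and toy data are assumptions for illustration only; this is not the authors' code.

```python
import numpy as np


def unordered_kernel(xi, xj, lam, d):
    """Eq. (17): 1 - lam if the categories match, lam / (d - 1) otherwise."""
    return np.where(xi == xj, 1.0 - lam, lam / (d - 1))


def ordered_kernel(xi, xj, lam):
    """Eq. (18): weight decays geometrically in the distance |xi - xj|."""
    return np.where(xi == xj, 1.0 - lam,
                    0.5 * (1.0 - lam) * lam ** np.abs(xi - xj))


def continuous_kernel(xi, xj, lam):
    """Eq. (16) scaled by 1 / lam, as it enters the product in Eq. (15)."""
    u = (xj - xi) / lam
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * lam)


def smoother_matrix(xu, xo, xc, lam_u, lam_o, lam_c, d_u):
    """Row-normalized product-kernel weights: entry (i, j) is the weight of
    observation j in the local-constant fit at observation i."""
    k = (unordered_kernel(xu[:, None], xu[None, :], lam_u, d_u)
         * ordered_kernel(xo[:, None], xo[None, :], lam_o)
         * continuous_kernel(xc[:, None], xc[None, :], lam_c))
    return k / k.sum(axis=1, keepdims=True)


# Toy covariates, invented for illustration.
xu = np.array([0, 1, 2, 0, 1])              # unordered, d_u = 3 categories
xo = np.array([1, 2, 3, 2, 1])              # ordered
xc = np.array([0.10, 0.40, 0.35, 0.80, 0.50])

H = smoother_matrix(xu, xo, xc, lam_u=0.2, lam_o=0.3, lam_c=0.25, d_u=3)

# At lam_u = (d_u - 1) / d_u the unordered kernel is constant, so the
# unordered covariate no longer influences the weights.
H_max = smoother_matrix(xu, xo, xc, lam_u=2 / 3, lam_o=0.3, lam_c=0.25, d_u=3)
H_drop = smoother_matrix(np.zeros_like(xu), xo, xc, lam_u=2 / 3, lam_o=0.3,
                         lam_c=0.25, d_u=3)
print(np.allclose(H_max, H_drop))
```

Fitted values are then `H @ y`, which is the matrix form of the Nadaraya–Watson estimator.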
A nice feature of this cross-validation procedure is that it does not require the use of a leave-one-out estimator. While this offers no additional gain in computation time – there are still $n^2$ calculations to be performed – the intuitive appeal of this method is attractive. The criterion is composed of two distinct parts, one that rewards fit, $\log(\hat{\sigma}^2)$, and another that penalizes the effective number of parameters through $\mathrm{tr}(H)$. The objective is not simply to interpolate the data by connecting all of the points together. Rather, we are concerned with how our estimate of the function predicts the counterfactuals. A model that fits our data well may not be the best model for constructing counterfactuals, and so the AICc criterion penalizes bandwidths that interpolate the data rather than capture the underlying population data generating process. The set of bandwidths that minimizes the AICc function is utilized in the final estimation. Obviously, as the sample size grows and the number of regressors increases, computation time increases dramatically. However, it is highly recommended that one use a bandwidth selection procedure as opposed to a rule of thumb selection, especially in the presence of discrete data, as no rule of thumb selection criteria exist. As an aside, we note that an even simpler bandwidth selection procedure, the ‘ocular’ method, is not appropriate once the number of covariates is larger than two. As the number of regressors exceeds two, visual methods to investigate the fit of the model are cumbersome and uninstructive. With a large dimension for the number of regressors, it is suggested that cross-validation techniques be used as opposed to either ocular or rule of thumb methods.
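The AICc criterion and its minimization can be sketched for the simplest case of one continuous regressor and a Gaussian kernel. The grid search, the simulated data, and the function names below are assumptions chosen for illustration; the chapter does not state which optimizer was used, and a numerical optimizer over all bandwidths would be the more usual choice with several covariates.

```python
import numpy as np


def nw_smoother(x, bandwidth):
    """Row-normalized Gaussian kernel weights (the matrix H)."""
    u = (x[:, None] - x[None, :]) / bandwidth
    k = np.exp(-0.5 * u**2)
    return k / k.sum(axis=1, keepdims=True)


def aicc(y, H):
    """log(sigma^2) + (1 + tr(H)/n) / (1 - (tr(H) + 2)/n)."""
    n = len(y)
    resid = y - H @ y
    sigma2 = resid @ resid / n
    tr = np.trace(H)
    return np.log(sigma2) + (1.0 + tr / n) / (1.0 - (tr + 2.0) / n)


# Simulated data, hypothetical: a smooth signal plus noise.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=200)

grid = np.geomspace(0.01, 1.0, 25)          # candidate bandwidths
scores = np.array([aicc(y, nw_smoother(x, h)) for h in grid])
best = grid[np.argmin(scores)]
print(best)
```

Very small bandwidths lower the residual variance but inflate `tr(H)`, while very large ones do the reverse, so the minimizer lies between the two extremes of the grid.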
Once we have estimated the four separate sets of bandwidths, corresponding to the four conditional expectation estimates, we can compute $\hat{m}_0(x)$, $\hat{m}_1(x)$, $\hat{\mu}_0(x)$, and $\hat{\mu}_1(x)$ and obtain our estimate $\hat{\gamma}$ in Eq. (10). The standard error of $\hat{\gamma}$ is obtained via an i.i.d. bootstrap, re-sampling observations with replacement (199 repetitions). The use of 199 bootstrap replications, while arbitrary, is large enough to reduce the effect of outliers in the re-sampling process, but small enough that the computational burden imposed on the researcher is held to a minimum.
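An i.i.d. bootstrap standard error of the kind described above can be sketched as follows. To keep the example short and self-contained, the resampled statistic is the simple Wald ratio rather than the kernel-based estimator of Eq. (10); that substitution, the simulated data, and the function names are assumptions. The text does not say whether bandwidths are re-selected within each replication, so the sketch takes no position on that.

```python
import numpy as np


def wald(y, d, z):
    """Stand-in statistic; in the chapter the statistic would be Eq. (10)."""
    return ((y[z == 1].mean() - y[z == 0].mean())
            / (d[z == 1].mean() - d[z == 0].mean()))


def bootstrap_se(statistic, arrays, n_boot=199, seed=0):
    """Standard deviation of the statistic across resamples of whole rows."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # rows drawn with replacement
        draws[b] = statistic(*(a[idx] for a in arrays))
    return draws.std(ddof=1)


# Simulated data, hypothetical, with a strong first stage.
rng = np.random.default_rng(2)
n = 500
z = rng.integers(0, 2, size=n)
d = (rng.uniform(size=n) < 0.2 + 0.5 * z).astype(float)
y = -1.0 * d + rng.normal(size=n)

se = bootstrap_se(wald, (y, d, z))
print(se)
```

Resampling whole rows keeps each observation's outcome, treatment, and instrument together, which is what re-sampling observations with replacement requires.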
3. APPLICATION TO FERTILITY AND CHILD HEALTH

3.1. Motivation

A long-standing issue in labor and development economics is the so-called quantity–quality trade-off. According to the classic theory, households make interdependent choices regarding the number of children, as well as the level of human capital investments per child. An implication of this theory is the existence of a negative relationship between child outcomes (quality) and the number of children in a household (quantity) (Becker & Tomes, 1976; Becker & Lewis, 1973; Becker, 1960).3 The trade-off is typically modeled as arising from parental preferences for equal levels of quality across children combined with a binding budget constraint (Rosenzweig & Wolpin, 1980). Empirical tests of this implication concentrate on estimating demand equations for child-specific outcomes (e.g., the level of schooling), where the number of children is one potential influence on demand. Such studies document a negative relationship between sibship size and human capital investments (for recent examples, see Conley & Glauber, 2005; Glick, Marini, & Sahn, 2005; Lee, 2004), although a few find either no effect (e.g., Black, Devereux, & Salvanes, 2005) or even a positive effect (e.g., Qian, 2005). Empirical assessments of the quantity–quality trade-off have focused mainly on education as a measure of child quality. Here, we use health status for several reasons. First, Thomas and Frankenberg (2002) state that adult stature is largely determined during the fetal and early childhood periods, and Thomas, Strauss, and Henriques (1990, 1991) note the direct relationship between child anthropometric measures and the probability of survival as well as skill development. Second, researchers have documented a positive association between adult health and labor market outcomes at both the microeconomic and macroeconomic levels. Specifically, many
studies have found that there is a positive impact of height on individual hourly earnings (Strauss & Thomas, 1998, provide an excellent review), Fogel (1994) documents the parallel historical increases in height and economic growth, Bloom, Canning, and Sevilla (2001) document a causal connection between life expectancy and economic growth, and López-Casasnovas, Rivera, and Currais (2005) offer a detailed theoretical and empirical account of the linkages between health and economic development (see also Wolpin, 1997). Given the importance of children’s health, a small literature has developed investigating its determinants. In many of these studies, household size enters the empirical analysis as an important control, although the estimated relationship is not of primary interest and the issue of causation versus correlation is often ignored. Specifically, unobservable household preferences may lead to correlation between the quantity and quality of children, and/or households with a greater health ‘endowment’ may opt for larger households since the shadow price of children will be lower. While such a correlation is consonant with Becker’s original treatment of the issue, our focus herein is on whether the trade-off represents a causal relationship, rather than arising from unobserved preferences or endowments. Two recent studies that do address the issue of causation are Glick et al. (2005) and Angrist, Lavy, and Schlosser (2005). Glick et al. (2005) utilize data on twins to isolate the causal effect of fertility on child health and school enrollment using Romanian data, finding sizeable negative effects that increase in magnitude after accounting for the endogeneity of sibship size. Angrist et al. (2005) use Israeli data on twins and the gender composition of children to estimate the causal impact of fertility on a variety of children’s outcomes as adults (e.g., completed education, labor market outcomes, and own marital and fertility patterns), finding little impact. Here, we use a binary instrument based on the gender composition of children in the household, as utilized in, for example, Butcher and Case (1994), Angrist and Evans (1998), Cruces and Galiani (2004), Angrist et al. (2005), and Conley and Glauber (2005). Identification of the effect of household size via the gender composition of children appears well-suited for the Indonesian context. Specifically, existing evidence suggests little reason to worry about gender bias in Indonesia, although – as our analysis below will show – there does exist gender preference in Indonesia (in particular, a preference for children of both genders). The lack of gender bias is crucial for the identification strategy, since a bias that results in differential treatment across genders would compromise the exogeneity of child gender. For instance, Kevane and
Levine (2001) find no evidence of ‘missing girls’ in Indonesia (due to selective abortion, infanticide, or neglect), and that historical gaps in educational attainment, particularly at the primary school level, have narrowed, if not disappeared. In addition, Levine and Ames (2003) find that even during the Asian financial crisis in the late 1990s, girls did not suffer disproportionately relative to boys, and may have actually benefited, in terms of a wide range of measures including school enrollment, immunizations, and mortality.
3.2. Data

The data are obtained from the IFLS, a large-scale longitudinal survey conducted jointly by RAND and the Center for Population and Policy Studies (CPPS) at the University of Gadjah Mada. The IFLS provides a rich data source based on a sample of households representing about 83% of the Indonesian population living in 13 of the nation’s 26 provinces in 1993. Three waves exist presently: 1993, 1997, and 2000 (see Strauss et al., 2004a, 2004b, for a complete description of the surveys). We utilize the 2000 wave, which is the most current survey containing physical assessments of children, and form a sample of roughly 5,300 children aged 10 and under, with at least one identifiable birth parent in the survey, and who come from a household with at least two children. We assess the empirical relevance of the trade-off using weight as the basis of our measure of health status, $Y$. However, since children are still growing, comparing anthropometric data from children of different ages is complicated (Vidmar, Carlin, Hesketh, & Cole, 2004). Thus, we standardize the raw data on weight to the reference population for the child’s age and sex utilizing the 1990 British Growth Reference data. This is a frequently used measure of health, and captures persistent long-term malnutrition as well as short-term health shocks.4 The quantity of children is defined as the number of children ‘belonging’ to a given set of parents. Assuming that couples take into account their spouses’ fertility history, this definition includes any children from previous marriages. For example, if one (or both) parents were previously married and entered the current marriage with children from the previous union, then these children are used when computing the number of siblings. Our definition excludes children ‘belonging’ to other couples who may share a common residence. Finally, our definition includes children ‘belonging’ to the couple, but not currently residing in the household if the child resided in
the household during the 1993 or 1997 wave.5 Once the number and identity of siblings are established, we create the binary treatment variable, D, equal to one if the household has more than two children and zero otherwise (recall, the sample is limited to households with at least two children). We refer to this as the MoreThan2 treatment. As an alternative, we also define a second treatment, Three, equal to one if the household has three children and zero if the household has only two children. Thus, the second treatment limits the sample to only households with either two or three children. To estimate g, we define Z based on the gender of children; Z is equal to one if the first two children are of the same sex (zero otherwise). We refer to this as the SameSex2 instrument. Again, as an alternative, we define a second instrument, Daughter2, equal to one if the first two children are both girls (zero otherwise). The use of this second instrument is interesting in two respects. First, because the c-LATE is specific to the subpopulation of compliers with the particular instrument utilized, assessing differences across different instruments provides some indication as to the heterogeneity of the treatment effects. Second, assessing differences in the first-stage explanatory power between the two instruments provides some indication into the strength of gender preferences in Indonesia. Specifically, if there are strong preferences for sons, one might expect the Daughter2 instrument to provide greater explanatory power in the first-stage. If this is the case, one might be concerned – despite the previous studies cited above – that gender may not be exogenous in Indonesia (due to such factors as selective abortion, infanticide, or neglect). Finally, utilization of all children in a household may be problematic, as argued in Angrist et al. (2005). 
First, using all children pools children exposed to the natural experiment being exploited by the identification strategy together with those not explicitly exposed to the natural experiment. Second, if the number of children is indeed endogenous, then higher-order children (the third child and above) represent a non-random sample (see also Pitt, 1997). Thus, we perform two additional sensitivity analyses. Specifically, in addition to performing the analysis on the full sample of roughly 5,300 children, we also utilize two subsamples. First, we utilize only the first and second born children from each household, keeping in mind that these are the children exposed to the natural experiment. Second, we analyze only first born children, since Angrist et al. (2005) also view second born children as potentially problematic since it is the gender of these children that creates the experiment. Table 1 provides a cross-tabulation of the instruments and treatments for each sample. Note, we do not utilize the treatment Three in the subsamples of first born, and first and second born, children since this entails dropping relatively few observations (55 and 209, respectively); as a result, the findings should not change qualitatively.

Table 1. Cross-Tabulation of Treatment Variable and Instrument.

Sample                  Treatment (D)        SameSex2 Instrument (Z)          Daughter2 Instrument (Z)
                                             Z=0      Z=1      Total          Z=0      Z=1      Total
1st born only           MoreThan2 = 0        434      414      848            657      191      848
                        MoreThan2 = 1        115      133      248            182      66       248
                        Total                549      547      1,096          839      257      1,096
                        Pr[D=Z]              0.517 [p=0.25]                   0.660 [p=0.00]
1st and 2nd born only   MoreThan2 = 0        1,098    1,010    2,108          1,634    474      2,108
                        MoreThan2 = 1        383      425      808            592      216      808
                        Total                1,481    1,435    2,916          2,226    690      2,916
                        Pr[D=Z]              0.522 [p=0.02]                   0.634 [p=0.00]
All children            MoreThan2 = 0        1,098    1,010    2,108          1,634    474      2,108
                        MoreThan2 = 1        1,534    1,770    3,304          2,437    867      3,304
                        Total                2,632    2,780    5,412          4,071    1,341    5,412
                        Pr[D=Z]              0.530 [p=0.00]                   0.462 [p=0.00]
All children            Three = 0            1,098    1,010    2,108          1,634    474      2,108
                        Three = 1            650      708      1,358          1,007    351      1,358
                        Total                1,748    1,718    3,466          2,641    825      3,466
                        Pr[D=Z]              0.521 [p=0.01]                   0.573 [p=0.00]

Note: p-values in brackets associated with the null Ho: Pr[D=Z]=0.5.

The cross-tabulations indicate that the combined probability that (D=0, Z=0) and (D=1, Z=1) is greater than 50% – and statistically different from 50% at conventional levels – in each sample with the exception of when we utilize: (i) first born children only, the MoreThan2 treatment, and the SameSex2 instrument (combined probability=0.517; p-value for the null that the combined probability is 0.5 is 0.25); and (ii) all children, the MoreThan2 treatment, and the Daughter2 instrument (combined probability=0.462; p-value for the null that the combined probability is 0.5 is 0.00). Moreover, in the sample of first born only children, the combined probability is 66.0% when using the MoreThan2 treatment and the Daughter2 instrument (p-value for the null that the combined probability is 0.5 is 0.00). Finally, comparing the Pr(D=1|Z=1) across the two
instruments for each sample indicates little substantive difference, although the probability is marginally higher in each sample using the Daughter2 instrument (first born only sample: 24.3% versus 25.7%; first and second born only: 29.6% versus 31.3%; all children sample, MoreThan2 treatment: 63.7% versus 64.7%; all children sample, Three treatment: 41.2% versus 42.5%). This is suggestive of little overt preference for sons. To obtain the c-LATE, we condition on several unordered and ordered discrete variables, as well as continuous variables. While the gender composition of children seems at first glance to be uncorrelated with all other potential determinants of child health, Angrist and Evans (1998) show that it may be correlated with the individual genders of the first two children (i.e., if s1 and s2 are binary variables indicating the gender of the first two children and are independent and identically distributed across children, then the MoreThan2 instrument is equal to s1 s2 þ ð1 s1 Þð1 s2 Þ, which is independent of s1 and s2 only if E[si]=0.5, i=1,2). Moreover, even in randomized experiments, controlling for important potentially confounding variables may be advisable to capture residual covariance, as well as the fact that randomization only balances the confounders in expectation (Imai & van Dyk, 2004). Thus, we condition on the following variables: Unordered Discrete: Gender of the first child, and province (North Sumatra, West Sumatra, South Sumatra, Lapung, Jakarta, West Java, Central Java, Yogyakarta, East Java, Bali, West Nusa Tenggara, South Kalimantan, and South Sulawesi); Ordered Discrete: Father’s education (less than primary school, junior high school, senior high school, university, and other), mother’s education, and age of the child in months; Continuous: Mother’s weight and father’s weight.6 The inclusion of parental weight is noteworthy. As noted in Thomas et al. 
(1990) and Thomas, Lavy, and Strauss (1996), parental health status, as reflected in anthropometric measures, may have a direct effect on child health through its impact on birth weight. In addition, parental measures may partially capture household unobservables, lending further credibility to the use of Z as a valid instrument. Summary statistics are available upon request.
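The dependence between a same-sex indicator and the gender of an individual child can be checked by direct enumeration. The sketch below is illustrative only: the function name and the probabilities tried are assumptions, not part of the chapter's code. It computes the exact covariance between $s_1 s_2 + (1 - s_1)(1 - s_2)$ and $s_1$ when both genders are independent Bernoulli draws with probability p.

```python
from fractions import Fraction
from itertools import product


def same_sex_covariance(p):
    """Exact Cov(Z, s1) for Z = s1*s2 + (1-s1)*(1-s2), s1 and s2 iid Bernoulli(p)."""
    e_z = e_s1 = e_z_s1 = Fraction(0)
    for s1, s2 in product((0, 1), repeat=2):
        # Probability of this gender pair under independence.
        prob = (p if s1 else 1 - p) * (p if s2 else 1 - p)
        z = s1 * s2 + (1 - s1) * (1 - s2)
        e_z += prob * z
        e_s1 += prob * s1
        e_z_s1 += prob * z * s1
    return e_z_s1 - e_z * e_s1


# Hypothetical probabilities: exactly one half, and slightly above one half.
for p in (Fraction(1, 2), Fraction(51, 100)):
    print(p, same_sex_covariance(p))
```

The covariance simplifies to p(1 - p)(2p - 1), so it vanishes only at p = 0.5, which is one way to see why the gender of the first child is a natural conditioning variable.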
3.3. Results

Prior to discussing the results, it is important to examine the bandwidths for the nonparametric estimates. The bandwidths, by affecting the degree of
smoothing, are not just a means to an end; they also provide some indication of how the dependent variable is affected by the regressors. For instance, when using the Nadaraya–Watson estimator, the kernel effectively smooths out a variable as the bandwidth approaches its upper bound; when the bandwidth attains the upper bound, the regressor does not impact the estimation results. For continuous regressors, the upper bound is infinity. For unordered discrete variables, the upper bound is given by $(d_s - 1)/d_s$, where $d_s$ is the number of values the discrete regressor can take. Finally, for ordered discrete variables, the upper bound is unity. Tables 2–4 present the bandwidths for the nonparametric models. Table 2 corresponds to the sample of only the first born child (sample size is 1,096); Table 3 corresponds to the sample of both first and second born children (sample size is 2,916); Table 4 corresponds to the full sample (sample size is 5,412). The bandwidths in Table 2 yield three salient findings. First, each bandwidth for mother’s weight exceeds 3.347E+07, regardless of whether
Table 2. Bandwidths for the Nonparametric c-LATE Estimator (1st Born Only).

Treatment: MoreThan2; Instrument: SameSex2
Variable              m0          m1          m0          m1          Upper Bound
Gender of 1st child   0.500       0.236       0.500       0.500       0.500
Province              0.564       0.923       0.923       0.923       0.923
Father’s education    0.640       0.591       1.000       0.826       1.000
Mother’s education    0.543       0.684       1.000       1.000       1.000
Age                   1.000       0.999       0.928       0.943       1.000
Father’s weight       3.991E+07   4.786       5.516E+07   2.406E+07   ∞
Mother’s weight       2.738E+08   3.498E+07   1.689E+08   3.347E+07   ∞

Treatment: MoreThan2; Instrument: Daughter2
Variable              m0          m1          m0          m1          Upper Bound
Province              0.520       0.923       0.923       0.923       0.923
Father’s education    0.663       1.000       1.000       0.512       1.000
Mother’s education    0.356       0.659       1.000       1.000       1.000
Age                   1.000       1.000       0.917       0.964       1.000
Father’s weight       9.138E+07   5.879       1.531E+08   59.538      ∞
Mother’s weight       1.579E+08   6.800E+07   6.669E+08   6.426E+07   ∞
Table 3. Bandwidths for the Nonparametric c-LATE Estimator (1st and 2nd Born).

Treatment: MoreThan2; Instrument: SameSex2
Variable              m0          m1          m0          m1          Upper Bound
Gender of 1st child   0.500       0.236       0.500       0.500       0.500
Province              0.730       0.898       0.410       0.736       0.923
Father’s education    0.613       0.545       1.000       0.731       1.000
Mother’s education    0.561       0.589       1.000       1.000       1.000
Age                   0.943       0.589       0.957       0.942       1.000
Father’s weight       1.461E+08   3.817E+08   1.144E+08   9.911E+07   ∞
Mother’s weight       2.538E+08   5.331E+08   8.554E+07   1.188E+08   ∞

Treatment: MoreThan2; Instrument: Daughter2
Variable              m0          m1          m0          m1          Upper Bound
Province              0.807       0.923       0.365       0.601       0.923
Father’s education    0.554       0.722       1.000       1.000       1.000
Mother’s education    0.479       1.000       1.000       1.000       1.000
Age                   0.886       0.892       0.947       0.962       1.000
Father’s weight       4.653E+07   8.400E+07   7.310E+07   1.323E+08   ∞
Mother’s weight       8.301E+07   5.340E+07   1.036E+08   1.398E+08   ∞
we utilize the SameSex2 or Daughter2 instrument. Since the bandwidths are close to their upper bound of infinity, this suggests that mother’s weight is not important in the prediction of either Y or D in any of the counterfactual regressions. Second, the age of the child may also not be relevant as the bandwidths are often close to or equal to unity. Finally, the gender of the first child, as well as father’s weight, appears to be important only in the estimation of m1; province is only relevant in the estimation of m0. The bandwidths in Table 3 reveal two primary differences. First, while mother’s weight remains most likely irrelevant, father’s weight may also be irrelevant, as each bandwidth is in excess of 4.653E+07. Second, there is some decrease in the bandwidths associated with age, particularly for m1 when using the SameSex2 instrument, and province, suggesting the relevance of these variables. There are only small changes for the remaining bandwidths. The bandwidths using the full sample (Table 4) indicate some different conclusions. Specifically, the bandwidths for many regressors now fall far
Table 4. Bandwidths for the Nonparametric c-LATE Estimator (All Children).

Treatment: MoreThan2; Instrument: SameSex2
Variable              m0          m1          m0          m1          Upper Bound
Gender of 1st child   0.500       0.386       0.199       0.104       0.500
Province              0.839       0.905       0.269       0.319       0.923
Father’s education    0.839       0.618       0.388       0.397       1.000
Mother’s education    0.452       0.556       0.093       0.008       1.000
Age                   0.848       0.811       1.000       1.000       1.000
Father’s weight       1.309E+08   1.050E+08   2.618E+08   2.888E+07   ∞
Mother’s weight       1.783E+08   7.449E+07   1.870E+08   8.697E+07   ∞

Treatment: MoreThan2; Instrument: Daughter2
Variable              m0          m1          m0          m1          Upper Bound
Province              0.818       0.921       0.266       0.386       0.923
Father’s education    0.608       1.000       0.523       0.463       1.000
Mother’s education    0.498       0.571       0.084       0.021       1.000
Age                   0.821       0.835       0.982       1.000       1.000
Father’s weight       8.644E+07   6.973E+07   1.171E+08   1.114E+08   ∞
Mother’s weight       8.077E+07   1.195E+08   2.080E+08   1.289E+08   ∞

Treatment: Three; Instrument: SameSex2
Variable              m0          m1          m0          m1          Upper Bound
Gender of 1st child   0.500       0.297       0.081       0.032       0.500
Province              0.778       0.909       0.435       0.442       0.923
Father’s education    0.785       0.636       0.290       0.343       1.000
Mother’s education    0.490       0.575       0.288       0.384       1.000
Age                   0.911       0.887       1.000       1.000       1.000
Father’s weight       7.452E+07   1.022E+08   2.785E+07   1.284E+07   ∞
Mother’s weight       8.199E+07   3.556E+08   1.296E+08   3.556E+08   ∞

Treatment: Three; Instrument: Daughter2
Variable              m0          m1          m0          m1          Upper Bound
Province              0.811       0.923       0.465       0.449       0.923
Father’s education    0.617       1.000       0.282       0.456       1.000
Mother’s education    0.502       0.553       0.259       0.259       1.000
Age                   0.860       0.915       0.993       1.000       1.000
Father’s weight       2.190E+08   7.354E+07   1.278E+08   1.410E+08   ∞
Mother’s weight       2.050E+08   1.472E+08   2.004E+08   1.295E+08   ∞
Table 5. Nonparametric and Parametric c-LATE Estimates with Standard Errors.

                                                  Wald Estimate                       Parametric (TSLS)                   Nonparametric
Sample                  Treatment   Instrument    LATE    SE      First-stage F-test  c-LATE  SE      First-stage F-test  c-LATE  SE
1st born only           MoreThan2   SameSex2      2.879   3.461   F=1.39; p=0.24      1.964   2.292   F=2.15; p=0.14      1.166   1.267
1st born only           MoreThan2   Daughter2     2.485   6.737   F=0.29; p=0.59      0.590   5.315   F=0.26; p=0.61      3.944   19.123
1st and 2nd born only   MoreThan2   SameSex2      1.432   1.399   F=4.94; p=0.03      0.724   1.069   F=8.83; p=0.00      0.561   0.570
1st and 2nd born only   MoreThan2   Daughter2     0.097   1.888   F=2.11; p=0.15      2.319   2.855   F=1.70; p=0.19      0.042   0.785
All children            MoreThan2   SameSex2      1.635   0.793   F=14.19; p=0.00     1.306   0.708   F=15.78; p=0.00     1.489   0.434
All children            MoreThan2   Daughter2     1.292   1.436   F=3.70; p=0.05      0.237   1.294   F=3.87; p=0.05      1.448   0.734
All children            Three       SameSex2      0.950   1.057   F=6.89; p=0.01      0.793   0.935   F=7.48; p=0.01      0.634   0.491
All children            Three       Daughter2     0.146   2.884   F=0.81; p=0.37      2.312   3.478   F=1.01; p=0.32      0.361   0.918

Note: TSLS, two-stage least squares; SE, standard error. F-test and corresponding p-value refer to the significance of the instrument in the first-stage regression. Standard errors for the Wald and parametric estimates are robust to arbitrary heteroskedasticity. Standard errors for the nonparametric estimates obtained via 199 bootstrap repetitions.
below their upper bounds. For example, the bandwidths for the gender of the first child indicate that it is relevant in all cases except for m0 using both instruments. Furthermore, age is relevant for m0 and m1 using both instruments and both treatments, and province is relevant in all cases except for m1 using both instruments and both treatments. However, some consistency remains with the previous subsamples. For instance, mother’s and father’s weight again appear to be irrelevant predictors of Y or D. The LATE estimates are presented in Table 5. For comparison, we present the unconditional Wald estimates, parametric TSLS estimates, as well as nonparametric estimates of the c-LATE. Utilizing the MoreThan2 treatment and the SameSex2 instrument, but excluding the control variables, we find a large and statistically significant adverse impact of fertility on weight-for-age when using the full sample (g=1.635, SE=0.793). This corresponds with Eq. (2). Moreover, we also find that the instrument is positive and statistically significant, as expected, in the first-stage regression (F=14.19, p=0.00). When we include control
variables, but impose the parametric functional form, we obtain a smaller and marginally statistically significant estimate (g=1.306, SE=0.708). Finally, the nonparametric c-LATE estimate, corresponding to Eq. (5), is comparable in magnitude, but highly statistically significant (g=1.489, SE=0.434). This represents an effect slightly more than one standard deviation. Thus, for the subpopulation of households whose fertility decisions are influenced by the SameSex2 instrument, we find strong evidence of the quantity–quality trade-off using the full sample when the treatment under consideration is MoreThan2. To assess whether the TSLS and nonparametric estimates are statistically different, one would ideally like to conduct a specification test. Given the newness of the estimator, such a test does not exist at present, but represents a fruitful avenue for future research. As a first pass, we attempted to perform a standard Durbin–Wu–Hausman test, given that the nonparametric estimate is always consistent, but inefficient if the parametric functional form is correct, and the TSLS estimate is only consistent if the functional form is correctly specified.7 However, the test statistic is negative, as often happens in finite samples.8 In light of this result, and the fact that both estimators are $\sqrt{n}$-consistent, we would advocate relying on the nonparametric estimate in this case (although, as stated previously, they do not differ much). Restricting the sample to either first born children or first and second born children, and maintaining the use of the MoreThan2 treatment and the SameSex2 instrument, yields statistically insignificant results in all models, consonant with Angrist et al. (2005). Moreover, there is evidence that the instrument is weak in both samples given the Stock, Wright, and Yogo (2002) rule of thumb of an F-statistic above 10. 
However, this test for weakness of the instrument hinges on the validity of the parametric functional form; developing tests for weak instruments in the nonparametric case is also a fruitful avenue for future work. Lastly, for completeness we also attempted Durbin–Wu–Hausman tests for each sample; both test statistics are again negative. Thus, results on the presence of the quantity–quality trade-off are sensitive to the sample of children analyzed. We next turn to the results using the MoreThan2 treatment and the Daughter2 instrument. Utilizing the traditional Wald estimator (excluding the control variables), we find a large but statistically insignificant adverse impact of fertility on weight-for-age when using the full sample (g=1.292, SE=1.436). As with SameSex2, we find that the instrument is positive and statistically significant in the first-stage regression (F=3.70, p=0.05). However, the relationship between the Daughter2 instrument and the
treatment is not as strong (in a statistical sense) as with the SameSex2 instrument, providing some further evidence against strong son preferences in Indonesia. When we include control variables, but impose the parametric functional form, we continue to obtain a statistically insignificant estimate (g=0.237, SE=1.294). Finally, the nonparametric c-LATE estimate is larger in magnitude and continues to be statistically significant (g=1.448, SE=0.734); the Durbin–Wu–Hausman test continues to yield a negative test statistic. Thus, when considering the MoreThan2 treatment and using the full sample, the c-LATEs across the subpopulations of compliers with the SameSex2 and Daughter2 instruments are nearly identical, yielding strong evidence of the quantity–quality trade-off. Restricting the sample to either first born children or first and second born children, and maintaining the use of the Daughter2 instrument, yields statistically insignificant results in all models, consonant with results using the SameSex2 instrument. In fact, the nonparametric point estimate is positive when using the sample of first and second born children. Moreover, there is also evidence Daughter2 is a weak instrument in both samples (conditional on the parametric functional form). Finally, the Durbin–Wu–Hausman test fails to reject the parametric TSLS model when using the first born only sample ($\chi^2_1 = 0.03$, p=0.14); the test statistic is negative in the sample of first and second born children. Thus, results on the presence of the quantity–quality trade-off are sensitive to the sample of children analyzed when using the SameSex2 instrument, as well as the Daughter2 instrument. The final set of results utilizes the sample of all children, but analyzes the Three treatment. Recall, we do not analyze the Three treatment using the subsample of only first born children or first and second born children as the samples do not differ much from those used to analyze the MoreThan2 treatment. 
Utilizing the traditional Wald estimator (excluding the control variables), we find statistically insignificant estimates using either instrument (SameSex2: g=0.950, SE=1.057; Daughter2: g=0.146, SE=2.884). In addition, while SameSex2 is highly statistically significant in the first-stage, Daughter2 is not. Again, this provides some evidence against strong son preferences in Indonesia. When we include control variables, but impose the parametric functional form, we continue to obtain statistically insignificant estimates (SameSex2: g=0.793, SE=0.935; Daughter2: g=2.312, SE=3.478), as well as a weak relationship between the treatment and the Daughter2 instrument. Finally, the nonparametric c-LATE estimates are also statistically insignificant in both cases (SameSex2: g=0.634, SE=0.491; Daughter2: g=0.361, SE=0.918); the Durbin–Wu–Hausman test yields
negative test statistics in both cases. In sum, then, the only statistically significant evidence of the quantity–quality trade-off arises when we utilize the nonparametric c-LATE estimator along with the health outcomes of higher parity children (those of birth order greater than three).
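The link between a bandwidth at its upper bound and an irrelevant regressor, used above to read Tables 2–4, can be illustrated with a toy Nadaraya–Watson estimator. The sketch assumes the Aitchison and Aitken (1976) kernel for an unordered discrete regressor, which is consistent with the stated upper bound $(d_s - 1)/d_s$ but is not confirmed here as the chapter's exact specification; the data and function names are hypothetical.

```python
def aitchison_aitken(xi, x, lam, c):
    """Kernel weight for an unordered discrete regressor with c categories."""
    return 1.0 - lam if xi == x else lam / (c - 1)


def nw_estimate(x, xs, ys, lam, c):
    """Nadaraya-Watson estimate of E[y | x] with a single discrete regressor."""
    weights = [aitchison_aitken(xi, x, lam, c) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical data: three categories with clearly different mean outcomes.
xs = [0, 0, 1, 1, 2, 2]
ys = [1.0, 3.0, 10.0, 12.0, 20.0, 22.0]
c = 3
upper = (c - 1) / c

for lam in (0.0, 0.3, upper):
    print(lam, [round(nw_estimate(x, xs, ys, lam, c), 3) for x in range(c)])

# Upper bounds implied by 2 gender categories and 13 provinces.
print(round((2 - 1) / 2, 3), round((13 - 1) / 13, 3))
```

With a bandwidth of zero the estimator returns the cell means; at the upper bound every observation receives the same weight, so the fitted value is the overall sample mean regardless of the category, which is why such a regressor is said to be smoothed out.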
4. CONCLUSION

In this chapter, we revisit the issue of a negative causal effect of sibship size on the ‘quality’ of children, as well as highlight the usefulness of combining Generalized Kernel Estimation with Frölich’s (2007) nonparametric c-LATE estimator. The methodology allows for heterogeneous treatment effects as well as unordered discrete, ordered discrete, and continuous covariates in the model. The results indicate several conclusions. First, the nonparametric c-LATE estimator reveals a negative and statistically significant c-LATE of residing in a household with three or more children (relative to only two children) when analyzing all children under 10 years of age. Specifically, in the subpopulation of compliers with the SameSex2 or Daughter2 instrument, changing from a household with only two children to one with more than two children leads to a decrease in weight-for-age by an average of more than one standard deviation. However, when we analyze the subsample of first and second born children, or only first born children, or utilize the Three treatment, the estimates are statistically insignificant. Second, TSLS estimation, which imposes a specific functional form, yields a smaller, and only mildly statistically significant, point estimate using the full sample and the SameSex2 instrument. Moreover, the TSLS estimation indicates significant heterogeneity in the ATE across the subpopulations of compliers with the SameSex2 and Daughter2 instruments. Thus, the flexibility afforded by the nonparametric approach matters in the full sample. Furthermore, as with the nonparametric c-LATE estimator, the TSLS estimates are not statistically significant when analyzing the Three treatment using either instrument. 
Third, the relationships between the Daughter2 instrument and the treatments are weaker than between the SameSex2 instrument and the treatments, suggesting that the first-stage strength of the latter instrument does not arise due to strong preference for sons in Indonesia. This finding, along with the findings of previous researchers, helps validate the use of this identification strategy in the Indonesian context as it is not likely that Indonesian households engage in behavior designed to alter the gender
composition of children. Finally, as the conclusions in this application depend crucially on the sample chosen and the flexibility of the estimator employed, future work by applied researchers – in this subject area and others – ought to take note. In particular, given recent nonparametric methodological developments, researchers should be wary of imposing parametric functional forms unless there is a solid basis for such a specification.
NOTES

1. For an introduction to selection on observables estimators, see Heckman et al. (1999) and Lee (2005).
2. However, achieving $\sqrt{n}$-consistency actually requires one to undersmooth in order to reduce the bias of the nonparametric estimates (see Theorem 3, Condition iv C and the discussion that follows in Frölich, 2007). Nevertheless, conventional methods, such as the AICc criterion, appear to perform well in finite samples (see Frölich, 2004, 2005).
3. Throughout the text, we use the terms ‘household’ and ‘family’ interchangeably. In the data section, we explicitly define our empirical measures.
4. Cogill (2003, p. 11) notes that weight-for-age ‘‘reflects both past (chronic) and/or present (acute) undernutrition.’’ See also Thomas et al. (1991, 1996).
5. Checking household registrars in the 1993 and 1997 waves also helps minimize the misclassification of D and Z due to infant mortality.
6. Observations with missing data on parental education are excluded. In the samples using all children, this results in 370 children being dropped. Missing values for mother’s and father’s weight are replaced with sample means. Again, in the samples using all children, father’s (mother’s) weight is missing in roughly 13% (3%) of the observations. Gender of the second child is excluded in all cases, otherwise the common support condition (C5) would be invalidated. Gender of the first child is excluded when using the Daughter2 instrument for the same reason.
7. The validity of the Durbin–Wu–Hausman test rests on the consistency of the bootstrap estimate of the standard error of the nonparametric estimator.
8. Note, the usual approach to resolving the problem of a negative test statistic – utilization of a common $\hat{\sigma}$ in the computation of the test statistic – is not pursued here since the variance of the nonparametric estimate is obtained via bootstrap.
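The negative Durbin–Wu–Hausman statistics discussed in Notes 7 and 8 can be illustrated with the scalar form of the test, in which the squared difference between the two estimates is divided by the difference of their estimated variances. The sketch below is an assumption about how such a statistic might be computed, not the authors' GAUSS code; it uses the point estimates and standard errors as printed in Table 5, and the function name is hypothetical.

```python
def hausman_scalar(b_consistent, se_consistent, b_efficient, se_efficient):
    """Scalar Hausman statistic: squared difference over the variance difference.

    The denominator is negative whenever the estimate that is supposed to be
    less efficient has the smaller estimated standard error.
    """
    variance_difference = se_consistent ** 2 - se_efficient ** 2
    return (b_consistent - b_efficient) ** 2 / variance_difference


# Full sample, MoreThan2 treatment, SameSex2 instrument (Table 5 values).
full_sample = hausman_scalar(1.489, 0.434, 1.306, 0.708)

# First born only, MoreThan2 treatment, Daughter2 instrument (Table 5 values).
first_born = hausman_scalar(3.944, 19.123, 0.590, 5.315)

print(round(full_sample, 3))  # -0.107
print(round(first_born, 3))   # 0.033
```

Under these assumptions, the full-sample statistic is negative because the bootstrap standard error of the nonparametric estimate (0.434) is below the TSLS standard error (0.708), while the first born only value is close to the 0.03 reported in the text.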
ACKNOWLEDGMENTS

The authors are grateful to two anonymous referees, Josh Angrist, Jeff Smith, and Markus Frölich for helpful comments. The data and GAUSS code used in the chapter are available upon request.
REFERENCES

Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113, 231–263.
Aitchison, J., & Aitken, C. B. B. (1976). Multivariate binary discrimination by kernel method. Biometrika, 63, 413–420.
Angrist, J., & Evans, W. (1998). Children and their parents’ labor supply: Evidence from exogenous variation in family size. American Economic Review, 88, 450–477.
Angrist, J. D., Graddy, K., & Imbens, G. W. (2000). The interpretation of instrumental variables estimators in simultaneous equations models with an application to the demand for fish. Review of Economic Studies, 67, 499–527.
Angrist, J. D., & Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90, 431–442.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444–455.
Angrist, J. D., Lavy, V., & Schlosser, A. (2005). New evidence on the causal link between the quantity and quality of children. NBER Working Paper no. 11835.
Becker, G. S. (1960). An economic analysis of fertility. In: Demographic and economic change in developed countries. Universities-National Bureau Conference Series, no. 11. Princeton: Princeton University Press.
Becker, G. S., & Lewis, H. G. (1973). On the interaction between the quantity and quality of children. Journal of Political Economy, 82, S279–S288.
Becker, G. S., & Tomes, N. (1976). Child endowments and the quantity and quality of children. Journal of Political Economy, 84, S143–S162.
Black, S. E., Devereux, P. J., & Salvanes, K. G. (2005). The more the merrier? The effect of family composition on children’s education. Quarterly Journal of Economics, 120, 669–700.
Bloom, D. E., Canning, D., & Sevilla, J. (2001). The effect of health on economic growth: Theory and evidence. NBER Working Paper no. 8587.
Butcher, K. F., & Case, A. (1994). The effect of sibling sex composition on women’s education and earnings. Quarterly Journal of Economics, 109, 531–563.
Cogill, B. (2003). Anthropometric indicators measurement guide, food and nutrition technical assistance project. Washington, DC: Academy for Educational Development.
Conley, D., & Glauber, R. (2005). Parental education investment and children’s academic risk: Estimates of the impact of sibship size and birth order from exogenous changes in fertility. NBER Working Paper no. 11302.
Cruces, G., & Galiani, S. (2004). Fertility and female labor supply in Latin America: New causal evidence. Unpublished manuscript, London School of Economics.
Fogel, R. (1994). Economic growth, population theory and physiology: The bearing of long-term processes on the making of economic policy. American Economic Review, 84, 369–395.
Frölich, M. (2004). Finite sample properties of propensity-score matching and weighting estimators. Review of Economics and Statistics, 86, 77–90.
Frölich, M. (2005). Matching estimators and optimal bandwidth choice. Statistics and Computing, 15, 197–215.
Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139, 35–75.
Glick, P., Marini, A., & Sahn, D. E. (2005). Estimating the consequences of changes in fertility on child health and education in Romania: An analysis using twins data. Food and Nutrition Policy Program Working Paper no. 183. Cornell University.
Härdle, W. (1990). Applied nonparametric regression. Cambridge: Cambridge University Press.
Heckman, J. (1997). Instrumental variables – A study of implicit behavioral assumptions used in making program evaluations. Journal of Human Resources, 32, 441–462.
Heckman, J., LaLonde, R., & Smith, J. (1999). The economics and econometrics of active labour market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics. New York: North Holland.
Hurvich, C. M., Simonoff, J. S., & Tsai, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, 60(Series B), 271–293.
Imai, K., & van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 854–866.
Imbens, G., & Angrist, J. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–475.
Jones, M. C., Marron, J. S., & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91, 401–407.
Kevane, M., & Levine, D. I. (2001). The changing status of daughters in Indonesia. Working Paper no. 74. University of California, Berkeley Institute of Industrial Relations.
Lee, J. (2004). Sibling size and investment in children’s education: An Asian instrument. IZA Discussion Paper no. 1323.
Lee, M.-J. (2005). Micro-econometrics for policy, program, and treatment effects. Oxford: Oxford University Press.
Levine, D. I., & Ames, M. (2003). Gender bias and the Indonesian financial crisis: Were girls hit hardest? Working Paper no. C03-130. University of California, Berkeley Center for International and Development Economics Research (CIDER).
Li, Q., & Racine, J. (2004). Cross-validated local linear nonparametric regression. Statistica Sinica, 14, 485–512.
López-Casasnovas, G., Rivera, B., & Currais, L. (2005). Health and economic growth: Findings and policy implications. Cambridge: MIT Press.
Pagan, A., & Ullah, A. (1999). Nonparametric econometrics. Cambridge: Cambridge University Press.
Pitt, M. M. (1997). Estimating the determinants of child health when fertility and mortality are selective. Journal of Human Resources, 32, 127–158.
Qian, N. (2005). Quantity–quality: The positive effect of family size on school enrollment in China. Unpublished manuscript, Department of Economics, MIT.
Racine, J., & Li, Q. (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119, 99–130.
Rosenbaum, P., & Rubin, D. (1983). The central role of the propensity score in observational studies on causal effects. Biometrika, 70, 41–55.
Rosenzweig, M., & Wolpin, K. I. (1980). Testing the quantity–quality fertility model: The use of twins as a natural experiment. Econometrica, 48, 227–240.
Smith, J. A., & Todd, P. E. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125, 305–353.
Stock, J. H., Wright, J. H., & Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics, 20, 518–529.
Strauss, J., Beegle, K., Sikoki, B., Dwiyanto, A., Herawatt, Y., & Witoelar, F. (2004a). The third wave of the Indonesia family life survey: Overview and field report. Rand Labor and Population Working Paper Series.
Strauss, J., Beegle, K., Sikoki, B., Dwiyanto, A., Herawatt, Y., & Witoelar, F. (2004b). User’s guide for the Indonesia family life survey, wave 3. Rand Labor and Population Working Paper Series.
Strauss, J., & Thomas, D. (1998). Health, nutrition and economic development. Journal of Economic Literature, 36, 766–817.
Thomas, D., & Frankenberg, E. (2002). Health, nutrition and prosperity: A microeconomic perspective. Bulletin of the World Health Organization, 80, 106–113.
Thomas, D., Lavy, V., & Strauss, J. (1996). Public policy and anthropometric outcomes in the Côte d’Ivoire. Journal of Public Economics, 61, 155–192.
Thomas, D., Strauss, J., & Henriques, M. (1990). Child survival, height for age and household characteristics in Brazil. Journal of Development Economics, 33, 197–234.
Thomas, D., Strauss, J., & Henriques, M. (1991). How does mother’s education affect child height? Journal of Human Resources, 26, 183–211.
Vidmar, S., Carlin, J., Hesketh, K., & Cole, T. (2004). Standardizing anthropometric measures in children and adolescents with new functions for egen. Stata Journal, 4, 50–55.
Wang, M. C., & van Ryzin, J. (1981). A class of smooth estimators for discrete estimation. Biometrika, 68, 301–309.
Wolpin, K. I. (1997). Determinants and consequences of the mortality and health of infants and children. In: M. R. Rosenzweig & O. Stark (Eds), Handbook of population and family economics (Vol. 1A). Amsterdam: Elsevier Science B.V.
Yau, L., & Little, R. (2001). Inference for the complier-average causal effect from longitudinal data subject to noncompliance and missing data, with application to a job training assessment for the unemployed. Journal of the American Statistical Association, 96, 1232–1244.
PROGRAM PARTICIPATION, LABOR FORCE DYNAMICS, AND ACCEPTED WAGE RATES

Jakob Roland Munch and Lars Skipper

ABSTRACT

We apply a recently suggested econometric approach to measure the effects of active labor market programs on employment, unemployment, and wage histories among participants. We find that participation in most of these training programs produces an initial locking-in effect and for some even a lower transition rate from unemployment to employment upon completion. Most programs, therefore, increase the expected duration of unemployment spells. However, we find that the training undertaken while unemployed successfully increases the expected duration of subsequent spells of employment for many subpopulations. These longer spells of employment come at a cost of lower accepted hourly wage rates.
1. INTRODUCTION

This chapter is concerned with uncovering the effects of publicly subsidized training programs for the unemployed using observational data. Recent
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 197–262
Copyright © 2008 by Elsevier Ltd.
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00008-4
research within this area has documented the pivotal importance of aligning labor force dynamics of participants and potential controls in the period leading up to the participation decision for credibly uncovering causal relationships using this kind of data (see Heckman, Ichimura, Smith, & Todd, 1998; Heckman & Smith, 1999). In this chapter, we will use the identification strategy that has been dubbed timing of events (see Abbring & van den Berg, 2003). The idea here is to directly align the hazards out of unemployment among treated and controls, and then use (conditional) randomness at the moments at which training is initiated over these spells of unemployment to uncover the causal effects that training has on outcomes. We extend the econometric framework of Abbring and van den Berg (2003) in dimensions appropriately suited for the kind of programs we evaluate. We consider generic training programs and decompose and evaluate the effects of participation in multiple dimensions, all highly relevant and easily interpretable within an economic model of optimal job search behavior. Specifically, we estimate jointly the effect that participation in training has on transition out of unemployment while the training takes place,1 the effect on the transition rate out of unemployment after completion, the effect it has on the length of subsequent spells of employment, and the impact that training will have on accepted wage rates. That is, we will use an extended multivariate duration model in which the inflow into different kinds of training programs, the outflow from unemployment, the accepted hourly wage rate, and the outflow from subsequent employment are specified and allowed to interact jointly. As mentioned above, the idea underlying identification in our model is to use the randomness at which the spells of training are initiated, and combine this with both pre- and postprogram durations. 
With this information, and the assumption that the different transition processes can be modeled jointly as mixed proportional hazards, the model is identified without the need of unpalatable exclusion restrictions. Intuitively, information on the correlations between unobserved heterogeneity components in the different labor market states and the earning potentials among the agents can be obtained from the durations of these states and the observed wage rates. Because we model the unobservables explicitly, this method will give an estimate of the treatment effect taking into account that both observables and unobservables may determine the processes. It is already important at this point to explicitly notice that even with access to experimental data, where the selection into training is exogenously manipulated, we would still be forced to rely on these nonexperimental approaches in order to estimate the interdependent effects just outlined (see Ham & LaLonde, 1996; Card & Hyslop, 2005).
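The structure described above, a proportional exit rate from unemployment that shifts while training takes place and again after completion, can be sketched numerically. The example below is a deliberately simplified illustration and not the authors' model: it assumes a constant baseline hazard, a single multiplicative unobserved heterogeneity term, and hypothetical parameter names and values.

```python
import math


def integrated_hazard(t, base, frailty, start, length, delta_in, delta_post):
    """Integrated exit rate from unemployment up to elapsed duration t.

    The exit rate equals base * frailty before the program starts, is scaled
    by exp(delta_in) while the program runs, and by exp(delta_post) afterwards.
    """
    end = start + length
    before = min(t, start)
    during = max(0.0, min(t, end) - start)
    after = max(0.0, t - end)
    return base * frailty * (
        before + math.exp(delta_in) * during + math.exp(delta_post) * after
    )


def survival(t, base, frailty, start, length, delta_in, delta_post):
    """Probability of still being unemployed at elapsed duration t."""
    return math.exp(
        -integrated_hazard(t, base, frailty, start, length, delta_in, delta_post)
    )


# Hypothetical spell: weekly exit rate 0.05, program from week 10 to week 30.
no_program = survival(40.0, 0.05, 1.0, 10.0, 20.0, 0.0, 0.0)
locked_in = survival(40.0, 0.05, 1.0, 10.0, 20.0, -1.0, 0.2)
print(round(no_program, 4), round(locked_in, 4))
```

In this toy setting a negative in-program effect (a locking-in effect) raises the probability of remaining unemployed at week 40 even though the assumed post-program effect is positive, which is the kind of trade-off the joint model is designed to separate.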
Program Participation, Labor Force Dynamics, and Accepted Wage Rates
We focus on the training programs introduced on a large scale to the unemployed in the Danish labor market in 1994, where improving the job prospects of the unemployed was considered as the main aim. In fact, the programs were introduced on such a large scale that today Denmark (or Sweden, depending on how it is measured) is the country in Europe that spends the most on active labor market policy as a share of GDP. In the light of this, Kluve and Schmidt (2002) in a recent review of active labor market programs (ALMPs) in Europe, highlight Denmark as the prime example among European countries performing the transition from a benefit system of passive measures to one of active measures. These authors also conclude that many European ALMPs have been introduced without any prior knowledge about their effects, and they make a call for independent evaluations to play a more important role in the implementation of the programs. The purpose of this chapter will, therefore, also be to contribute with a thorough evaluation of the Danish large-scale system of ALMPs.2 Specifically, using a 10 percent random draw of the Danish population with detailed information on individuals’ labor market states collected on a weekly basis in administrative registers, we evaluate and determine the effects of the different programs as compared to an outcome in which they had continued in ‘open’ unemployment without intervention. In this chapter, we test whether the unemployed were helped in getting back to work by these new ALMPs, whether the programs helped the participants in keeping their jobs, and whether participants were able to earn higher wages once they got back in the labor market; that is, we evaluate the effects of the programs in terms of expected unemployment duration, expected employment duration, and expected hourly wage rate. The applied model assumes the existence of a common treatment effect. 
In the light of the recent studies on the heterogeneity of effects of training programs, this assumption is clearly not innocuous. We, therefore, estimate our full-blown model on stratified subsamples of our data to investigate the robustness of the results with respect to heterogeneity in outcomes. Specifically, we investigate impacts of training across gender, four different levels of education, and for five different age cohorts. The proportional hazard model and the recent timing-of-events method have been used in a number of studies before in connection with the evaluation of government-sponsored training programs.3 The first such study that we are aware of is Gritz (1993) who models training, nonemployment, and employment as three distinct states using observational data from NLSY. Here it is found that private training programs increased the duration of subsequent spells of employment among women
JAKOB ROLAND MUNCH AND LARS SKIPPER
and as such is considered successful, whereas the result for men was more ambiguous in that the duration of both subsequent employment and nonemployment spells increased. The result from the public training programs is negative for both men and women, as the duration of the employment spells decreased after participation. Ham and LaLonde (1996) use experimental data on women from the NSW Demonstration combined with the hazard modeling approach and find that this program, like the similar private training programs for women in Gritz (1993), worked through an increase in the duration of subsequent employment spells. Using the same setting as in Ham and LaLonde (1996), Eberwein, Ham, and LaLonde (1997) evaluate classroom training (CT) participation for women in the JTPA study and find that this type of training works through an increase in the transition rate out of unemployment and that the training has no effect on subsequent spells of employment. Bonnal, Fougère, and Sérandon (1997) evaluate the effect of French training programs among young and unskilled male workers, and find that private job training works through an increase in the intensity at which participants leave subsequent spells of unemployment, whereas the opposite is the case for public job training. Common to these studies is that they all rely on retrospective information on labor force states collected from surveys and, therefore, on people’s notoriously poor ability and willingness to recollect and report the exact timing of events, information obviously crucial in these kinds of studies. Moreover, both Gritz (1993) and Bonnal et al. (1997) aggregate selection across training programs, such that any subsequent inference about differences in the workings of the training programs becomes confounded with differences in the selection processes governing them.
In Richardson and van den Berg (2001), Swedish vocational CT programs are evaluated for the entire population of participants using the timing-of-events method on the transition out of unemployment. Here, as will be the case below, training is modeled as a subspell of the ongoing spell of unemployment, and not as a separate state as in the above-mentioned studies. Not surprisingly, given the nature of CT programs, participation lowers the transition rate out of unemployment while the program is ongoing, the so-called locking-in effect, whereas the subsequent, or postprogram, effect on the transition rate is positive, such that the resulting overall net effect on the individual’s unemployment duration is about zero. The present chapter has most in common with Lalive, van Ours, and Zweimüller (2002) and Bolvig, Jensen, and Rosholm (2003) in that they evaluate entire systems of ALMPs. Lalive et al. (2002) study the effects of training programs, employment programs, and wage
subsidy programs in Switzerland and, with the exception of temporary wage subsidies to foreign males, find no positive effects on the job finding rate. Compared to the Danish ALMPs, the Swiss system is of a much smaller extent. Bolvig et al. (2003) evaluate active social policy in a large Danish municipality. The programs at study are employment and training measures offered to unemployed noninsured workers, nonworkers, disabled, and persons with other social problems. The employment measures have a positive net effect on the job finding rate while there is a negative effect of training measures. The results found in this chapter highlight the relevance of considering the effects of training participation in multiple dimensions: effects of training on subsequent employment duration and wages should also be taken into account, since these effects for some subgroups are more important and unambiguous than the effects of training on unemployment durations. Concerning the latter impact, we typically find evidence of locking-in effects, that is, while participation takes place, the cost of search obviously increases and the participants have a lower transition rate into regular employment. This effect is sometimes counteracted by positive postprogram effects, but overall the programs do not have any quantitatively important effect on expected unemployment durations, increasing these by weeks or perhaps a month only. An important exception is a category of residual programs – essentially a mixture of training and educational programs directed toward a weaker group of workers – that have prolonging effects on unemployment duration ranging up to half a year depending on subpopulation. Turning to the effects of participation on the expected duration of subsequent spells of employment, there is a clear prolonging impact from participation in private on-the-job training (OJT), while the opposite is true for participants in public OJT. 
In this respect, the results from our study are in full accordance with the results found in Gritz (1993) and Ham and LaLonde (1996) outlined above. We find that those with low levels of formal skills, the group of youths, and those with only low levels of initial education gain from participation in ordinary CT in terms of increases in expected employment duration. The same is also true for the group with potential obsolete formal skills, i.e., the subgroup of people above 50; there appears to be no impact on employment duration from participation for the remaining subpopulations at all, rather depressing results in this respect, but fully in line with what was found in Eberwein et al. (1997). Turning to the residual group of training programs, which were found to have a negative effect in terms of unemployment duration, we find that the impact from participation on the employment duration is often sufficiently positive to render the overall attachment to the labor market positive.
Finally, the effect of training participation on the subsequent hourly wage rate is relatively unambiguous; typically, wages are reduced by 3–7 percent. It is important to note here, though, that we do not consider the effect of participation on overall labor earnings. The estimated negative effect on the hourly wage rate might, to a large extent, be offset by the higher attachment to the labor market in terms of employability. For recent studies where earnings are considered, see Gerfin, Lechner, and Steiger (2005) and Lechner, Miquel, and Wunsch (2004).4 In sum, the results indicate that training, not surprisingly, takes time and tends to prolong the duration of unemployment and that participants in private OJT, ordinary CT, and the residual type of training subsequently have longer spells of employment as a consequence of participation, but attain this at the cost of a lower hourly wage rate. For public OJT, the effects are found to be negative in all dimensions under study, and as such the working of this part of the Danish active labor market system is no different from that of most other countries. One possible explanation as to why people continue to participate in this specific part of the ALMPs would be that in a system such as the Danish with a mandatory or workfare aspect, this type of program perhaps has the lowest effort cost among the choices available to the unemployed and, as such, participation in public OJT sends a negative signal to future potential employers that would otherwise have been unobserved. For a discussion of issues of signaling effects of programs, see Gerfin et al. (2005). To sum up, in the chapter we evaluate the effects of ALMPs offered to unemployed members of UI funds in Denmark between 1995 and 2000, using highly accurate and detailed information on people’s labor force dynamics from administrative registers using a flow sample. The chapter contributes to the international literature in both specific and general ways. 
First, it provides a wide range of new results on the effects of ALMPs in Denmark and as such sheds important light on the effects of the programs from a country implementing these policies on a large scale. Second, it implements and extends new methodology to program evaluation. We argue that the extensions we make will be of significant importance when performing studies of this kind and that incorporation of our suggestions might very well change the conclusions when assessing the successfulness of these policies. The rest of the chapter is organized as follows: Section 2 contains our data description along with information about the institutional environment. In Section 3, the econometric model is specified and assumptions needed for identification are stated and discussed. Results are presented in Section 4, and Section 5 concludes the chapter.
2. INSTITUTIONAL SETTING AND DATA DESCRIPTION
The 1994 reform of the Danish unemployment insurance system had several elements. First, the possibility for the unemployed to renew eligibility for benefit periods by participating in active labor market measures was abolished and, second, the maximum time for receiving benefits was gradually shortened from nine to four years. Moreover, the active labor market measures were strengthened such that, in principle, benefit entitlements were made conditional on participation in training programs after an initial period in ‘open’ unemployment. This time until participation in ALMPs has been advanced gradually since, such that after 1999 the unemployed are, in principle, obliged to participate after one year of unemployment. Once this period of unconditional benefits has expired, the unemployed must participate in ALMPs during 75 percent of further time spent in unemployment.5 Finally, the reform introduced individual training plans with the purpose of targeting the training effort towards the needs of the unemployed and the local labor markets. A host of different programs are available to the unemployed, with the three most important ones being CT, public OJT, and private OJT (see Table 1).

Table 1. Distribution of Programs: The Treated UI Recipients’ Distribution Across Programs, 1995–2000 (Percentage Points).

Program                       1995  1996  1997  1998  1999  2000
Private OJT                     14    10     9     8     7     6
Public OJT                      31    19    19    15    16    15
Ordinary CT                     30    46    44    55    59    65
Individual job training          8     4     8     7     5     6
Specially designed education     8     6     5     4     4     2
Specially designed programs      2     4     3     3     3     3
Employment programs              0     7     8     7     5     1
Other programs                   7     4     4     1     1     2

Note: The group considered is the unemployed with UI aged between 19 and 66, and only the first program for each person in each year is used.

Participants in (most often vocational) CT will get their UI benefits while participating,6 whereas participants in OJT will receive the centrally negotiated minimum wage while participating. Participation will, therefore, increase earnings by up to 25 percent in OJT compared to staying on UI. Moreover, the firm taking in the unemployed for job training will receive a refund equivalent to the maximum UI-benefit level, as well as
subsidies to mentors and potential equipment needed for the training. Participation in OJT is meant to result in an upgrade of the professional and technical skill base and facilitate a general rehabilitation to the labor market. The remaining programs are targeted toward weaker groups of unemployed ‘who are having difficulties finding jobs or job training under regular circumstances with respect to wages and working conditions’ and will generally entail a stronger component of basic education. These will be pooled into one residual program for apparent computational reasons in the econometric analysis below. The proportion of the unemployed participating in the programs has more than doubled since the first reform in 1994. This is partly due to the aforementioned strengthening of active measures, and partly due to the fact that the reforms also entailed forward shifts in the active period such that more people are affected by the requirements of the activation. In the period 1995–1999, the number of yearly full-time persons participating in some ALMP rose from nearly 45,000 to almost 55,000. After this the number declined to around 42,000 in 2001. In the same period, the number of unemployed persons declined steadily from 288,000 to 145,000. It is worth noticing that an extensive use of leave schemes was introduced in this period, withdrawing a lot of people from the unemployment statistics. Most important for the interpretation of our results below was the possibility for long-termed unemployed (defined as those having been unemployed for at least 12 months out of the previous 15 months) 50–59-year-olds to withdraw permanently from the labor market in 1994 and 1995. This temporarily introduced scheme was, in fact, so lucrative that more than 50,000 unemployed in the relevant age group took advantage of the program and retired early. 
Comparing this figure with the little less than 20,000 unemployed workers above 50 years of age in our 10 percent random sample for the subsequent years (1995–2000), see below, we conclude that the program was indeed very popular; see Bingley, Datta Gupta, and Pedersen (2004) for further discussion of this retirement program. There has also been a shift in the composition of how the different types of training programs have been used (see Table 1). In 1995, 30 percent of all participants enrolled in ordinary CT, while this percentage had risen to 65 by 2000. At the same time, the proportion of those participating in private OJT more than halved from 14 to 6 percent, while the share of participants in public OJT fell from 31 to 15 percent. The duration of private job training programs is on average shorter than that of programs in the public sector. Table 2 shows that among the programs initiated in 1996, private OJT on average had a duration of 22 weeks while public
OJT lasted 39 weeks. Ordinary CT lasted on average 28 weeks, while the entrepreneurship subsidy and employment programs had a considerably longer duration. To illustrate the variation in the time until participation, Fig. 1 plots the (unconditional) Kaplan–Meier hazard rates into each of the four program types. It is evident that the unemployed are selected into the different programs from very early on in their unemployment spells despite the fact that participation only becomes compulsory much later on (no earlier than one year during this period, see above).7 However, the unemployed are also required by law to be available for potential work, and the initiation of some of the programs in the early months of the unemployment spells might be
Table 2. Duration of Programs: Average Duration of Programs Initiated in 1996 (Weeks).

Program             Average duration (weeks)
Private OJT         22
Public OJT          39
Ordinary CT         28
Residual programs   56

Note: The group considered is the unemployed with UI aged between 19 and 66, and only the first program for each person is used.
Fig. 1. Kaplan–Meier Estimates of Weekly Hazards into Training from Unemployment (left) and (open) Unemployment Survivor Function (right) Based on 170,513 Spells of Unemployment. [Figure not reproduced. Horizontal axis: duration of unemployment (weeks), 4 to 104. Left axis: hazard, 0.000 to 0.020. Right axis: survivor, 0 to 0.9. Series: Private OJT, Public OJT, Ordinary CT, Residual, Survivor.]
the result of caseworkers placing the unemployed in programs as tests of the willingness to work. A similar picture is found in other countries with compulsory participation components; see, e.g., Gerfin et al. (2005) for the case of Switzerland. Moreover, whereas the two OJT programs and the residual category of training programs (with hazard rate indistinguishable from public OJT) exhibit a flat profile over the first two years of unemployment, the hazard into CT shows an increasing trend, although estimated on an ever-decreasing pool of potential participants (right scale). This latter picture should be kept in mind when assessing the results of this program below, i.e., CT might very well work as an instrument among caseworkers for meeting the placement requirements among the otherwise difficult-to-place clients with potential suboptimal outcomes as a result. But, again, with weekly hazards below 2 percent from less than 5 percent of the initial pool of entrants into unemployment (cf. the estimated survivor function), the number of relevant persons affected by the estimated impacts is negligible. It is evident from Fig. 1 that the largest chunk of entry into programs takes place in the first 30 weeks. This is seen by simply multiplying the empirical hazards with the fraction of survivors. With the identification strategy we pursue below – a common proportional effect across the spell – these early entrants into the programs will weigh the most. For issues on differences in the effects with respect to initiation of programs over the duration of unemployment, see Gerfin et al. (2005). Unemployed participating in the four training programs differ with respect to observable background characteristics as is evident from Table 3. Here it is seen that the unemployed in public OJT are on average slightly older than the remaining three groups; around two-thirds are women and have relatively short education. 
As opposed to this, participants in private OJT are more often males, are to a larger extent skilled, and have a slightly lower benefit replacement rate. The group participating in CT is different in that they are relatively well educated and are more often women. The group participating in the residual type of training programs is characterized by having a low fraction of married participants, and the majority has less than five years of labor market experience. Thus, there are differences in the personal characteristics across the program participants, although these are not particularly pronounced. The data set we use above and in the econometric analysis below is a 10 percent random sample of entrants into unemployment in the years 1995–2000.8 The data is a longitudinal data set with detailed information of the individual’s labor market states along with information on individual socioeconomic characteristics. The socioeconomic variables are extracted
Table 3. Descriptive Statistics for Participants in the Different Programs.

Variable                   Private OJT   Public OJT   Ordinary CT   Residual Programs(a)
Age (years)                    36            39           38            35
Female                         0.47          0.66         0.62          0.52
Married                        0.42          0.49         0.49          0.39
Children below age of 6        0.26          0.29         0.29          0.29
Non-OECD citizenship           0.01          0.03         0.04          0.04
Experience, 0–4 years          0.39          0.47         0.43          0.57
Experience, 5–9 years          0.32          0.31         0.29          0.28
Experience, 10+ years          0.29          0.23         0.27          0.18
Elementary education           0.40          0.50         0.41          0.49
Vocational education           0.44          0.35         0.36          0.32
High school education          0.05          0.05         0.08          0.08
College and beyond             0.11          0.10         0.15          0.12
Member of union                0.78          0.84         0.82          0.74
UI replacement rate            0.81          0.85         0.83          0.82

Note: The group considered is unemployed members of UI funds aged between 19 and 66, and only the first program in the period 1995–2000 for each person is accounted for.
(a) The group of residual programs consists of all programs apart from private OJT, public OJT, and ordinary CT (see Table 1).
from the integrated database for labor market research (IDA) and the income registers in Statistics Denmark. For individuals in our sample, event histories are created such that we are able to identify every person’s labor market state in any week during the years. That is, we know whether the individuals are employed, unemployed, participating in ALMPs, or out of the labor force. The hourly wage rate is calculated from annual labor earnings and number of working hours. The measure of working hours used in this calculation is very precise in that this information comes from registers on compulsory contributions to supplemental pension payments that are closely linked to the working hours actually paid for by employers. Our sample consists of all UI fund members between 19 and 66 years of age in the period 1995–2000. Individuals having participated in any program prior to 1995 are excluded from our evaluation and only the treatment effect of the first program in this period is evaluated. Observations with more than one spell of training participation are censored at the time of entry into the second program spell. In the sample, there are 102,411 individuals who share among them nearly 470,000 employment, training, and unemployment spells over the period. Of these spells, 17,978 are ALMP spells and 269,777 are employment spells.
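The two data-construction rules described above, the hourly wage computed from annual earnings and hours and the censoring of spells at entry into a second program, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the week-based time index, and the example numbers are hypothetical and do not come from the registers or the authors' code.

```python
def hourly_wage(annual_earnings, annual_hours):
    """Hourly wage rate as annual labor earnings divided by annual working hours."""
    if annual_hours <= 0:
        raise ValueError("annual_hours must be positive")
    return annual_earnings / annual_hours


def censor_at_second_program(spell_end_week, program_start_weeks):
    """Return (observed_end_week, censored_flag) for one spell.

    Assumption: weeks are counted from the start of the spell. If a second
    program starts before the spell ends, the spell is treated as
    right-censored at that entry week; otherwise it is left unchanged.
    """
    starts = sorted(program_start_weeks)
    if len(starts) > 1 and starts[1] < spell_end_week:
        return starts[1], True
    return spell_end_week, False


# Hypothetical example values (not taken from the data).
wage = hourly_wage(300_000, 1_500)
two_programs = censor_at_second_program(80, [12, 50])
one_program = censor_at_second_program(80, [12])
print(wage, two_programs, one_program)
```

Only the first program contributes a treatment effect in this sketch, which mirrors the restriction to the first program stated in the text.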
The demographic characteristics that we condition on are age-group dummies, gender, marital status, dummies for the presence of children, citizenship, and city size. These variables are found in the literature to be of great importance in determining not only employability, but also the probability of taking training. For example, women with family responsibilities because of dependent children are supposedly less likely to engage in training as the perceived opportunity costs are higher. Attained education is captured by dummies for basic schooling, high school, and further education, with vocational education as reference. Again, these variables are of great importance in determining the duration of both unemployment and employment, as well as in predicting training participation and expected wage rates. Labor market experience since 1964 is included, and we also include the rate at which UI benefits replace the latest observed wage rate. This rate has a relatively high ceiling of 90 percent. A dummy for membership of a trade union is included, and the type of previous industry is also controlled for by inclusion of nine industry-specific UI fund membership dummies. We capture business-cycle effects by including dummies for the year in which the spells are initiated. Finally, we include an indicator for the remaining weeks of UI benefits that the unemployed had at the beginning of the unemployment spell. This variable is defined as the difference between the maximum number of weeks with benefits and the individual’s UI seniority at the beginning of the spell. The UI seniority is the number of weeks the unemployed were previously unemployed and received UI benefits. In 1995, UI seniority was reset whenever the individual had been employed for 26 weeks, but this requirement was strengthened to 52 weeks by January 1997.
The maximum number of weeks with benefits was gradually shortened from nine to four years during the 1990s (see discussion above), and this means that the number of weeks remaining with benefits is reduced each time the maximum time limit is shortened.
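The remaining-benefits variable described above is simple arithmetic, and a small Python sketch may make the definition concrete. Several things here are assumptions made for illustration only: the function names, the conversion of 52 weeks per year, the treatment of 1996 as still subject to the 26-week rule, and the example inputs. The exact year-by-year schedule of the maximum benefit period is not given in the text, so it is passed in as an argument.

```python
WEEKS_PER_YEAR = 52  # assumption: used only to convert benefit years to weeks


def reset_threshold_weeks(year):
    """Weeks of employment needed to reset UI seniority.

    The text gives 26 weeks for 1995 and 52 weeks from January 1997;
    applying 26 weeks to 1996 as well is an assumption.
    """
    return 26 if year < 1997 else 52


def ui_seniority(previous_ui_weeks, weeks_employed_since, year):
    """UI seniority at spell start: zero if the reset rule has been met."""
    if weeks_employed_since >= reset_threshold_weeks(year):
        return 0
    return previous_ui_weeks


def remaining_benefit_weeks(max_benefit_years, seniority_weeks):
    """Maximum weeks with benefits minus UI seniority at spell start."""
    return max_benefit_years * WEEKS_PER_YEAR - seniority_weeks


# Hypothetical person with 40 earlier weeks on UI benefits.
print(remaining_benefit_weeks(5, ui_seniority(40, 10, 1996)))  # no reset
print(remaining_benefit_weeks(5, ui_seniority(40, 30, 1995)))  # reset after 26 weeks
print(remaining_benefit_weeks(4, ui_seniority(40, 30, 1998)))  # 30 weeks no longer enough
```

The third call shows both mechanisms at once: a shorter maximum period and a stricter reset rule each reduce the remaining weeks for an otherwise identical person.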
3. ECONOMETRIC MODEL
This section describes the econometric method we use for identification of treatment effects and how the self-selection into ALMPs is controlled for. The main problem is how to calculate the effect of treatment on those treated compared to a state in which they were not treated: the problem of constructing counterfactuals. It is not straightforward to create a suitable control group in the Danish labor market, as all unemployed, in principle, have to participate in a training program at some point in time,
should they stay unemployed long enough. Thus, late in the unemployment spells there are few or no nonparticipants to compare with. We solve this problem of constructing counterfactuals by the use of the already mentioned timing-of-events methodology of Abbring and van den Berg (2003). The method exploits the variation in the starting dates of the different types of programs over the unemployment spells (cf. Fig. 1). Some unemployed participate early in their spells, and therefore, unemployed individuals not yet participating in ALMPs can be used as a comparison group over this time interval. If the effect of participation is assumed constant irrespective of when it is initiated over the unemployment spell, then the relevant counterfactual for the unemployed participating in programs at later stages in their unemployment spell can be deduced. In this way, a hazard rate for a hypothetical nonparticipant is derived, and so the effects of program participation can be calculated. In what follows, a duration model is specified, where effects of personal characteristics on the exit rate out of a specific labor market state are obtained. The labor market states under consideration are unemployment, employment, and duration until program participation. Following Abbring and van den Berg (2003), participation in a program is seen as a part of the unemployment spell, and is not considered a separate state. That is, the unemployed are observed in the states of unemployment and pretraining participation at the same time, and the exit rates for these states are modeled and estimated simultaneously. Time until participation is modeled as competing risks hazards, where the different destinations (‘risks’) are the four different types of programs: private OJT, public OJT, CT, and the residual category. So compared to Gritz (1993) or Bonnal et al.
(1997), who aggregate selection across different states, we explicitly allow for selection into treatments to vary with the treatment program. This is accomplished by letting the selection parameters vary freely between our different programs and by allowing for unobservables affecting these selection mechanisms to be correlated across different types of training. Furthermore, to investigate the effects of ALMP participation on the duration of subsequent employment and on earnings potential, the exit rate out of employment and the hourly wage rate are also estimated. The model explicitly takes into account the self-selection into the different ALMPs. The selection that occurs based on observed characteristics is accounted for by using these variables as explanatory factors affecting the competing risks hazard rates into the programs proportionally. On top of this, selection based on unobservables is, under strict assumptions, also possible, as no measures for, e.g., ability or motivation are available, and
unobserved administrative selection of participants into programs may also take place. The econometric model accommodates unobservables in the selection process and in the outcome process as outlined below. Let the continuously distributed random variable $T$ denote the duration of a given labor market state. The hazard rate, which is the probability per unit of time that individuals with given observed and unobserved characteristics, $x$ and $\nu$, exit a given state in the short interval from $t$ to $t+dt$ conditional on being in the state until time $t$, is then given by

\[ \theta(t \mid x, \nu) = \lim_{dt \to 0} \frac{\Pr(t < T \le t + dt \mid T > t, x, \nu)}{dt} \qquad (1) \]
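As a reading aid, the hazard in Eq. (1) can be linked to observed durations through a standard duration-analysis identity that is not specific to this chapter. The symbols $S$ (survivor function) and $f$ (density of exit times) are notation introduced here for illustration and are not used in the original text.

```latex
\[
S(t \mid x, \nu) = \Pr(T > t \mid x, \nu)
  = \exp\!\left( -\int_0^t \theta(s \mid x, \nu)\, ds \right),
\qquad
f(t \mid x, \nu) = \theta(t \mid x, \nu)\, S(t \mid x, \nu).
\]
```

In words, the density of exits at duration $t$ is the hazard times the fraction still in the state. This is the same kind of multiplication that was applied to the empirical hazards and the survivor function in the discussion of Fig. 1.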
The hazard functions in this chapter are specified as mixed proportional hazards, i.e., the functions are products of baseline hazards and functions of observed characteristics, $x$, and unobserved characteristics, $\nu$,

\[ \theta(t \mid x, \nu) = \lambda(t)\,\varphi(x, \nu) \qquad (2) \]
where $\lambda(t)$ is the baseline hazard and $\varphi(x,\nu)$ is the systematic part defined as $\exp(x\beta+\nu)$. The baseline hazard is specified as a piecewise constant function, i.e., $\lambda(t)=\exp(\alpha_m)$ for $t$ in the $m$th segment, $m=1, \ldots, M$, where $M$ is the number of baseline segments to be estimated. Three important assumptions are embedded more or less explicitly in the specification of Eq. (2). Firstly, as we estimate the baseline, $\lambda(t)$, as a piecewise constant hazard, we are in effect letting the data guide us in how the hazard behaves over time. But it is also important to notice that no immediate behavioral interpretation can be given to these estimated coefficients; see, e.g., Lancaster (1990) or the recent survey of duration models in van den Berg (2001). Secondly, we assume that the effects of explanatory variables are proportional to the baseline hazards and, hence, do not vary across the duration of the states. As noted in both Lancaster (1990) and van den Berg (2001), this functional form restriction has little or no economic-theoretical justification, but is nevertheless almost always invoked in empirical duration analysis. Finally, and perhaps most importantly, the unobserved heterogeneity, most often interpreted in these types of models as unobserved cognitive ability, motivation, or self-discipline, has (besides being one-dimensional or scalar) to be stochastically independent of the included observed characteristics, $x$, at the time of the inflow into the relevant spell. In practice, this means that if the interpretation of $\nu$ as, say, ability is entertained, then the distribution of ability has to be identical among observed high achievers, i.e., highly educated, high earners with lots of labor market experience, as among low achievers,
i.e., poorly educated, low earners with marginal attachment to the labor market. Moreover, this assumption is in sharp contrast to the matching literature in program evaluation. There it is typically assumed that the correlation between unobserved components, such as motivation or job readiness, and observed covariates, such as labor market history and attained education, is high enough to render program participation conditionally independent of outcomes; see Lechner (1999, 2000, 2002a, 2002b), Gerfin and Lechner (2002), or Gerfin et al. (2005) for this line of argumentation.

Considering first the hazard rate of the transitions from unemployment to employment,$^{9}$ we use, as part of the observed characteristics, time-varying indicator variables for whether the unemployed person is participating in training and whether one such training spell has been completed. Here $a(t_u)$ is a $4 \times 1$ dummy vector whose elements take the value 1 if the person participates in the given type of training at time $t$, while $c(t_u)$ is a $4 \times 1$ dummy vector indicating whether the unemployed person has completed one of the four different programs prior to time $t$. The hazard rate for an unemployment spell ($u$) can thus be written as

$$\theta_u(t_u \mid x_u, a(t_u), c(t_u), \nu_u) = \lambda_u(t_u)\,\varphi_u(x_u, a(t_u), c(t_u), \nu_u) \tag{3}$$
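As a rough illustration of how Eq. (3) combines a piecewise constant baseline with time-varying training indicators, the following minimal sketch evaluates the hazard for a single program type. It is not the authors' implementation: the coefficient names `gamma_during` and `gamma_after`, the reduction of the $4 \times 1$ vectors to scalar dummies, and all numerical values are assumptions made for illustration only.

```python
import math
from bisect import bisect_right


def baseline_hazard(t, cutpoints, alpha):
    """Piecewise constant baseline: lambda(t) = exp(alpha_m) on segment m."""
    m = bisect_right(cutpoints, t)  # index of the segment that contains t
    return math.exp(alpha[m])


def unemployment_hazard(t, x_beta, nu, train_start, train_end,
                        gamma_during, gamma_after, cutpoints, alpha):
    """Mixed proportional hazard with time-varying training dummies.

    a(t) = 1 while the person is in training, c(t) = 1 once training is
    completed. gamma_during and gamma_after are hypothetical coefficient
    names for the during- and post-program effects (not from the chapter).
    """
    a_t = 1.0 if train_start <= t < train_end else 0.0
    c_t = 1.0 if t >= train_end else 0.0
    index = x_beta + gamma_during * a_t + gamma_after * c_t + nu
    return baseline_hazard(t, cutpoints, alpha) * math.exp(index)
```

With `gamma_during < 0` the hazard shifts down while training is ongoing (a locking-in effect), and with `gamma_after > 0` it shifts up once training is completed.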
This specification lets us distinguish between two different effects of training participation on the duration of unemployment. Including the dummy vector $a(t_u)$ allows us to capture that the unemployed may alter (reduce) their search effort while participating in training. If this effect is negative, it is termed the locking-in effect; however, nothing ex ante restricts these effects to be negative. The vector $c$ captures the posttraining effect, which is positive if training enhances the skills and, thus, the employability of the participants. Again, there might be countervailing effects if, for instance, participants narrow their search to jobs where skills acquired during participation are in demand.

An illustration of potential effects is shown in Fig. 2. Here we have a potentially downward-sloping unemployment hazard. At time $t_1$, the unemployed person enrolls in labor market training, causing a downward shift in the hazard. This may come from lower search intensity while participating, as mentioned above. Once the program ends, at time $t_2$, there is an upward shift in the hazard. In Fig. 2, this shift is of such a magnitude that the rate out of unemployment is now higher than it would have been had the unemployed person not participated in the training program, i.e., there is a positive posttraining effect on the unemployment hazard.

Fig. 2. Potential Effects of Training Participation on Unemployment Hazard.

The time until participation in training ($a$) is specified as a competing risks hazard model, i.e., there are four treatment-specific hazard rates for the four different kinds of training:

$$\theta_{am}(t_{am} \mid x_{am}, \nu_{am}) = \lambda_{am}(t_{am})\,\varphi_{am}(x_{am}, \nu_{am}), \quad m = 1, 2, 3, 4 \tag{4}$$

which altogether gives the hazard rate

$$\theta_a(t_a \mid x_a, \nu_a) = \sum_{i=1}^{4} \theta_{ai}(t_{ai} \mid x_{ai}, \nu_{ai}) \tag{5}$$
where $\nu_a = (\nu_{a1}, \nu_{a2}, \nu_{a3}, \nu_{a4})'$. The duration of the employment spells ($e$) is specified in much the same way as the unemployment duration:

$$\theta_e(t_e \mid x_e, c, \nu_e) = \lambda_e(t_e)\,\varphi_e(x_e, c, \nu_e) \tag{6}$$
where we note that $c$ in Eq. (6) does not vary over the employment spell, in that it just indicates what kind of training (if any) the individual has previously participated in.

As already mentioned, a possible outcome of participation in training programs could be a prolonged duration of unemployment. This result need not be suboptimal from the viewpoint of the participating individual. In standard models of job search (see, e.g., Mortensen, 1977), the unemployed may find it optimal to increase their reservation wage if individual productivity increases. That is, in our context, training participants may find it optimal to increase their reservation wage in order to reap the rewards of a potentially augmented level of human capital. This may not
only have a prolonging effect on unemployment duration, but also lead to a higher wage rate when a job is accepted. To investigate this possibility, we specify a standard wage equation, where the returns to labor depend on education and the amount of job training the individual may have:

$$\log(w) = \nu_w + x_w \beta_w + c + u \tag{7}$$
Now, $x_w$ is composed of variables potentially affecting productivity and hence wages. We augment the standard wage equation to include the effects of the training programs, $c$. We let $u$ be i.i.d. $N(0, \sigma_u)$ and the individual-specific unobserved wage effects, $\nu_w$, be correlated with the unobservables from the duration contributions, $\nu_u$, $\nu_e$, and $\nu_{am}$, $m = 1, \ldots, 4$. That is, in addition to the normally distributed error terms, we have assumed the existence of unobserved (random effects) components with two points of support, as in the other equations. With the unobserved component entering as a random effect, we again need to invoke the assumption of independence between the observed variables, $x_w$, and the unobserved component, $\nu_w$. Moreover, the effects of the different programs are assumed to enter the wage equation additively and, hence, as homogeneous effects.

The parameters of the model are estimated by maximum likelihood, and the contribution to the likelihood function from an unemployment spell is

$$L_u(t_u \mid x_u, a(t_u), c(t_u), \nu_u) = \theta_u(t_u \mid x_u, a(t_u), c(t_u), \nu_u)^{d_u} \exp\left(-\int_0^{t_u} \theta_u(s_u \mid x_u, a(s_u), c(s_u), \nu_u)\, ds_u\right) \tag{8}$$
where $d_u$ takes the value 1 if the observation has ended with a transition to employment, and the value 0 if the observation is censored. The contribution to the likelihood from transitions into training becomes

$$L_a(t_a \mid x_a, \nu_a) = \prod_{m=1}^{4} \theta_{am}(t_{am} \mid x_{am}, \nu_{am})^{a_m(t_{am})} \exp\left(-\int_0^{t_{am}} \theta_a(s_{am} \mid x_a, \nu_a)\, ds_{am}\right) \tag{9}$$

where $a_m(t_{am})$ takes the value 1 if the individual selects into training activity $m$ at $t$, and 0 otherwise. Employment spells contribute with

$$L_e(t_e \mid x_e, c, \nu_e) = \theta_e(t_e \mid x_e, c, \nu_e)^{d_e} \exp\left(-\int_0^{t_e} \theta_e(s_e \mid x_e, c, \nu_e)\, ds_e\right) \tag{10}$$

where $d_e$ equals 1 if the individual is observed to return to unemployment, and 0 otherwise.
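The spell contributions in Eqs. (8) and (10) share the same form: the hazard at the exit time raised to the censoring indicator, multiplied by the survivor function. With a piecewise constant hazard the integral has a closed form. The sketch below computes the log of such a contribution under the simplifying assumption that covariates are constant within the spell, so that the full hazard is itself piecewise constant; the function names and the example rates are illustrative and not taken from the chapter.

```python
import math


def integrated_hazard(t, cutpoints, rates):
    """Integral from 0 to t of a piecewise constant hazard."""
    edges = [0.0] + list(cutpoints) + [math.inf]
    total = 0.0
    for lo, hi, rate in zip(edges[:-1], edges[1:], rates):
        if t <= lo:
            break  # t lies before this segment starts
        total += rate * (min(t, hi) - lo)
    return total


def hazard_at(t, cutpoints, rates):
    """Hazard level on the segment that contains t."""
    for cut, rate in zip(cutpoints, rates):
        if t < cut:
            return rate
    return rates[-1]


def spell_loglik(t, d, cutpoints, rates):
    """log L = d * log(theta(t)) - integral_0^t theta(s) ds.

    d = 1 for an observed transition and d = 0 for a censored spell.
    """
    log_hazard = math.log(hazard_at(t, cutpoints, rates))
    return d * log_hazard - integrated_hazard(t, cutpoints, rates)
```

A censored spell (`d = 0`) contributes only the log survivor function, which is why censoring enters the likelihood without requiring the exit time to be observed.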
We only observe $w$ for those actually finding a job, $d_u = 1$; hence, the likelihood contribution from observed hourly wage rates is

$$L_w = f(\log w - \nu_w - x_w \beta_w - c)^{d_u} \tag{11}$$
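A minimal sketch of the log of the wage contribution in Eq. (11) follows. It assumes that $f$ is the density of the normal error term $u$ and treats `sigma` as its standard deviation; both are assumptions made here for illustration, as are the argument names.

```python
import math


def wage_loglik(log_w, nu_w, x_beta, c_effect, sigma, d_u):
    """d_u * log f(log w - nu_w - x*beta - c), with f the N(0, sigma^2) density.

    The wage is only observed when a job is found (d_u = 1), so spells
    that do not end in employment contribute nothing (log of 1 is 0).
    """
    if d_u == 0:
        return 0.0
    resid = log_w - nu_w - x_beta - c_effect
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - resid ** 2 / (2.0 * sigma ** 2)
```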
The unobserved heterogeneity terms are specified by the stochastic variables $V_u$, $V_a$, $V_e$, and $V_w$, where $V_a$ consists of the four variables $V_{a1}$, $V_{a2}$, $V_{a3}$, and $V_{a4}$. Hence, the complete likelihood function is

$$L = \int_{V_u} \int_{V_a} \int_{V_e} \int_{V_w} L_u(t_u \mid x_u, \nu_u)\, L_a(t_a \mid x_a, \nu_a)\, L_e(t_e \mid x_e, \nu_e)\, L_w(\cdot \mid x_w, \nu_w)\, dG(\nu_u, \nu_a, \nu_e, \nu_w) \tag{12}$$
where $G$ is the joint CDF of the unobserved heterogeneity. The models are estimated under the assumption that the unobserved heterogeneity terms, $V_j$, $j = u, a, e, w$, follow a two-point distribution normalized such that $V_j$ can take the values 0 and $\nu_j$ only. This means that $128\ (= 2^7)$ types may exist, each with a corresponding probability. This assumption is discussed further in Section 4.

An appropriate way to illustrate the effects of participation is to calculate the expected duration of a given baseline state, $E[Y^0]$, by integrating over both observables and unobservables. As the group of treated individuals is systematically different from those not participating (in terms of both observables and unobservables), calculating $E[Y^0 \mid d = m]$ requires us to back out the empirical distribution of unobservables among the treated subpopulations.$^{10}$ As the specification in Eq. (12) is a mixture model, the probability distribution of unobservables among the subgroup of treated, conditional on observed labor market histories, $H_i$, and estimated parameters, $\hat{\Psi}$, is given by

$$\hat{p}_i(V_j = \nu_j \mid H_i, x_i, \hat{\Psi}) = \frac{\hat{p}_j'\, L_i(\hat{\Psi} \mid H_i, x_i, V_j = \nu_j)}{\hat{p}_j'\, L_i(\hat{\Psi} \mid H_i, x_i, V_j = \nu_j) + (1 - \hat{p}_j)'\, L_i(\hat{\Psi} \mid H_i, x_i, V_j = 0)} \tag{13}$$

where $\hat{p}_j'$ denotes the $64 \times 1$ vector of the estimated unconditional probabilities of being a type $\nu_j$, and $L_i(\hat{\Psi} \mid H_i, x_i, V_j = \nu_j)$ is the vector of conditional likelihood contributions evaluated at the estimated parameters $\hat{\Psi}$. With the estimated conditional probabilities, we can calculate the differences in expected unemployment and employment durations because of program participation, and produce the standard parameter presented in
the program evaluation literature, the average treatment effect on the treated (ATET), $E[Y^m - Y^0 \mid d = m]$.

Applying the timing-of-events framework to an evaluation of the Danish ALMPs of course requires that the system evaluated meets the assumptions necessary for identification of the effects. One such requirement is randomness in the time of entry into the different training programs, conditional on the information set. As illustrated in Fig. 1, the data show a high degree of variation in the time until entry, which may be due to different mechanisms. Administrative practices are known to differ across municipalities, and within municipalities variation may arise from variation in the starting dates of programs; certainly there is randomness, from the viewpoint of the unemployed, in private job training programs, which depend on the cooperation of participating firms.

As emphasized by Abbring and van den Berg (2003), a behavioral assumption required for valid use of the timing-of-events method is that the unemployed do not anticipate the exact date of entry into training. If anticipation effects are present, then the unemployed may alter their search intensity for jobs and the estimated treatment effects may be biased. However, anticipation effects are not to be confused with ex ante effects of training. That is, the unemployed are allowed to act on the knowledge that there is a probability of training enrollment in the future. A thorough discussion of this issue is given in Richardson and van den Berg (2001). If, as is typically the case in Denmark, participants are notified only a few weeks in advance of the first day of the training programs, then it is argued that anticipation effects are limited and the estimated treatment effects tend to be unbiased. An important potential source of anticipation effects concerns enrollment in ordinary CT at dates that are given within the regular school system.
In that case, the starting date could coincide with the starting date of the school term, which is easier to anticipate. However, in the data, the starting dates for training programs are fairly evenly spread over the year. In most weeks, around 2 percent of the programs start, but there are typically a few weeks in January and September where enrollment is somewhat higher (up to 5 percent). This is probably, to some extent, a catch-up from low enrollment in the preceding weeks of the summer and Christmas holidays (where the enrollment rate is below 0.5 percent). It is clear that the assumption of no or limited anticipatory effects is fundamental to this study (and to any other study estimating causal relationships between ALMPs and labor market outcomes using
nonexperimental data in systems such as the Danish one, where there is some compulsory component in the ALMPs). With the data currently at hand, we as econometricians observe only the moment of actual participation and not what is essentially the real treatment (or at least part of it), namely the information shock (which of course cannot be anticipated); that is, the real moment of treatment is the moment at which the information about future participation arrives. Up until now, information on announcements has not been available in Denmark.$^{11}$ See also Black, Smith, Berger, and Noel (2003) on the threat effects of training.

Finally, let us note that interviews with caseworkers indicate that although the so-called plans of action are drawn up between caseworkers and clients (the unemployed) within the first three to six months, the dates of program participation stated in these plans are followed only to a limited extent. A caseworker can, and is known to, call clients on a Friday and require them to show up in a program on the following Monday. We stress that this is purely anecdotal evidence, and future research should be able to shed light on the time from announcement to program start as well as on the selection processes taking place in between.
4. RESULTS

This section reports the results of estimating our model. To explore the possibility of program-effect heterogeneity, we estimate the full model on 11 different subpopulations, namely, men and women separately, five different age groups, and four different subgroups defined by educational attainment. We only show the treatment effects on unemployment, employment, and wages, while the effects of personal characteristics as well as the estimated baseline hazards are tabulated in the appendix. To allow for a flexible baseline hazard, we have chosen 12 different pieces with varying lengths in the main unemployment and employment equations. For the competing risks into the four different training schemes we have chosen 10 pieces, again with different lengths.

We first consider the effects of participation on unemployment duration. These estimated model parameters are not straightforward to compare to the ‘standard’ evaluation literature, so next we focus on the implied ATET for unemployment and employment durations, calculated using the formula above. Finally, the treatment effects on the wage rate in subsequent employment are reported in Section 4.5. Section 4.6 discusses the sensitivity of the results with respect to some specification issues.
4.1. Unemployment Duration and Program Participation

Table 4 reports the estimated treatment effects during and after training participation for the 11 different divisions of the sample. The effects of public OJT and the residual training category are negative both during participation and afterwards. This is the case whether the sample is partitioned into men and women, educational subgroups, or age groups. For example, for men who have completed public OJT, the hazard from unemployment to employment is only 73.7 percent of what it would have been had the individuals not participated. The results give clear evidence of locking-in effects and negative postprogram effects for these two training schemes. Thus, the programs unambiguously prolong the duration of unemployment for the participants, with a downward shift in the hazard of more than 66 percent in some cases while participation takes place and of between 10 and 35 percent afterwards.

Table 4. Proportional Change in Hazard from Unemployment into Employment Due to Program Participation, 1995–2000.

| Subgroup | Private OJT: During | Private OJT: After | Public OJT: During | Public OJT: After | Ordinary CT: During | Ordinary CT: After | Residual Programs: During | Residual Programs: After |
|---|---|---|---|---|---|---|---|---|
| Men | 0.900 (0.073) | 0.724 (0.025) | 0.319 (0.024) | 0.737 (0.020) | 0.508 (0.018) | 1.126 (0.025) | 0.289 (0.018) | 0.657 (0.019) |
| Women | 1.144 (0.116) | 0.830 (0.032) | 0.338 (0.019) | 0.858 (0.018) | 0.485 (0.016) | 1.497 (0.030) | 0.303 (0.020) | 0.765 (0.021) |
| Age below 25 | 1.017 (0.183) | 0.793 (0.042) | 0.285 (0.044) | 0.877 (0.052) | 0.327 (0.021) | 0.883 (0.046) | 0.418 (0.047) | 0.816 (0.030) |
| Age 25–29 | 1.487 (0.239) | 0.912 (0.052) | 0.457 (0.058) | 0.829 (0.037) | 0.248 (0.013) | 0.954 (0.034) | 0.314 (0.036) | 0.722 (0.040) |
| Age 30–39 | 1.361 (0.167) | 0.865 (0.047) | 0.523 (0.045) | 0.924 (0.031) | 0.295 (0.013) | 1.006 (0.027) | 0.385 (0.036) | 0.772 (0.034) |
| Age 40–49 | 1.260 (0.197) | 0.696 (0.056) | 0.550 (0.053) | 0.868 (0.036) | 0.334 (0.019) | 0.967 (0.031) | 0.449 (0.048) | 0.943 (0.050) |
| Age above 49 | 1.499 (0.251) | 0.871 (0.121) | 0.390 (0.045) | 1.035 (0.055) | 0.614 (0.045) | 1.375 (0.063) | 0.509 (0.074) | 0.843 (0.089) |
| Basic schooling | 1.662 (0.157) | 0.930 (0.045) | 0.495 (0.034) | 0.927 (0.026) | 0.369 (0.014) | 0.968 (0.024) | 0.459 (0.033) | 0.804 (0.027) |
| High school | 1.631 (0.909) | 0.929 (0.094) | 0.480 (0.120) | 0.880 (0.087) | 0.244 (0.021) | 1.042 (0.064) | 0.330 (0.059) | 0.757 (0.057) |
| Vocational | 1.534 (0.151) | 0.832 (0.040) | 0.464 (0.035) | 0.903 (0.030) | 0.323 (0.013) | 1.009 (0.026) | 0.411 (0.034) | 0.845 (0.034) |
| College and beyond | 1.545 (0.385) | 0.966 (0.120) | 0.516 (0.079) | 0.980 (0.055) | 0.285 (0.018) | 1.007 (0.040) | 0.368 (0.054) | 0.892 (0.072) |

Note: Asymptotic standard errors appear in parentheses. Bold numbers are statistically different from 1 at a 5% level (bold face is not reproduced here).

Private OJT, on the contrary, is seen to have a positive effect on the exit rate into employment while participating, i.e., no locking-in is taking place.$^{12}$ This is probably explained by the fact that participants in private job training tend to continue in their jobs with the same employer under normal conditions. According to Table 4, this effect is stronger for those in the higher age brackets, for persons having no formal education, and for individuals with a vocational education. As opposed to this, the postprogram effect of private OJT is negative for most groups. This means that those participants who do not go directly from the training program to employment are worse off compared to the case of no job training. That is, there are indications that private OJT creates skills that are not useful anywhere but at the specific employer. The negative postprogram effects of private and public job training and of the residual program could be explained by a scenario where the unemployed, having finished their programs, narrow their job search to jobs where these newly acquired skills can be utilized.

Ordinary CT is by far the most widely used ALMP (see Fig. 1 and Table 1). Not surprisingly, CT gives rise to the strongest locking-in effects, with the hazard lowered in some cases by 75 percent. This indicates that the search intensity for jobs is severely reduced while participating in CT, such that the hazard rate out of unemployment is reduced. The postprogram effects of CT are found to be positive for only a few subgroups, and negative for youth; CT is nevertheless the only program with positive postprogram effects on the unemployment hazard. The positive effect is largest for women, with the hazard out of unemployment up some 50 percent.
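The entries in Table 4 are proportional changes in the hazard, so a value of 1 means no effect. The following sketch shows one way to read an entry: it converts a ratio into a percentage change and computes a two-sided Wald statistic against 1 from the reported asymptotic standard error. Testing directly on the ratio scale with a critical value of 1.96 is an assumption made here; the chapter does not state how the significance markers were computed.

```python
def percent_change(ratio):
    """Percentage change in the hazard implied by a proportional effect."""
    return 100.0 * (ratio - 1.0)


def wald_vs_one(estimate, std_err, critical=1.96):
    """Two-sided Wald test of H0: ratio = 1 (assumed test, ratio scale)."""
    z = (estimate - 1.0) / std_err
    return z, abs(z) > critical


# Men, public OJT, after participation: 0.737 (0.020) in Table 4.
z_public, significant_public = wald_vs_one(0.737, 0.020)

# Men, private OJT, during participation: 0.900 (0.073) in Table 4.
z_private, significant_private = wald_vs_one(0.900, 0.073)
```

Under these assumptions, the postprogram effect of public OJT for men (a hazard about 26 percent lower) is far from 1 in standard-error units, whereas the during-program effect of private OJT for men is not distinguishable from 1.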
4.2. Heterogeneity in Baseline Outcomes for Standard Persons

We now turn our attention to the calculation of treatment effects that are easier to compare with the ‘standard’ evaluation literature. That is, in Table 5 we report the calculated ATET, along with the estimated baseline expected unemployment and employment durations for standard persons, $E(Y^0)$, the estimated baseline expected unemployment and employment durations for the groups who end up participating in the different programs, $E(Y^0 \mid d = m)$, $m \in \{1, 2, 3, 4\}$, and the respective sample sizes used in calculating these parameters. The calculations are based on estimated parameters from the full-blown model available in the appendix. We use the empirical or unconditional distribution of unobservables in calculating $E(Y^0)$ and the conditional distribution of unobservables among participants (see Eq. (13)). To avoid extrapolating the expectations into areas of too thin support, all durations are calculated over a seven-year period only. The calculations are not too sensitive to this censoring, since most durations in our sample have a fairly limited range.

Table 5. Expected Unemployment and Employment Durations (Weeks).$^{a}$

Program columns report $E(Y^0 \mid d = m)$, with $[E(Y^m - Y^0 \mid d = m)]$ in square brackets.

| Subgroup | Outcome | $E(Y^0)$ | Private OJT | Public OJT | Ordinary CT | Residual Programs |
|---|---|---|---|---|---|---|
| Men$^{b}$ | Unemployment | 39 | 34 [4] | 53 [16] | 64 [2]$^{c}$ | 46 [18] |
| | Employment | 177 | 157 [21] | 136 [20] | 152 [1] | 166 [2] |
| | (Sample size) | (47,382) | (607) | (1,006) | (3,976) | (1,550) |
| Women$^{b}$ | Unemployment | 46 | 40 [2] | 51 [14] | 83 [8]$^{c}$ | 47 [19] |
| | Employment | 197 | 173 [21] | 161 [9] | 162 [2] | 181 [6] |
| | (Sample size) | (54,881) | (532) | (1,989) | (6,600) | (1,649) |
| Age below 25$^{d}$ | Unemployment | 21 | 45 [3] | 51 [15] | 31 [6] | 41 [12] |
| | Employment | 182 | 170 [16] | 138 [5] | 139 [21] | 168 [0] |
| | (Sample size) | (1,7867) | (166) | (274) | (1,099) | (503) |
| Age 25–29$^{e}$ | Unemployment | 24 | 65 [4] | 71 [13] | 43 [7] | 66 [20] |
| | Employment | 209 | 194 [4] | 177 [26] | 178 [5] | 170 [3] |
| | (Sample size) | (26,512) | (234) | (475) | (1,881) | (599) |
| Age 30–39$^{f}$ | Unemployment | 32 | 60 [0]$^{c}$ | 74 [7] | 50 [6] | 77 [19] |
| | Employment | 195 | 196 [8] | 187 [22] | 176 [3] | 188 [4] |
| | (Sample size) | (36,567) | (357) | (898) | (3,204) | (1,038) |
| Age 40–49$^{b}$ | Unemployment | 59 | 51 [5] | 92 [13] | 59 [7] | 89 [14] |
| | Employment | 175 | 121 [37] | 164 [16] | 161 [3] | 182 [12] |
| | (Sample size) | (22,882) | (223) | (696) | (2,347) | (671) |
| Age above 49$^{g}$ | Unemployment | 88 | 96 [4] | 124 [13] | 117 [10]$^{c}$ | 110 [24] |
| | Employment | 145 | 139 [82] | 143 [34] | 127 [30] | 145 [85] |
| | (Sample size) | (18,941) | (159) | (650) | (2,055) | (389) |
| Basic schooling$^{b}$ | Unemployment | 43 | 64 [1] | 80 [17] | 76 [1] | 69 [14] |
| | Employment | 197 | 163 [19] | 147 [14] | 134 [7] | 173 [6] |
| | (Sample size) | (36,491) | (453) | (1,512) | (4,369) | (1,550) |
| High school$^{b}$ | Unemployment | 47 | 58 [3] | 68 [11] | 47 [5] | 51 [15] |
| | Employment | 167 | 204 [7] | 169 [3] | 181 [12] | 190 [2] |
| | (Sample size) | (7,799) | (59) | (145) | (779) | (244) |
| Vocational$^{b}$ | Unemployment | 44 | 72 [7]$^{c}$ | 78 [13] | 54 [7] | 69 [18] |
| | Employment | 195 | 162 [21] | 162 [19] | 165 [5] | 176 [8] |
| | (Sample size) | (42,073) | (495) | (1,036) | (3,816) | (1,012) |
| College and beyond$^{b}$ | Unemployment | 40 | 64 [4] | 82 [9] | 49 [12] | 68 [17] |
| | Employment | 174 | 230 [7] | 174 [6] | 180 [2] | 198 [17] |
| | (Sample size) | (18,488) | (127) | (275) | (1,531) | (370) |

Notes: Significant effect at 5% level. Bracketed entries are reproduced as they appear in this copy; minus signs and significance markers are not preserved.
a Numbers in parentheses below the estimates are the respective sample sizes. Average durations and starting dates are used for the different programs, i.e., private job training starts after 29 weeks of open unemployment and lasts for 25 weeks, public job training starts after 31 weeks and lasts 34 weeks, ordinary education starts after 30 weeks and lasts 23 weeks, and the residual program starts after 30 weeks and lasts for 44 weeks.
b Calculations of average duration are based on a person who starts the $u$ spell in 1998, has no children, is unmarried, is between 40 and 49 years old, is a Danish citizen, resides in a larger city, has 20 years of labor market experience, has a vocational education, receives 0.80 in UI compensation, had 35 remaining weeks of UI benefits when the spell commenced, and works in metal. See text for calculations of average baselines among the treated.
c Denotes locking-in effect, but positive postprogram effect.
d As footnote b but with 5 years of experience.
e As footnote b but with 8 years of experience.
f As footnote b but with 15 years of experience.
g As footnote b but with 25 years of experience.

We first offer some comments on the heterogeneity in baseline outcomes for standard persons (see the first column of Table 5). Women in our sample are on average unemployed for one and a half months longer than the corresponding group of males (46 weeks compared to 39 weeks), but when they do find jobs they tend to keep them for a longer period of time, around four months longer. This latter result may come from the fact that women are more likely than men to self-select into the public sector, where job protection is typically higher. Turning to the calculations based on the age-specific subgroup estimations, the expected duration of unemployment increases over the subgroups: whereas a standard person below 25 years can expect to be unemployed for four months (21 weeks), the expected duration among those above 50 years is as high as 20 months. For the expected employment durations, however, the age profile is more concave, with the young and the old having expected durations of around three years, whereas individuals in their 30s are expected to stay employed for almost four years.
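The seven-year truncation described above can be made concrete with a short sketch. For a piecewise constant hazard, the expected duration truncated at a horizon equals the integral of the survivor function up to that horizon, which has a closed form on each segment. The code below also averages over a two-point heterogeneity distribution. The 364-week horizon used in the example (7 times 52), the function names, and the numerical rates are illustrative assumptions, not values from the chapter.

```python
import math


def truncated_expected_duration(cutpoints, rates, horizon):
    """E[min(T, horizon)] = integral of S(t) over [0, horizon].

    cutpoints split the time axis into segments and rates holds the
    constant hazard on each segment (one more rate than cutpoints).
    """
    edges = [0.0] + [c for c in cutpoints if c < horizon] + [horizon]
    survival = 1.0   # S(lo) at the start of the current segment
    expected = 0.0
    for lo, hi, rate in zip(edges[:-1], edges[1:], rates):
        length = hi - lo
        if rate > 0.0:
            expected += survival * (1.0 - math.exp(-rate * length)) / rate
        else:
            expected += survival * length
        survival *= math.exp(-rate * length)
    return expected


def mixture_expected_duration(cutpoints, rates, horizon, type_shift, prob_shift):
    """Average over a two-point distribution: V = 0 or V = type_shift.

    Under proportional hazards, V = type_shift multiplies every segment
    rate by exp(type_shift); prob_shift is the probability of that type.
    """
    base = truncated_expected_duration(cutpoints, rates, horizon)
    shifted_rates = [r * math.exp(type_shift) for r in rates]
    shifted = truncated_expected_duration(cutpoints, shifted_rates, horizon)
    return (1.0 - prob_shift) * base + prob_shift * shifted


# Example: constant weekly hazard of 0.02, truncated at 364 weeks.
example = truncated_expected_duration([], [0.02], 364.0)
```

Replacing the unconditional type probability `prob_shift` with a conditional probability for participants yields the corresponding baseline expectation for the treated, in the spirit of the calculation that the text attributes to Eq. (13).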
There is not much difference in expected unemployment duration across the different educational groups, but, somewhat surprisingly, the expected employment duration among the unskilled and among those with vocational education is more than four months longer than for the group with at least a college degree. The reason is as follows: whereas the estimated baseline hazards and the value of the estimated unobserved component in the employment equation, $\hat{\nu}_e$, are not substantially different across the subpopulations, the estimated probability of being a high type among the employed (those with $V_e = 0$, or low transition probabilities out of employment) is 50 percent higher in the populations with lower educational levels (around 0.3) than among the college educated (0.20), which drives this peculiar result. It is important to stress here again that we base our estimation sample on the flow of people into unemployment. In this respect, our results are not representative of the population of labor market participants as a whole; the average durations are almost certainly ‘worse’ than among the general working population in Denmark.
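The type probabilities discussed above are unconditional. Eq. (13) updates them for an individual with a given labor market history by Bayes' rule. The sketch below shows the scalar two-point case only; the chapter's version stacks all type combinations in vectors, and the likelihood values used here are made-up illustrative numbers.

```python
def posterior_type_probability(prior, lik_type, lik_zero):
    """Posterior probability that V_j = nu_j given the observed history.

    prior    : unconditional probability of V_j = nu_j
    lik_type : likelihood of the observed history if V_j = nu_j
    lik_zero : likelihood of the observed history if V_j = 0
    """
    numerator = prior * lik_type
    return numerator / (numerator + (1.0 - prior) * lik_zero)


# Illustrative values: a history twice as likely under V_j = nu_j.
posterior = posterior_type_probability(0.3, 0.02, 0.01)
```

When the observed history is equally likely under both types, the posterior equals the prior, so only histories that discriminate between types shift the distribution of unobservables among participants.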
4.3. Heterogeneity in Baseline Outcomes for Participants

The next set of results concerns heterogeneity in baseline outcomes for participants (see the first column for each of the different program types). Comparing the expected unemployment durations of the participants and the standardized ‘controls,’ there appears to be a rather modest positive selection of men and women into private OJT relative to the standard persons (34 weeks vs. 39, and 40 weeks vs. 46). For expected employment durations, it is evident that those who end up in, e.g., the private OJT program have worse expected baseline outcomes, the differences being more than five months or some 15 percent. The picture for the nine remaining subpopulations confirms and even strengthens the impression that selection into private OJT takes place among the unemployed with expected ‘poor’ baseline outcomes. It is only among high school and college graduates that there appears to be a positive selection into private OJT compared to the corresponding standard persons. The calculations among high school graduates are based on 56 observations only and should, therefore, be interpreted with much caution. Whereas we noted in the previous section that unemployed college graduates on average had a poorer distribution of unobservables than their counterparts with vocational and basic education, we also note here that, for the group of participating college graduates, the estimated probability of $V_e = 0$ is more than twice as high among private OJT participants as in the population of controls.

The negative selection into public OJT is even more pronounced than for private OJT. Participants in public OJT often have up to two months longer expected baseline spells of unemployment, coupled with shorter expected employment durations, than their already ‘selected’ private counterparts.

For the two remaining program types, the same negative selection appears, albeit not to the same extent as for public OJT. Two notable exceptions are participants in ordinary CT among those above 50 years and among those with only basic schooling. Here the expected employment duration in the absence of any program is even poorer than among the publicly job trained. Possible explanations for these selective differences are obsolete skills among the older workers and the lack of a formal qualifying exam among the latter group, necessitating participation in the CT program for those who would otherwise perform poorly on the labor market.
4.4. Average Program Effects among Participants

The results in square brackets in Table 5 are the estimated ATET for the respective subgroups. Turning first to the effects of private OJT, we see that despite a program participation period of six months, there is hardly any prolonging effect on the expected duration of unemployment. For some groups the expected unemployment duration is even shortened; this is the case for the age group of 25–29 years, those above 50, and the subgroup with vocational education. The main effect comes from a positive effect on the transition probability from unemployment to employment while the programs take place, as explained in Section 4.1, i.e., participants probably tend to continue in their jobs with the same employer under normal conditions. It is also evident from the table that the two oldest subgroups gain the most in terms of expected employment duration, with more than eight months for those between 40 and 49 years and one and a half years for those above 50, while people below 30 are found to gain hardly anything. If ‘learning begets learning,’ then this difference between the age groups could come from the fact that those in their 40s and 50s have accumulated more human capital over the years in the form of regular job training, thus making it easier for them to benefit from the job training program. Another explanation could be that, since the possibility for older workers to withdraw from the labor market in the mid-1990s in Denmark was rather lucrative and many workers in fact did leave, those who ended up staying in the labor market were more motivated and had a higher expected gain than their younger program-participating counterparts.

For public OJT, the results are mostly depressing. Despite the already negative selection into this program, the effect of participation seems to be that of prolonging the unemployment spells by up to almost 17 weeks (basic schooling). Moreover, once participants do find a job, they often end up keeping it for shorter periods as a result of participation in public OJT. The only group with an estimated positive employment outcome from this type of program is the subgroup of those above 50 years, who gain seven months in terms of employment. However, program participation had a prolonging effect of three months on the spell of unemployment, leaving the overall or net effect only modestly positive. This latter deviating result, although at first surprising, could stem from the fact that the participants here are systematically more motivated than their younger counterparts because of the aforementioned outflow of the less easily trained, or from the fact that they end up in systematically different public jobs.
For the ordinary CT program, the effects are mostly as expected when it comes to unemployment duration. As shown in Section 4.1, there is a significant locking-in effect for all subpopulations under study, but most groups experience a positive postprogram effect. Among the unemployed with sufficiently long expected baseline unemployment durations (the subgroups of women and older workers), the postprogram effect is large enough to overcome the initial negative locking-in effect, such that the overall effect on expected unemployment duration is actually positive. With respect to the effects of CT on the subsequent spells of employment, we see that those with low levels of human capital, i.e., the young and those with only basic schooling, gain up to 21 weeks from participation, or three times the size of the effect on unemployment. The only other group of unemployed who gain from participation in CT is, again, the older workers.

Finally, the impact of the residual group of programs is a rather substantial prolonging of the expected duration of unemployment for all subgroups. Again, this is not surprising given the length and nature of these programs (see Section 2). These programs benefit only a few groups in terms of subsequent employability, namely, those with much labor market experience, women, and the college educated. Only for the subpopulation of participants above 50 years is the employment effect large enough to more than match the initial negative effect on unemployment duration.
4.5. Treatment Effects on Wage Rates Turning to the last set of results – the effects of program participation on hourly wage rates – we find a remarkably consistent negative impact of ALMP participation, cf. Table 6. In summary, we see that men are punished harder than women, often more than twice as hard. Coupled with the evidence from Table 5, we conclude that participants in private job training experience an increase in employability, but at the cost of a lower hourly wage rate. This wage drop also applies to participants in public job training, but they were also found to have lower expected employment duration. The youth were found to be the only group to benefit from ordinary CT when using expected employment duration as a measure of success. However, this comes with a decline in wages of more than 6 percent. These surprisingly consistent negative hourly wage rate effects may have different reasons. Firstly, in the way we have specified our wage equation, nothing allows us to identify ‘obsolete’ human capital, and we equate the value of accumulated labor market experience prior to participation among
Program Participation, Labor Force Dynamics, and Accepted Wage Rates
Table 6. Percentage Change in Hourly Wage Rate Due to Program Participation, 1995–2000.

Subgroup             Private OJT      Public OJT       Ordinary CT      Residual Programs
Men                  0.072 (0.009)    0.078 (0.008)    0.048 (0.007)    0.054 (0.008)
Women                0.028 (0.010)    0.032 (0.007)    0.028 (0.006)    0.035 (0.007)
Age below 25         0.034 (0.017)    0.017 (0.017)    0.062 (0.016)    0.045 (0.011)
Age 25–29            0.051 (0.014)    0.057 (0.010)    0.024 (0.010)    0.033 (0.011)
Age 30–39            0.056 (0.015)    0.058 (0.008)    0.043 (0.007)    0.042 (0.010)
Age 40–49            0.029 (0.019)    0.050 (0.012)    0.026 (0.010)    0.064 (0.013)
Age above 49         0.093 (0.022)    0.027 (0.014)    0.034 (0.013)    0.047 (0.021)
Basic schooling      0.048 (0.010)    0.043 (0.007)    0.035 (0.008)    0.048 (0.007)
High school          0.004 (0.031)    0.072 (0.025)    0.033 (0.018)    0.017 (0.021)
Vocational           0.034 (0.011)    0.047 (0.008)    0.033 (0.007)    0.028 (0.009)
College and beyond   0.092 (0.023)    0.067 (0.014)    0.041 (0.012)    0.027 (0.017)

Note: Asymptotic standard errors appear in parentheses.
the trained with the value of experience among nonparticipants. But if participants end up taking one of the four types of training because of such depreciation of their pretraining accumulated human capital, then we will clearly underestimate the effect of training on hourly wages. Another explanation could be a stigmatizing effect. We evaluate the effects of the programs in a period when they were rather new tools in the Danish economy. Hence, employers may have held a (potentially false) common belief about these newly introduced programs. That is, having participated in any kind of government (co)sponsored training program could be perceived as nothing but a signal of lower-than-average productivity. In that case, the average wage rate would come out lower. Even among the firms that took in the unemployed for job training, there is no reason to offer higher wages once the training is over, since the revealed productivity of the newly trained is only available to the
JAKOB ROLAND MUNCH AND LARS SKIPPER
firm and, therefore, carries no outside value to the participant. Some participants may trade off the expected lower hourly wage rate against an expected improvement in employment outlook, resulting in an overall increase in earnings as a consequence of participation.13 However, we should also note that the stigma explanation has an intrinsic asymmetric information problem: it must explain why the unemployed value these programs14 and self-select into them from very early on in their spells (recall Fig. 1), while employers do not value participation. See Gerfin et al. (2005) for an elaborate discussion of further issues connecting heterogeneity in results with theoretical arguments.
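The statement that men are penalized more heavily than women can be checked directly against the first two rows of Table 6. The following sketch computes the ratio of the male to the female point estimate for each program, together with a simple z-statistic for their difference. It uses the magnitudes as printed in Table 6 and assumes, for illustration only, that the estimates for men and women are independent. The function name is hypothetical.

```python
import math

# (estimate, asymptotic standard error) as printed in Table 6.
TABLE6 = {
    "Private OJT": {"Men": (0.072, 0.009), "Women": (0.028, 0.010)},
    "Public OJT": {"Men": (0.078, 0.008), "Women": (0.032, 0.007)},
    "Ordinary CT": {"Men": (0.048, 0.007), "Women": (0.028, 0.006)},
    "Residual Programs": {"Men": (0.054, 0.008), "Women": (0.035, 0.007)},
}


def compare_men_women(table):
    """Return {program: (ratio men/women, z-statistic of the difference)}."""
    results = {}
    for program, rows in table.items():
        (men, se_men), (women, se_women) = rows["Men"], rows["Women"]
        ratio = men / women
        # Assumes independent estimates, so the variances simply add.
        z = (men - women) / math.sqrt(se_men ** 2 + se_women ** 2)
        results[program] = (ratio, z)
    return results


for program, (ratio, z) in compare_men_women(TABLE6).items():
    print(f"{program}: ratio {ratio:.2f}, z {z:.2f}")
```

On these figures the male estimate is more than twice the female estimate for the two OJT programs, and between one and two times as large for ordinary CT and the residual programs.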
4.6. Sensitivity with Respect to Specifications
Our full-blown model includes several hundred parameters and, as explained in Section 3, 128 mass points in the distribution of the unobservables. It should be evident that, given our dataset, the computational burden involved in estimating this model renders extensive sensitivity analysis almost impossible. However, we did implement a number of alternative specification checks. With respect to observables, we estimated the model with experience discretized. The estimated treatment effects were insensitive to this change. The UI seniority variable was not included in the initial estimations. Recall that this variable, to a large extent, captures information on previous labor market dynamics because of the way it is constructed (see Section 2). Its inclusion moved the estimated coefficients toward zero, in some instances dramatically. This indicates that information on previous durations and labor market attachment in general is crucial. This is in full accordance with Gerfin and Lechner (2002), where alignment of labor market histories among treated and controls takes place in multiple dimensions. The model was initially estimated without the unobservables being correlated across states. Not surprisingly, introducing correlation of the unobservables mattered for the coefficients of the treatment indicators. We should also note that, in the estimation procedure, many of the 128 combinations of the unobserved components were estimated to be either numerically zero or to have very low t-statistics. As we tested the model down and eliminated many of the potential types, the results on the treatment indicators did not change.
Clearly, given the way our model is specified, there is a risk that the results depend too heavily on functional form. However, the known instability and problems of identifying both the piecewise constant hazards and the points of support in the mixing distribution are presumably alleviated to some extent by the presence of multiple spells. If the unobservables are fixed across spells for the same individual, this multiple-spell feature of our data should greatly facilitate identification of the mixing distribution. For a discussion of the advantages of multiple spells in a recent application of a mixed proportional hazard model, see, e.g., Abbring, van den Berg, and van Ours (2005).
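To make the identification argument concrete, the sketch below writes out the likelihood contribution of one individual in a generic mixed proportional hazard model with a piecewise constant baseline and a discrete mixing distribution. It is a simplified single-transition illustration, not the multi-state model with 128 mass points estimated here. The interval cut points follow the row labels of Table A1, while the function names, the hazard level, and the two mass points are hypothetical.

```python
import math

# Interval cut points in weeks, following the row labels of Table A1.
CUTS = [0, 4, 8, 12, 16, 20, 25, 35, 52, 79, 104, 156]


def integrated_baseline(t, log_hazards, cuts=CUTS):
    """Integral of the piecewise constant baseline hazard from 0 to t."""
    total = 0.0
    for k, lower in enumerate(cuts):
        if t <= lower:
            break
        upper = cuts[k + 1] if k + 1 < len(cuts) else math.inf
        total += math.exp(log_hazards[k]) * (min(t, upper) - lower)
    return total


def baseline_hazard(t, log_hazards, cuts=CUTS):
    """Baseline hazard in the interval that contains t (t >= 0)."""
    k = max(i for i, lower in enumerate(cuts) if lower <= t)
    return math.exp(log_hazards[k])


def individual_likelihood(spells, log_hazards, xb, mass_points, probs):
    """Likelihood of all spells of one individual.

    spells: list of (duration, completed) pairs; completed=False means the
    spell is right censored. The unobservable v is held fixed across the
    individual's spells and integrated out over the discrete mass points.
    """
    total = 0.0
    for v, p in zip(mass_points, probs):
        scale = math.exp(xb + v)
        contribution = 1.0
        for duration, completed in spells:
            survivor = math.exp(-integrated_baseline(duration, log_hazards) * scale)
            density = baseline_hazard(duration, log_hazards) * scale if completed else 1.0
            contribution *= survivor * density
        total += p * contribution
    return total


# Hypothetical inputs: flat baseline of 0.1 per week and two unobserved types.
flat = [math.log(0.1)] * len(CUTS)
spells = [(10.0, True), (30.0, False)]  # one completed and one censored spell
print(individual_likelihood(spells, flat, xb=0.0,
                            mass_points=[-0.5, 0.5], probs=[0.6, 0.4]))
```

Because the same mass point multiplies every spell of an individual inside the sum, repeated spells carry information about the mixing distribution that a single spell cannot separate from the shape of the baseline hazard.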
5. CONCLUSION
The results of this chapter highlight the importance of considering more than the effects of ALMP participation on the duration of unemployment, which has been the usual approach in the literature evaluating training programs using duration models. Treatment effects on subsequent employment duration and wages should preferably also be taken into account, since these effects, as illustrated here, can be more important and less ambiguous than the effects on unemployment duration when evaluating labor market training programs. Concerning the impact of training on the expected duration of unemployment, we typically find evidence of locking-in effects. That is, while participating in programs, the unemployed are likely to reduce their effort to find a regular job. This effect is sometimes counteracted by positive postprogram effects, but overall the programs largely do not have quantitatively important effects on unemployment duration. It is hardly surprising that training is found to take time! However, concerning effects on the duration of subsequent employment, there is a clear prolonging impact for participants in private job training and the residual group of training programs, while the opposite is true for participants in public job training. The effect of ALMP participation on the hourly wage rate is also relatively unambiguous; typically, wages are reduced by 3–7 percent. We do not, however, estimate the effect of training participation on labor market earnings, so it is quite possible that earnings actually increase as a consequence of the training; that is, the increase in employability may be large enough to more than offset the decrease found in hourly wages for some of the subgroups and training schemes.
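The closing argument about earnings can be made precise with a little arithmetic. The sketch below assumes, purely for illustration, that earnings equal the hourly wage times hours worked and that hours worked scale proportionally with time in employment. Under that assumption it computes the relative increase in employment needed to offset a given wage reduction. The function names are hypothetical, and no earnings effect is estimated in this chapter.

```python
def breakeven_employment_gain(wage_drop):
    """Relative employment increase that leaves earnings unchanged.

    Solves (1 - wage_drop) * (1 + gain) = 1 for gain.
    """
    if not 0.0 <= wage_drop < 1.0:
        raise ValueError("wage_drop must lie in [0, 1)")
    return 1.0 / (1.0 - wage_drop) - 1.0


def earnings_change(wage_drop, employment_gain):
    """Relative change in earnings for a given wage drop and employment gain."""
    return (1.0 - wage_drop) * (1.0 + employment_gain) - 1.0


# Wage reductions at the two ends of the 3-7 percent range reported above.
for drop in (0.03, 0.07):
    gain = breakeven_employment_gain(drop)
    print(f"wage drop {drop:.0%}: break-even employment gain {gain:.2%}")
```

For wage reductions in the reported range, the required employment gain is only slightly larger than the wage drop itself, so a modest increase in time spent employed would be enough to leave earnings unchanged under these assumptions.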
NOTES
1. In this respect, we model training as a subspell of the unemployment spell, as opposed to Gritz (1993), Ham and LaLonde (1996), and Bonnal et al. (1997), who all model participation in training as distinct labor market states; see the discussion in the following text.
2. It is important to stress that we focus on the direct effects of the different programs. No attempt is made to evaluate any general equilibrium effects of these programs. In fact, such effects are assumed away.
3. Of course, the list of references here is by no means exhaustive. See Heckman, LaLonde, and Smith (1999), Martin and Grubb (2001), and Kluve and Schmidt (2002) for recent extensive surveys of studies of program effects in general.
4. The results and discussions in these references also highlight the importance of considering the long-run effects of programs that aim at augmenting human capital among participants, and show that one can end up drawing very different policy conclusions when not only transitions out of unemployment but also the effects on employment are taken into account.
5. After a reform in 2002, the unemployed are no longer required to participate in training programs for 75 percent of the time after the first year of unemployment. Instead, they are required to participate in a program every time they have had six consecutive months of unemployment.
6. Those below the age of 25 receive half of maximum UI benefits.
7. The Kaplan–Meier hazards (left scale) are calculated as the number of weekly entrants into the four different programs over the pool of potential entrants, namely, those still unemployed who have not yet entered one of the programs, i.e., the survivors (right scale).
8. We do not evaluate the reform from its initiation in 1994 because of data recording problems in this first year.
9. Only transitions from unemployment to employment and back are considered. Hence, transitions to other labor market states are treated as right censored.
10. We use the notation of the Roy–Rubin model extended to a situation with multiple programs, which is by now standard and is taken from Lechner (2001).
11. However, the relevant information about dates of announcements has been recorded during the last couple of years in AMANDA (the system used by the caseworkers in Denmark), and in future work we intend to couple this information with the data we already have (or an updated and newer sample of the population). Doing this, we will be able to quantify and assess the size of any bias in the current estimates.
12. Bear in mind that we do not have information on the intended length of the programs, i.e., it could be that people either exit the programs (and thus unemployment) during participation or that they find regular employment immediately upon completion. This fact also limits our ability to estimate the effects of the length and intensity of programs.
13. In a parallel study, we evaluate long-run effects of these programs on earnings and employment outcomes, and we actually do find an overall positive earnings effect for private and public OJT programs (see Jespersen, Munch, & Skipper, 2008).
14. Besides the possibility of a pure consumption motive behind participation and the fact that the two OJT programs lead to substantial increases in earnings while the program took place (see Section 2).
ACKNOWLEDGMENTS
Financial support from the Danish Social Science Research Council is gratefully acknowledged. A previous draft of this chapter was circulated under the title ‘The Consequences of Active Labour Market Programme Participation in Denmark.’ The chapter has been presented at EALE 2003 and at the Department of Economics, University of Maryland. We are grateful for helpful comments and suggestions from participants. We are particularly indebted to Michael Rosholm and an anonymous referee for many valuable comments and discussions. We would also like to thank Martin Browning, Michael Lechner, Helena Skyt Nielsen, and Editor Dann Millimet for their comments and suggestions toward the chapter.
REFERENCES
Abbring, J. H., & van den Berg, G. J. (2003). The non-parametric identification of treatment effects in duration models. Econometrica, 71, 1491–1571.
Abbring, J. H., van den Berg, G. J., & van Ours, J. C. (2005). The effect of unemployment insurance sanctions on the transition from unemployment to employment. The Economic Journal, 115, 602–630.
Bingley, P., Datta Gupta, N., & Pedersen, P. J. (2004). The impact of incentives on retirement in Denmark. In: J. Gruber & D. Wise (Eds), Social security programs and retirement around the world (pp. 153–234). NBER, Chicago: University of Chicago Press.
Black, D., Smith, J., Berger, M., & Noel, B. (2003). Is the threat of reemployment services more effective than the services themselves? Evidence from random assignment in the UI system. American Economic Review, 93, 1313–1327.
Bolvig, I., Jensen, P., & Rosholm, M. (2003). The employment effects of active social policy in Denmark. IZA Discussion Paper no. 736.
Bonnal, L., Fougère, D., & Sérandon, A. (1997). Evaluating the impact of French employment policies on individual labour market histories. Review of Economic Studies, 64, 683–713.
Card, D., & Hyslop, D. R. (2005). Estimating the effects of a time-limited earnings subsidy for welfare-leavers. Econometrica, 73, 1723–1770.
Eberwein, C., Ham, J. C., & LaLonde, R. J. (1997). The impact of being offered and receiving classroom training on the employment histories of disadvantaged women: Evidence from experimental data. Review of Economic Studies, 64, 655–682.
Gerfin, M., & Lechner, M. (2002). A microeconometric evaluation of the active labour market policy in Switzerland. The Economic Journal, 112, 854–893.
Gerfin, M., Lechner, M., & Steiger, H. (2005). Does subsidised temporary employment get the unemployed back to work? An econometric analysis of two different schemes. Labour Economics, 12, 807–835.
Gritz, R. M. (1993). The impact of training on the frequency and duration of employment. Journal of Econometrics, 57, 21–51.
Ham, J. C., & LaLonde, R. J. (1996). The effect of sample selection and initial conditions in duration models: Evidence from experimental data on training. Econometrica, 64, 175–205.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017–1098.
Heckman, J., LaLonde, R. J., & Smith, J. (1999). The economics and econometrics of active labor market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. 3A, pp. 1865–2097). Amsterdam: Elsevier Science.
Heckman, J., & Smith, J. (1999). The pre-program earnings dip and the determinants of participation in a social program: Implications for simple program evaluation strategies. Economic Journal, 109, 313–348.
Jespersen, S., Munch, J. R., & Skipper, L. (2008). Costs and benefits of Danish active labour market programs. Labour Economics (forthcoming).
Kluve, J., & Schmidt, C. M. (2002). Can training and employment subsidies combat European unemployment? Economic Policy, 17, 411–448.
Lalive, R., van Ours, J. C., & Zweimüller, J. (2002). The impact of active labor market programs on unemployment duration. Institute for Empirical Research in Economics Working Paper no. 41. University of Zurich.
Lancaster, T. (1990). The econometric analysis of transition data. Cambridge: Cambridge University Press.
Lechner, M. (1999). Earnings and employment effects of continuous off-the-job training in East Germany after unification. Journal of Business & Economic Statistics, 17, 74–90.
Lechner, M. (2000). An evaluation of public-sector-sponsored continuous vocational training programs in East Germany. The Journal of Human Resources, 35, 347–375.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In: M. Lechner & F. Pfeiffer (Eds), Econometric evaluation of labour market policies (pp. 43–58). Heidelberg: Physica.
Lechner, M. (2002a). Program heterogeneity and propensity score matching: An application to the evaluation of active labour market policies. The Review of Economics and Statistics, 84, 205–220.
Lechner, M. (2002b). Some practical issues in the evaluation of heterogeneous labour market programmes by matching methods. Journal of the Royal Statistical Society, A165, 59–82.
Lechner, M., Miquel, R., & Wunsch, C. (2004). Long-run effects of public sector sponsored training in West Germany. Discussion Paper 2004-19. Department of Economics, University of St. Gallen.
Martin, J. P., & Grubb, D. (2001). What works and for whom: A review of OECD countries’ experiences with active labour market policies. Swedish Economic Policy Review, 8, 9–56.
Mortensen, D. T. (1977). Unemployment insurance and job search decisions. Industrial and Labor Relations Review, 30, 505–517.
Richardson, K., & van den Berg, G. J. (2001). The effect of vocational employment training on the individual transition rate from unemployment to work. Swedish Economic Policy Review, 8, 175–213.
van den Berg, G. J. (2001). Duration models: Specification, identification, and multiple durations. In: J. Heckman & E. Leamer (Eds), Handbook of econometrics (Vol. 5, pp. 3381–3462). Amsterdam: North-Holland.
APPENDIX
Table A1. Variablesb
Effects of Socioeconomic Characteristics on the Hazard out of Unemployment, 1995–2000a. Men
Age Below 25
Age 25–29
Age 30–39
Age 40–49
3.172 (0.037) 3.233 (0.038) 3.430 (0.039) 3.557 (0.040) 3.656 (0.041) 3.741 (0.041) 3.764 (0.040) 3.955 (0.042) 4.039 (0.043) 4.347 (0.052) 4.379 (0.055)
3.775 (0.101) 3.712 (0.102) 3.822 (0.103) 3.838 (0.104) 3.863 (0.105) 3.877 (0.105) 3.938 (0.103) 4.159 (0.105) 4.139 (0.108) 4.422 (0.130) 4.244 (0.137)
3.696 (0.077) 3.630 (0.077) 3.774 (0.078) 3.829 (0.080) 3.952 (0.081) 4.027 (0.082) 3.992 (0.080) 4.118 (0.080) 4.031 (0.081) 4.085 (0.094) 3.979 (0.096)
4.058 (0.062) 3.991 (0.062) 4.139 (0.063) 4.233 (0.064) 4.318 (0.065) 4.369 (0.065) 4.377 (0.064) 4.476 (0.064) 4.471 (0.066) 4.593 (0.076) 4.418 (0.078)
3.935 (0.075) 3.987 (0.076) 4.150 (0.077) 4.197 (0.079) 4.291 (0.079) 4.395 (0.080) 4.375 (0.078) 4.454 (0.078) 4.431 (0.079) 4.654 (0.090) 4.625 (0.095)
Age Basic Above 49 Schooling
2.404 (0.083) 2.535 (0.085) 2.669 (0.086) 2.711 (0.088) 2.661 (0.090) 2.728 (0.090) 2.825 (0.089) 2.846 (0.090) 2.985 (0.092) 3.272 (0.103) 3.390 (0.111)
4.018 (0.051) 3.968 (0.051) 4.141 (0.052) 4.175 (0.053) 4.255 (0.054) 4.297 (0.054) 4.367 (0.053) 4.435 (0.053) 4.498 (0.054) 4.656 (0.063) 4.607 (0.066)
High School
4.223 (0.136) 4.145 (0.138) 4.312 (0.139) 4.361 (0.141) 4.500 (0.144) 4.555 (0.144) 4.445 (0.139) 4.741 (0.141) 4.543 (0.140) 4.582 (0.160) 4.457 (0.160)
Vocational College and Education Beyond
3.871 (0.054) 3.867 (0.055) 3.986 (0.055) 4.052 (0.056) 4.094 (0.056) 4.183 (0.057) 4.218 (0.056) 4.366 (0.056) 4.382 (0.058) 4.685 (0.068) 4.619 (0.071)
3.856 (0.111) 3.876 (0.111) 4.011 (0.112) 4.102 (0.114) 4.175 (0.114) 4.247 (0.115) 4.201 (0.113) 4.299 (0.113) 4.332 (0.114) 4.570 (0.129) 4.536 (0.134)
Log baseline hazards 0–4 weeks 4.391 (0.048) 4–8 weeks 4.344 (0.049) 8–12 weeks 4.479 (0.049) 12–16 weeks 4.512 (0.050) 16–20 weeks 4.583 (0.050) 20–25 weeks 4.665 (0.051) 25–35 weeks 4.749 (0.050) 35–52 weeks 4.887 (0.050) 52–79 weeks 4.967 (0.051) 79–104 weeks 5.163 (0.059) 104–156 weeks 5.071 (0.062)
Women
Yearly indicator 1995 1996 1997 1999 2000 Female Children aged less than 7 Children in household Age below 25 Age 25–29 Age 30–39 Age above 49 Married
4.177 4.148 (0.070) (0.213)
3.361 (0.099)
3.892 4.256 (0.085) (0.119)
3.176 (0.145)
4.412 (0.087)
3.964 (0.166)
4.111 (0.081)
3.757 (0.147)
0.567 (0.017) 0.437 (0.016) 0.345 (0.018) 0.002 (0.016) 0.091 (0.027)
0.605 (0.016) 0.455 (0.016) 0.356 (0.017) 0.111 (0.015) 0.030 (0.027)
0.672 (0.032) 0.466 (0.030) 0.341 (0.033) 0.051 (0.030) 0.031 (0.053)
0.609 (0.029) 0.460 (0.027) 0.376 (0.030) 0.059 (0.027) 0.065 (0.047)
0.417 (0.022) 0.372 (0.022) 0.296 (0.023) 0.028 (0.020) 0.088 (0.035)
0.496 (0.027) 0.435 (0.025) 0.364 (0.028) 0.034 (0.024) 0.048 (0.042)
0.682 (0.034) 0.473 (0.034) 0.376 (0.036) 0.123 (0.029) 0.071 (0.050)
0.584 (0.020) 0.433 (0.020) 0.312 (0.022) 0.023 (0.019) 0.063 (0.034)
0.495 (0.047) 0.368 (0.045) 0.263 (0.049) 0.100 (0.045) 0.014 (0.081)
0.558 (0.019) 0.451 (0.018) 0.374 (0.020) 0.071 (0.017) 0.006 (0.029)
0.342 (0.032) 0.290 (0.031) 0.250 (0.032) 0.056 (0.028) 0.014 (0.049)
–
–
0.179 (0.021) –
0.272 (0.020) –
0.292 0.153 (0.016) (0.019) 0.055 0.084 (0.018) (0.026) 0.103 0.158 (0.020) (0.017) – –
0.257 (0.025) –
0.365 (0.014) 0.193 (0.020) 0.125 (0.018) 0.403 (0.023) 0.106 (0.022) 0.079 (0.017) 0.506 (0.018) 0.019 (0.013) 0.487 (0.045)
0.137 (0.030) 0.221 (0.059) 0.075 (0.059) 0.466 (0.057) 0.232 (0.049) 0.143 (0.044) 0.465 (0.085) 0.094 (0.035) 0.515 (0.086)
0.289 (0.014) 0.171 (0.018) 0.143 (0.016) 0.531 (0.022) 0.298 (0.020) 0.169 (0.017) 0.508 (0.017) 0.065 (0.012) 0.759 (0.053)
0.023 (0.019) 0.223 (0.031) 0.173 (0.028) 0.677 (0.058) 0.446 (0.032) 0.181 (0.026) 0.420 (0.029) 0.029 (0.020) 0.504 (0.063)
0.036 (0.018) 0.171 (0.015) 0.596 (0.020) 0.341 (0.017) 0.140 (0.014) 0.337 (0.015) 0.057 (0.011) 0.504 (0.034)
0.252 (0.014) 0.032 0.352 (0.013) (0.028) 0.407 – (0.018) 0.199 – (0.016) 0.132 – (0.013) 0.495 – (0.014) 0.003 0.354 (0.009) (0.035) 0.524 0.482 (0.038) (0.080)
0.112 (0.020) –
0.258 (0.031) –
–
–
–
–
–
–
–
–
–
–
–
–
0.075 (0.021) 0.630 (0.057)
0.091 0.114 (0.014) (0.016) 0.453 0.543 (0.042) (0.067)
0.012 (0.022) –
Non-OECD citizenship
4.405 (0.071)
156– weeks
Table A1. (Continued ) Variablesb
Men
Age Below 25
Age 25–29
0.125 0.007 (0.015) (0.015) Smaller city 0.091 0.101 (0.011) (0.011) Experience/100 4.173 3.083 (0.216) (0.220) (Experience/100) 13.641 11.158 squared (0.658) (0.781) Basic schooling 0.160 0.166 (0.010) (0.010) High school 0.061 0.036 (0.019) (0.017) College and 0.024 0.175 beyond (0.017) (0.014) Not member of 0.083 0.135 union (0.014) (0.012) Replacement rate 0.268 0.015 of UI benefits (0.027) (0.028) Remaining weeks 3.793 3.094 in open (0.103) (0.089) unemployment/ 1000
0.071 (0.032) 0.157 (0.022) 1.736 (0.314) –
0.006 (0.026) 0.125 (0.021) 1.090 (0.172) –
0.249 (0.020) 0.137 (0.028) 0.028 (0.058) 0.083 (0.027) 0.014 (0.059) 4.420 (0.272)
0.281 (0.020) 0.037 (0.028) 0.149 (0.030) 0.100 (0.024) 0.225 (0.053) 3.322 (0.177)
0.275 (0.038)
0.296 (0.045)
Copenhagen
UI membership Construction
0.332 (0.019)
0.198 (0.038)
Age 30–39
Age 40–49
0.026 0.134 (0.022) (0.028) 0.124 0.036 (0.017) (0.020) 1.982 1.697 (0.137) (0.408) – 4.033 (1.408) 0.200 0.091 (0.016) (0.018) 0.002 0.020 (0.027) (0.037) 0.101 0.071 (0.023) (0.024) 0.165 0.200 (0.019) (0.025) 0.220 0.305 (0.040) (0.048) 2.508 3.135 (0.125) (0.162)
0.239 (0.041)
0.294 (0.052)
Age Basic Above 49 Schooling 0.039 (0.038) 0.080 (0.028) 1.866 (0.439) 8.623 (1.227) 0.101 (0.023) 0.038 (0.103) 0.168 (0.032) 0.002 (0.027) 0.376 (0.056) 4.187 (0.186)
0.702 (0.061)
High School
Vocational College and Education Beyond
0.198 (0.022) 0.152 (0.016) 0.719 (0.094) –
0.087 (0.040) 0.138 (0.035) 0.077 (0.282) –
0.084 (0.020) 0.096 (0.014) 0.083 (0.091) –
0.053 (0.025) 0.052 (0.022) 0.324 (0.148) –
–
–
–
–
–
–
–
–
–
–
–
–
0.155 (0.019) 0.216 (0.038) 3.026 (0.112)
0.057 (0.033) 0.022 (0.083) 2.296 (0.285)
0.065 (0.017) 0.160 (0.034) 3.505 (0.125)
0.161 (0.022) 0.286 (0.050) 2.149 (0.196)
0.254 (0.035)
0.080 (0.127)
0.513 (0.038)
–
Women
– 0.229 (0.052) 0.127 (0.037) 0.257 (0.048) 0.195 (0.041) 0.099 (0.052) 0.092 (0.040) 0.229 (0.081)
– 0.097 (0.044) 0.073 (0.034) 0.274 (0.045) 0.252 (0.038) 0.064 (0.045) 0.052 (0.036) 0.179 (0.052)
– 0.205 (0.053) 0.102 (0.042) 0.275 (0.051) 0.316 (0.047) 0.221 (0.057) 0.066 (0.045) 0.308 (0.064)
– 0.035 (0.069) 0.143 (0.053) 0.105 (0.060) 0.040 (0.057) 0.200 (0.075) 0.087 (0.055) 0.518 (0.069)
Treatment effects, ongoing Private OJT 0.102 (0.077) Public OJT 1.143 (0.076) CT 0.678 (0.036) Residual 1.241 (0.062)
0.135 0.130 (0.101) (0.164) 1.086 1.255 (0.055) (0.155) 0.724 1.117 (0.034) (0.064) 1.194 0.871 (0.067) (0.112)
0.497 (0.147) 0.784 (0.126) 1.396 (0.054) 1.159 (0.115)
0.293 0.144 (0.109) (0.156) 0.653 0.598 (0.083) (0.097) 1.222 1.098 (0.044) (0.056) 0.954 0.801 (0.093) (0.108)
Treatment effects, completed Private OJT 0.323 (0.034) Public OJT 0.305 (0.027) CT 0.119 (0.022)
0.186 0.232 (0.039) (0.053) 0.153 0.131 (0.021) (0.060) 0.403 0.125 (0.020) (0.052)
0.108 (0.054) 0.187 (0.045) 0.047 (0.036)
0.145 0.363 (0.055) (0.081) 0.079 0.141 (0.033) (0.041) 0.006 0.034 (0.027) (0.032)
Technical Business Academics Others Self-employed
– –
0.015 (0.016) 0.222 (0.023) 0.233 (0.024) 0.095 (0.029) 0.072 (0.020) 0.389 (0.033)
0.097 (0.018) 0.020 (0.027) 0.017 (0.016) 0.068 (0.025) 0.101 (0.015) 0.257 (0.031)
– 0.085 (0.043) 0.040 (0.023) 0.084 (0.037) 0.020 (0.027) –
– – –
– 0.132 (0.038) 0.127 (0.035) 0.101 (0.039) 0.030 (0.033) –
– –
0.064 (0.021) 0.260 (0.041)
0.014 (0.093) 0.189 (0.047) 0.238 (0.070) 0.160 (0.043) 0.236 (0.125)
0.151 (0.033) 0.277 (0.047)
0.197 (0.093) 0.227 (0.091) 0.380 (0.091) 0.168 (0.087) 0.204 (0.090) 0.444 (0.105)
0.516 (0.135) 0.996 (0.114) 0.474 (0.073) 0.829 (0.146)
0.347 (0.083) 0.921 (0.065) 0.740 (0.039) 0.957 (0.068)
0.438 (0.409) 0.735 (0.250) 1.409 (0.084) 1.108 (0.177)
0.642 (0.085) 0.767 (0.076) 1.130 (0.040) 0.889 (0.083)
0.435 (0.249) 0.570 (0.146) 1.256 (0.063) 1.000 (0.148)
0.138 (0.139) 0.035 (0.053) 0.319 (0.046)
0.151 (0.049) 0.138 (0.028) 0.160 (0.025)
0.113 (0.101) 0.170 (0.095) 0.010 (0.061)
0.107 (0.043) 0.104 (0.033) 0.009 (0.026)
0.034 (0.124) 0.020 (0.056) 0.007 (0.039)
– 0.065 (0.053) 0.056 (0.032) 0.069 (0.058) 0.012 (0.035) 0.067 (0.089) 0.020 (0.036) 0.490 (0.131)
Production
– –
Metal KAD (female)
Table A1. (Continued ) Variablesb
Residual Mass point uu
a
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Age Basic Above 49 Schooling
High School
Vocational College and Education Beyond
0.421 (0.029)
0.268 0.203 (0.027) (0.037)
0.326 (0.056)
0.259 0.059 (0.045) (0.053)
0.233 (0.105)
0.328 (0.035)
0.301 (0.077)
0.192 (0.040)
0.115 (0.080)
1.220 (0.030) 2.290
0.584 0.910 (0.016) (0.059) 1.868 2.441
1.184 (0.045) 2.122
1.214 1.295 (0.031) (0.034) 2.032 2.022
1.550 (0.025) 1.771
1.206 (0.021) 2.017
1.148 (0.067) 2.127
0.990 (0.020) 2.112
1.166 (0.041) 1.945
47,382
54,881
18,941
36,491
7,799
42,073
18,488
17,867
26,512
36,567
22,882
Asymptotic standard error appears in parentheses. The standard person had a spell starting in 1998, is a male, not married, between 40 and 49 years of age, has no children, from an OECD country, is living in one of the bigger cities, member of a union, is skilled, and has his UI in metal.
b
Mean loglikelihood Sample size
Men
Variables
Effects of Socioeconomic Characteristics on the Hazard into Private OJT, 1995–2000. Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Log baseline hazards 0–4 weeks 4.492 (0.333) 4–8 weeks 4.505 (0.342) 8–12 weeks 4.413 (0.338) 12–16 weeks 4.414 (0.348) 16–20 weeks 4.312 (0.340) 20–25 weeks 4.254 (0.349) 25–35 weeks 4.191 (0.333) 35–52 weeks 4.030 (0.349) 52–79 weeks 3.370 (0.341) 79– weeks 2.562 (0.388)
3.357 (0.358) 3.477 (0.358) 3.403 (0.363) 3.628 (0.370) 3.439 (0.369) 3.481 (0.371) 3.425 (0.368) 3.376 (0.364) 3.622 (0.403) 3.095 (0.407)
3.871 (0.781) 3.952 (0.790) 3.625 (0.777) 3.817 (0.796) 3.452 (0.840) 3.659 (0.800) 3.161 (0.785) 3.112 (0.814) 2.641 (0.933) 2.145 (1.071)
4.553 (0.604) 4.412 (0.631) 4.116 (0.640) 3.856 (0.645) 4.050 (0.693) 3.759 (0.651) 3.443 (0.642) 2.839 (0.624) 2.263 (0.651) 1.549 (0.669)
4.243 (0.526) 4.372 (0.506) 4.369 (0.514) 4.427 (0.508) 4.105 (0.533) 3.981 (0.516) 4.210 (0.506) 4.165 (0.517) 4.034 (0.537) 3.582 (0.572)
4.212 (0.734) 4.077 (0.718) 3.926 (0.715) 4.145 (0.753) 4.010 (0.751) 3.977 (0.728) 3.852 (0.702) 3.891 (0.703) 3.750 (0.733) 3.155 (0.700)
4.423 (0.592) 4.274 (0.603) 4.204 (0.619) 4.310 (0.613) 4.100 (0.570) 4.297 (0.618) 4.036 (0.561) 3.831 (0.586) 3.192 (0.614) 2.290 (0.599)
3.871 (0.381) 4.051 (0.379) 3.859 (0.381) 4.027 (0.386) 3.946 (0.393) 3.913 (0.393) 3.965 (0.383) 3.914 (0.394) 3.803 (0.420) 3.454 (0.476)
5.907 (1.669) 5.873 (1.661) 5.341 (1.745) 6.768 (2.387) 5.221 (1.708) 5.876 (1.713) 4.833 (1.570) 4.714 (1.680) 3.945 (1.640) 3.630 (2.244)
3.053 (0.500) 3.153 (0.502) 3.070 (0.498) 3.069 (0.502) 3.057 (0.517) 2.976 (0.508) 3.019 (0.505) 2.980 (0.501) 2.972 (0.517) 2.388 (0.518)
3.552 (1.145) 3.059 (1.133) 3.503 (1.163) 3.436 (1.153) 3.000 (1.103) 3.085 (1.082) 2.944 (1.112) 2.923 (1.141) 2.298 (1.098) 1.537 (1.161)
0.497 (0.156) 0.436 (0.208)
0.181 (0.395) 0.018 (0.451)
0.128 0.025 (0.243) (0.226) 0.364 0.038 (0.279) (0.251)
0.047 (0.267) 0.010 (0.318)
0.412 (0.329) 0.659 (0.403)
0.301 (0.164) 0.176 (0.205)
0.296 (0.872) 0.240 (1.051)
0.315 (0.204) 0.405 (0.232)
0.654 (0.399) 0.101 (0.455)
Yearly indicator 1995 1996
0.361 (0.162) 0.002 (0.194)
Age Basic Above 49 Schooling
High School
Vocational College and Education Beyond
Men
Table A2.
Table A2. (Continued ) Variables
1997 1999 2000
Women
Age Below 25
0.479 (0.225) 0.364 (0.197) 0.316 (0.406)
0.643 (0.236) 0.084 (0.190) 0.178 (0.403)
0.512 (0.548) 0.481 (0.514) 0.373 (0.994)
0.146 0.790 (0.435) (0.267) 0.049 0.210 (0.298) (0.236) 0.306 0.299 (0.704) (0.520)
0.666 (0.336) 0.040 (0.362) 0.110 (0.796)
0.624 (0.381) 0.223 (0.335) 0.016 (0.718)
0.640 (0.244) 0.102 (0.237) 0.011 (0.536)
0.737 (1.568) 0.075 (1.205) 0.460 (2.513)
0.752 (0.248) 0.157 (0.215) 0.146 (0.435)
0.492 (0.576) 0.438 (0.542) 0.322 (1.037)
–
0.135 (0.260) –
0.148 0.259 (0.220) (0.167) – 0.068 (0.181) 0.111 0.072 (0.204) (0.200) – –
0.160 (0.174) 0.494 (0.325) 0.055 (0.175) –
0.361 (0.196) – 0.074 (0.300) –
0.096 (0.118) 0.205 (0.159) 0.224 (0.142) 0.431 (0.166) 0.226 (0.173) 0.019 (0.131) 0.307 (0.177) 0.145 (0.116) 0.320 (0.470) 0.521 (0.211)
0.915 (0.761) 0.241 (1.224) 0.631 (1.713) 0.204 (1.054) 1.136 (0.998) 0.490 (1.126) 1.699 (1.851) 0.621 (1.240) 0.455 (2.637) 0.912 (0.659)
0.235 (0.112) 0.097 (0.189) 0.001 (0.165) 0.493 (0.180) 0.101 (0.153) 0.129 (0.132) 0.505 (0.159) 0.124 (0.116) 1.382 (0.549) 0.615 (0.183)
0.871 (0.238) 0.035 (0.509) 0.017 (0.450) 1.867 (0.600) 0.925 (0.394) 0.677 (0.410) 0.096 (0.435) 0.145 (0.255) 1.417 (1.216) 1.927 (0.340)
–
Children aged less 0.212 than 7 (0.180) Children in 0.179 household (0.156) Age below 25 0.646 (0.200) Age 25–29 0.660 (0.158) Age 30–39 0.482 (0.133) Age above 49 0.252 (0.148) Married 0.008 (0.115) Non-OECD 1.277 citizenship (0.328) Copenhagen 0.823 (0.151)
0.110 (0.132) 0.045 0.008 (0.128) (0.359) 0.498 – (0.186) 0.042 – (0.176) 0.018 – (0.137) 0.609 – (0.159) 0.000 0.490 (0.097) (0.438) 0.313 0.544 (0.690) (0.823) 0.088 0.335 (0.211) (0.533)
Age 25–29
Age 30–39
Age 40–49
Age Basic Above 49 Schooling
–
–
–
–
–
–
–
–
–
–
–
–
0.051 (0.162) 1.691 (1.115) 0.835 (0.293)
0.689 (0.236) –
0.455 0.037 (0.234) (0.132) 2.603 1.701 (0.523) (0.381) 1.657 0.727 (0.270) (0.223)
0.603 (0.529)
High School
Vocational College and Education Beyond
Female
Men
0.154 (0.109) Experience/100 1.153 (2.050) (Experience/100) 0.097 squared (6.247) Basic schooling 0.304 (0.100) High school 0.022 (0.183) College and 0.138 beyond (0.198) Not member of 0.034 union (0.124) Replacement rate 0.015 of UI benefits (0.260) Remaining weeks 4.194 in open (0.834) unemployment/ 1000 UI membership Construction Metal KAD (female) Production Technical
0.048 (0.230) – –
0.106 (0.273) 5.609 (3.058) –
0.184 (0.151) 2.230 (1.169) –
0.394 (0.274) 1.855 (3.637) 0.736 (10.163) 0.311 (0.173) 0.631 (1.112) 0.327 (0.364) 1.054 (0.245) 0.977 (0.541) 1.460 (1.995)
0.123 (0.146) –
0.312 (0.615) –
0.021 (0.122) –
0.305 (0.285) –
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.227 (0.144) 0.534 (0.347) 1.473 (0.864)
0.418 (0.810) 1.593 (1.860) 1.826 (5.132)
0.122 (0.164) 0.365 (0.290) 0.965 (0.955)
0.190 (0.277) 0.175 (0.682) 1.678 (2.938)
0.059 0.157 (0.092) (0.252) 0.027 0.705 (0.215) (0.341) 0.322 1.243 (0.228) (1.234) 0.179 0.822 (0.145) (0.343) 0.288 0.282 (0.296) (0.756) 4.909 5.588 (0.870) (2.726)
0.541 0.289 (0.203) (0.130) 1.489 0.049 (0.322) (0.284) 0.433 0.411 (0.429) (0.202) 1.359 0.030 (0.258) (0.166) 2.627 0.537 (0.638) (0.401) 7.576 1.835 (1.764) (1.087)
0.297 (0.206) 1.763 (4.959) 1.129 (15.758) 0.344 (0.186) 0.373 (0.382) 0.431 (0.349) 0.001 (0.222) 0.104 (0.449) 1.044 (1.466)
0.226 0.784 (0.285) (0.728) – –
0.185 (0.599) –
1.896 (0.459) –
2.587 (0.881) –
0.729 (2.142) –
0.529 (0.334) –
0.072 (2.674) –
0.475 (0.383) –
–
1.208 (0.465) 0.214 (0.437)
1.125 (0.296) 0.123 (0.353)
1.008 (0.481) 0.053 (0.500)
0.122 (0.537) 0.438 (0.346) 0.091 (0.409)
0.423 (0.610) 0.618 (0.396) 0.460 (0.410)
0.232 (0.335) 0.155 (0.208) 0.392 (0.394)
–
0.530 (0.361) 0.734 (0.347) 0.796 (0.364)
–
–
1.103 (0.613) 0.270 0.583 (0.169) (0.606) 0.002 0.892 (0.294) (1.248)
0.436 (0.182) 2.762 (1.639) –
– 1.177 (0.852)
–
0.181 (0.976) 0.641 (0.971)
0.600 (0.146) 0.899 (0.201)
0.023 (0.096) 2.006 (0.958) –
Program Participation, Labor Force Dynamics, and Accepted Wage Rates
Smaller city
Table A2. (Continued)
Variables
Women
Age Below 25
0.609 (0.188)
Administration 0.880 (0.265)
Academics 0.409 (0.263)
Others 1.041 (0.187)
Self-employed 1.076 (0.247)
0.203 (0.160) 0.381 (0.241) 0.691 (0.320) 0.332 (0.166) 0.498 (0.263)
0.513 (0.591) 2.166 (0.969) 1.226 (1.375) 0.962 (0.631) 1.642 (1.164)
Business
Mass point na1
4.441 (0.164)
Age 25–29
Age 30–39
0.289 (0.337) 0.052 (0.368)
1.567 (0.752) 0.283 (0.431)
0.762 (0.494) 0.204 (0.480)
1.152 (0.343) 0.213 (0.380)
0.993 (0.638) 0.185 (0.487)
Age 40–49
0.024 (0.348) 0.349 (0.456) 0.049 (0.529) 0.005 (0.405) 0.610 (0.526)
11.362 (204.728) 5.3883 (0.802) 5.407 (0.297) 6.500 (0.576) 11.001 (325.88)
Note: See the footnotes of Table A1.
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
0.229 (0.454) 1.404 (0.445) 0.230 (0.720) 0.093 (0.436) 1.558 (0.497)
0.262 (0.222) 0.476 (0.388) 0.214 (7.350) 0.299 (0.209) 0.148 (0.286)
1.078 (0.837) 0.175 (1.724) 1.049 (1.540) 0.481 (1.113) 1.456 (1.810)
0.738 (0.343) 0.760 (0.418)
0.052 (1.015) 0.969 (0.919) 0.562 (0.886) 0.630 (1.047) 0.989 (1.137)
6.134 (0.353)
11.272 (386.563)
6.652 (1.308)
7.310 (11.838)
6.150 (0.730)
0.580 (0.333) 0.867 (0.381) –
Men
Variables
Table A3. Effects of Socioeconomic Characteristics on the Hazard into Public OJT, 1995–2000.
Men
Log baseline hazards
0–4 weeks 5.027 (0.266)
4–8 weeks 4.992 (0.268)
8–12 weeks 4.835 (0.271)
12–16 weeks 4.777 (0.276)
16–20 weeks 4.679 (0.278)
20–25 weeks 4.639 (0.281)
25–35 weeks 4.414 (0.271)
35–52 weeks 4.402 (0.271)
52–79 weeks 3.752 (0.278)
79– weeks 2.747 (0.288)
Yearly indicator
1995
Age Below 25
Age 25–29
Age 30–39
Age 40–49
3.691 (0.652) 3.541 (0.660) 4.013 (0.674) 3.428 (0.671) 3.373 (0.673) 3.235 (0.648) 3.457 (0.662) 3.054 (0.670) 2.852 (0.734) 2.860 (0.953)
4.848 (0.483) 4.723 (0.480) 4.617 (0.494) 4.642 (0.486) 4.604 (0.499) 4.777 (0.513) 4.419 (0.494) 4.760 (0.505) 4.224 (0.513) 3.872 (0.563)
4.981 (0.420) 4.850 (0.417) 4.781 (0.419) 4.815 (0.427) 4.956 (0.432) 4.732 (0.426) 4.660 (0.423) 4.894 (0.422) 4.852 (0.422) 4.249 (0.429)
4.899 (0.370) 4.949 (0.369) 4.791 (0.374) 4.853 (0.367) 4.805 (0.382) 4.911 (0.385) 4.799 (0.377) 5.053 (0.375) 4.822 (0.368) 4.333 (0.368)
3.530 (0.354) 3.790 (0.359) 3.551 (0.358) 3.551 (0.360) 3.533 (0.365) 3.564 (0.362) 3.732 (0.355) 3.857 (0.356) 3.508 (0.349) 2.978 (0.352)
0.601 (0.082) 0.078 (0.269)
0.391 (0.108) 0.170 (0.318)
0.747 (0.196) 0.350 (0.227)
0.812 (0.129) 0.370 (0.165)
1.140 (0.148) 0.462 (0.192)
0.794 (0.145) 0.384 (0.192)
3.016 (0.174) 3.053 (0.177) 3.019 (0.176) 3.051 (0.176) 3.127 (0.179) 3.115 (0.178) 3.132 (0.174) 3.333 (0.173) 3.215 (0.172) 2.947 (0.160)
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
3.806 (0.166) 3.786 (0.171) 3.755 (0.175) 3.732 (0.176) 3.731 (0.181) 3.809 (0.182) 3.736 (0.174) 3.921 (0.175) 3.822 (0.173) 3.414 (0.176)
5.093 (0.812) 5.306 (0.846) 4.923 (0.856) 4.615 (0.863) 4.992 (0.901) 4.735 (0.831) 4.809 (0.820) 4.924 (0.857) 4.570 (0.875) 3.955 (0.952)
4.662 (0.260) 4.607 (0.262) 4.552 (0.268) 4.449 (0.267) 4.463 (0.273) 4.464 (0.271) 4.430 (0.263) 4.661 (0.268) 4.292 (0.268) 3.908 (0.283)
3.529 (4.161) 3.540 (4.148) 3.180 (4.164) 3.824 (4.146) 3.492 (4.144) 3.173 (4.144) 3.183 (4.151) 3.496 (4.154) 3.395 (4.156) 2.651 (4.165)
0.657 (0.093) 0.388 (0.120)
0.498 (0.370) 0.563 (0.424)
1.044 (0.118) 0.506 (0.154)
1.036 (0.273) 0.250 (0.312)
1996
0.901 (0.126) 0.710 (0.162)
Women
Table A3.
Table A3. (Continued)
Variables
1997 1999 2000
1.027 (0.175) 0.347 (0.147) 0.078 (0.297) –
Children aged less than 7 0.009 (0.123)
Children in household 0.066 (0.105)
Age below 25 0.368 (0.158)
Age 25–29 0.333 (0.119)
Age 30–39 0.391 (0.095)
Age above 49 0.033 (0.093)
Married 0.059 (0.072)
Non-OECD citizenship 0.368 (0.159)
Copenhagen 0.041 (0.107)
Women
0.928 (0.114) 0.142 (0.094) 0.198 (0.193) – 0.036 (0.059) 0.072 (0.058) 0.167 (0.092) 0.214 (0.076) 0.251 (0.058) 0.194 (0.062) 0.060 (0.039) 0.244 (0.131) 0.327 (0.085)
Age Below 25
Age 25–29
Age 30–39
Age 40–49
0.858 (0.347) 0.175 (0.357) 0.159 (0.714)
0.646 (0.294) 0.289 (0.270) 0.030 (0.541)
0.934 (0.182) 0.094 (0.179) 0.208 (0.389)
1.059 (0.210) 0.061 (0.190) 0.054 (0.383)
1.330 (0.183) 0.081 (0.146) 0.108 (0.284)
0.945 (0.135) 0.055 (0.115) 0.062 (0.232)
0.848 (0.515) 0.353 (0.567) 0.220 (1.150)
1.128 (0.169) 0.149 (0.152) 0.067 (0.308)
1.144 (0.384) 0.090 (0.303) 0.140 (0.594)
0.376 (0.176) –
0.537 (0.127) – 0.064 (0.117) –
0.136 (0.084) 0.133 (0.135) 0.073 (0.087) –
0.018 (0.100) –
0.205 (0.201) –
0.233 (0.085) 0.074 (0.087) 0.008 (0.111) –
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.058 (0.071) 0.527 (0.208) 0.375 (0.133)
0.073 (0.076) 0.514 (0.262) 0.174 (0.134)
0.142 (0.083) –
0.137 (0.061) 0.104 (0.079) 0.084 (0.075) 0.350 (0.109) 0.017 (0.087) 0.049 (0.072) 0.258 (0.066) 0.115 (0.048) 0.014 (0.163) 0.210 (0.101)
0.741 (0.255) 0.346 (0.423) 0.833 (0.450) 0.514 (0.395) 0.450 (0.391) 0.088 (0.348) 0.207 (0.631) 0.235 (0.267) 0.150 (0.451) 0.422 (0.330)
0.036 (0.075) 0.102 (0.115) 0.085 (0.101) 0.101 (0.121) 0.040 (0.117) 0.027 (0.085) 0.102 (0.085) 0.117 (0.064) 0.566 (0.264) 0.097 (0.125)
0.199 (0.153) 0.213 (0.204) 0.341 (0.220) 1.064 (0.560) 0.009 (0.211) 0.250 (0.174) 0.147 (0.192) 0.093 (0.142) 0.583 (0.322) 0.204 (0.211)
0.670 (0.219) 0.047 (0.111)
0.401 (0.505) 0.024 (0.180)
0.036 (0.293) 0.306 (0.172)
Age Above 49
Basic Schooling
0.019 (0.153) –
0.147 (0.161)
High School
Vocational Education
College and Beyond
Female
Men
0.269 (0.089)
Experience/100 0.153 (1.411)
(Experience/100) squared 0.444 (4.288)
Basic schooling 0.107 (0.074)
High school 0.154 (0.135)
College and beyond 0.167 (0.121)
Not member of union 0.045 (0.096)
Replacement rate of UI benefits 0.735 (0.200)
Remaining weeks in open unemployment/1000 4.413 (0.557)
UI membership
Construction
Metal
KAD (female)
Production
Technical
0.017 (0.041) 0.119 (0.092) 0.094 (0.075) 0.049 (0.062) 0.076 (0.135) 4.288 (0.354)
1.049 (0.178) – –
0.210 (0.194) – –
0.156 (0.135) 0.154 (0.165) 0.380 (0.181)
0.065 (0.065) 0.075 (0.106) 0.003 (0.065)
0.081 0.143 (0.193) (0.130) 0.605 2.493 (2.789) (1.154) – – 0.160 (0.118) 0.469 (0.203) 0.386 (0.252) 0.494 (0.173) 0.051 (0.369) 3.685 (0.768)
0.051 (0.079) 0.092 (0.176) 0.021 (0.143) 0.087 (0.104) 0.431 (0.240) 5.824 (0.633)
0.114 (0.106) 0.305 (2.059) 0.185 (7.270) 0.104 (0.085) 0.125 (0.210) 0.071 (0.130) 0.309 (0.111) 0.859 (0.246) 5.155 (0.680)
1.787 0.438 (0.547) (0.582) – – 1.624 0.125 (0.598) (0.376) 1.384 0.371 (0.486) (0.336) 1.786 0.753 (0.607) (0.443) 1.451 0.358 (0.500) (0.354)
0.661 (0.425) – 0.560 (0.361) 0.533 (0.344) 0.393 (0.352) 0.659 (0.338)
0.271 (0.348) – 0.450 (0.270) 0.210 (0.247) 0.340 (0.282) 0.405 (0.254)
0.158 (0.188) 0.072 (0.226) 0.528 (0.490) 0.296 (0.264) 0.315 (0.457) 1.850 (1.662)
0.047 (0.097) 1.199 (0.722) –
0.177 (0.110) 0.113 (1.575) 0.252 (4.583) 0.186 (0.078) 0.338 (0.533) 0.080 (0.161) 0.051 (0.119) 0.155 (0.250) 7.049 (0.598)
2.775 (0.282) – 0.086 (0.230) 0.258 (0.209) 0.240 (0.263) 0.178 (0.218)
0.289 (0.075) –
0.325 (0.267) –
0.124 (0.084) –
0.130 (0.163) –
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.167 (0.088) 0.143 (0.160) 4.210 (0.320)
0.816 (0.369) 0.601 (0.662) 2.452 (1.897)
0.280 (0.086) 0.586 (0.195) 6.107 (0.546)
0.259 (0.164) 1.391 (0.410) 5.610 (1.127)
0.086 (0.179) – 0.383 (0.317) 0.061 (0.081) 0.285 (0.194) 0.018 (0.098)
0.268 (1.236) – –
0.069 (0.204) – 0.042 (0.209) 0.200 (0.155) 0.280 (0.179) 0.093 (0.148)
–
– 0.545 (0.622) 0.088 (0.445)
– – 0.461 (0.615) 0.486 (0.607) 0.061 (0.591)
Business
0.076 (0.055) 0.038 (0.348) –
Smaller city
Table A3. (Continued)
Variables
Administration Academics Others
Mass point na2
Age 30–39
Age 40–49
0.666 (0.363) 0.485 (0.389) 0.566 (0.342) 0.747 (0.392)
0.375 (0.261) 0.187 (0.333) 0.304 (0.253) 0.618 (0.321)
0.364 (0.242) 0.529 (0.411) 0.106 (0.212) 0.274 (0.252)
3.576 (0.109) 13.631 (6021.569) 10.057 (347.209) 5.041 (0.438) 10.436 (367.556) 10.460 (191.109)
4.929 (0.289)
0.087 (0.186) 0.009 (0.219) 0.167 (0.153) 0.300 (0.194)
Women
0.097 (0.087) 0.035 (0.176) 0.008 (0.059) 0.016 (0.116)
Note: See the footnotes of Table A1.
Age Below 25
Age 25–29
0.872 (0.662) 0.018 (0.382)
0.146 (1.257) 0.469 (0.477)
1.304 (0.513) 0.489 (0.343)
2.348 (0.871) 0.968 (0.516)
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
0.358 (0.415) 2.847 (1.251) 0.019 (0.467) 1.667 (0.574)
0.016 (0.173) – 0.056 (0.156) 0.238 (0.202)
0.033 (0.597) 0.121 (0.579) 0.345 (0.601) 0.887 (0.689)
10.193 (357.749) 9.961 (410.988)
8.742 (47.914)
9.946 (133.307)
0.203 (0.125) – 0.036 (0.078) 0.304 (0.142)
Self-employed
Men
Variables
Table A4. Effects of Socioeconomic Characteristics on the Hazard into Classroom Training, 1995–2000.
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Log baseline hazards
0–4 weeks 3.350 (0.099)
4–8 weeks 3.408 (0.101)
8–12 weeks 3.266 (0.101)
12–16 weeks 3.379 (0.104)
16–20 weeks 3.315 (0.105)
20–25 weeks 3.366 (0.105)
25–35 weeks 3.338 (0.103)
35–52 weeks 3.382 (0.103)
52–79 weeks 3.276 (0.107)
79– weeks 2.997 (0.107)
9.998 (1.090) 9.986 (1.090) 9.925 (1.091) 10.027 (1.091) 10.096 (1.091) 10.100 (1.091) 10.111 (1.091) 10.276 (1.091) 10.163 (1.091) 9.882 (1.091)
3.351 (0.283) 3.259 (0.280) 3.145 (0.281) 3.000 (0.280) 2.838 (0.282) 2.910 (0.283) 2.464 (0.281) 2.121 (0.278) 1.792 (0.298) 1.803 (0.334)
2.122 (0.211) 2.091 (0.217) 1.967 (0.220) 1.919 (0.225) 1.826 (0.228) 1.745 (0.230) 1.536 (0.227) 1.363 (0.231) 1.017 (0.244) 0.726 (0.249)
2.589 (0.165) 2.417 (0.164) 2.243 (0.165) 2.173 (0.167) 2.125 (0.171) 1.973 (0.172) 1.879 (0.171) 1.840 (0.176) 1.480 (0.188) 1.232 (0.188)
2.161 (0.186) 2.228 (0.188) 1.935 (0.187) 2.031 (0.191) 1.960 (0.195) 1.851 (0.196) 1.744 (0.195) 1.653 (0.197) 1.353 (0.206) 0.995 (0.202)
7.874 (0.336) 7.801 (0.337) 7.663 (0.338) 7.934 (0.340) 7.888 (0.340) 7.970 (0.338) 7.931 (0.335) 7.931 (0.336) 7.802 (0.334) 7.385 (0.331)
3.484 (0.091) 3.503 (0.093) 3.323 (0.093) 3.502 (0.096) 3.533 (0.098) 3.570 (0.097) 3.494 (0.093) 3.614 (0.094) 3.483 (0.094) 3.152 (0.098)
3.694 (0.337) 3.756 (0.338) 3.707 (0.344) 3.476 (0.342) 3.412 (0.351) 3.265 (0.352) 2.997 (0.346) 2.986 (0.357) 2.842 (0.364) 2.413 (0.363)
2.825 (0.138) 2.613 (0.138) 2.502 (0.139) 2.453 (0.142) 2.408 (0.144) 2.350 (0.146) 2.318 (0.145) 2.200 (0.147) 1.949 (0.156) 1.658 (0.154)
2.431 (0.317) 2.556 (0.320) 2.255 (0.321) 2.282 (0.326) 2.113 (0.329) 1.987 (0.327) 1.894 (0.326) 1.833 (0.331) 1.518 (0.344) 1.132 (0.344)
0.474 (0.094) 0.331 (0.081) 0.098 (0.095)
0.360 (0.109) 0.030 (0.087) 0.255 (0.101)
0.382 (0.095) 0.831 (0.119)
0.258 (0.066) 0.284 (0.070)
0.074 (0.163) 0.117 (0.187)
0.274 (0.074) 0.054 (0.079)
0.103 (0.114) 0.095 (0.127)
Yearly indicator 1995 1996
0.157 (0.053) 0.023 (0.065)
0.467 (0.045) 0.776 (0.147)
0.455 (0.056) 0.164 (0.147)
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
Men
Table A4.
Table A4. (Continued)
Variables
1997 1999 2000
0.727 (0.071) 0.172 (0.054) 0.250 (0.107) –
Children aged less than 7 0.097 (0.048)
Children in household 0.085 (0.041)
Age below 25 0.586 (0.059)
Age 25–29 0.254 (0.046)
Age 30–39 0.087 (0.036)
Age above 49 0.130 (0.040)
Married 0.056 (0.028)
Non-OECD citizenship 0.004 (0.065)
Copenhagen 0.278 (0.039)
Women
Age Below 25
0.851 0.554 (0.060) (0.160) 0.074 0.077 (0.045) (0.130) 0.151 0.087 (0.084) (0.254) –
0.105 (0.088) –
0.020 (0.028) 0.049 0.207 (0.025) (0.094) 0.290 – (0.049) 0.271 – (0.035) 0.206 – (0.029) 0.259 – (0.027) 0.016 0.167 (0.019) (0.106) 0.184 0.107 (0.069) (0.206) 0.094 0.085 (0.036) (0.126)
Age 25–29
Age 30–39
Age 40–49
0.580 0.820 1.029 (0.115) (0.095) (0.112) 0.087 0.057 0.132 (0.090) (0.064) (0.079) 0.393 0.197 0.297 (0.189) (0.130) (0.155) 0.228 (0.064) –
0.212 (0.048) 0.089 (0.050) 0.325 0.077 (0.062) (0.057) – –
Age Above 49
Basic Schooling
Vocational Education
College and Beyond
1.139 (0.111) 0.000 (0.072) 0.030 (0.139)
0.941 (0.075) 0.083 (0.054) 0.103 (0.111)
0.788 (0.211) 0.134 (0.152) 0.239 (0.291)
0.936 (0.086) 0.152 (0.058) 0.340 (0.116)
0.844 (0.138) 0.054 (0.094) 0.187 (0.192)
0.048 (0.052) –
0.024 (0.034) 0.049 (0.036) 0.024 (0.042) 0.491 (0.055) 0.154 (0.047) 0.101 (0.039) 0.306 (0.042) 0.025 (0.025) 0.150 (0.064) 0.048 (0.044)
0.257 (0.106) 0.278 (0.153) 0.310 (0.158) 0.671 (0.169) 0.574 (0.146) 0.205 (0.136) 0.085 (0.209) 0.044 (0.096) 0.399 (0.172) 0.154 (0.132)
0.049 (0.039) 0.186 (0.054) 0.149 (0.049) 0.350 (0.066) 0.431 (0.053) 0.206 (0.046) 0.243 (0.047) 0.102 (0.035) 0.243 (0.109) 0.213 (0.059)
0.257 (0.059) 0.004 (0.086) 0.097 (0.075) 0.050 (0.140) 0.392 (0.098) 0.189 (0.071) 0.006 (0.075) 0.111 (0.057) 0.241 (0.179) 0.169 (0.078)
0.177 (0.050) 0.013 (0.070) 0.167 (0.046) –
0.118 (0.053) –
–
–
–
–
–
–
–
–
–
–
–
–
0.079 (0.059) 0.229 (0.040) 0.106 (0.043)
0.620 (0.137) 0.037 (0.090) 0.179 (0.139)
0.331 (0.083) 0.248 (0.062) 0.164 (0.071)
High School
0.076 (0.045) – 0.043 (0.081)
Female
Men
0.016 (0.030)
Experience/100 0.599 (0.557)
(Experience/100) squared 1.440 (1.695)
Basic schooling 0.064 (0.028)
High school 0.053 (0.053)
College and beyond 0.056 (0.043)
Not member of union 0.129 (0.034)
Replacement rate of UI benefits 0.103 (0.073)
Remaining weeks in open unemployment/1000 2.579 (0.193)
UI membership
Construction
Metal
KAD (female)
Production
Technical
0.349 (0.044) 0.211 (0.057) 0.169 (0.055)
0.055 (0.063) 1.140 (0.544) – 0.244 (0.060) 0.398 (0.086) 0.205 (0.104) 0.273 (0.077) 0.332 (0.164) 1.373 (0.445)
0.032 0.030 (0.050) (0.054) 1.770 0.425 (0.434) (1.126) – 0.037 (3.722) 0.131 0.055 (0.045) (0.047) 0.100 0.022 (0.077) (0.089) 0.162 0.016 (0.068) (0.066) 0.279 0.027 (0.056) (0.057) 0.193 0.619 (0.116) (0.129) 3.096 3.248 (0.351) (0.346)
0.114 (0.060) 0.268 (0.876) 1.667 (1.993) 0.063 (0.046) 0.241 (0.101) 0.136 (0.055) 0.128 (0.057) 0.131 (0.128) 7.534 (0.372)
0.062 (0.021) 0.203 (0.095)
0.005 (0.036) 0.293 (0.120)
0.069 (0.033) 0.199 (0.265)
0.011 (0.027) 0.003 (0.122)
0.039 (0.059) 0.529 (0.256)
4.895 (0.136) 0.796 (0.813)
0.053 (0.048) 0.006 (0.076) 4.182 (0.160)
0.172 0.045 (0.096) (0.236) – – – 0.540 (0.218) 0.082 0.444 (0.036) (0.167) 0.001 0.492 (0.049) (0.293) 0.034 0.409 (0.032) (0.169)
0.212 (0.192) – 0.391 (0.170) 0.568 (0.140) 0.766 (0.179) 0.427 (0.149)
0.182 (0.164) – 0.471 (0.120) 0.526 (0.104) 0.503 (0.124) 0.611 (0.110)
0.222 (0.176) – 0.566 (0.136) 0.433 (0.106) 0.291 (0.127) 0.326 (0.113)
0.178 (0.196) – 0.172 (0.104) 0.016 (0.078) 0.061 (0.090) 0.045 (0.088)
0.114 0.046 (0.032) (0.108) – –
0.054 (0.040) –
0.078 (0.068) –
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.004 (0.112) 0.259 (0.296) 1.125 (0.783)
0.180 (0.047) 0.251 (0.097) 3.959 (0.264)
0.150 (0.060) 0.143 (0.141) 3.161 (0.466)
0.529 0.076 (0.090) (0.602) – – 0.271 – (0.111) 0.058 – (0.047) 0.151 0.421 (0.082) (0.236) 0.098 0.308 (0.054) (0.169)
0.390 (0.116) – 0.396 (0.107) 0.009 (0.096) 0.133 (0.104) 0.066 (0.090)
– – – 0.598 (0.305) 0.365 (0.295) 0.447 (0.295)
Business
0.049 (0.064) – –
0.082 (0.094) 5.566 (1.371) –
0.103 (0.024) 0.087 (0.162) –
Smaller city
Table A4. (Continued)
Variables
Men
Mass point na3
0.552 (0.039)
Age Below 25
0.002 (0.038) 0.061 (0.258)
0.040 (0.055) 0.550 (0.384)
0.067 (0.031) 0.321 (0.175)
0.175 (0.063) 1.048 (0.396)
6.720 (1.087) 2.225 (0.123)
Note: See the footnotes of Table A1.
Age 25–29
Age 30–39
Age 40–49
0.041 (0.168) 0.118 (0.190) 0.364 (0.142) 0.964 (0.290)
0.164 (0.121) 0.170 (0.139) 0.170 (0.114) 0.608 (0.159)
0.174 (0.117) 0.357 (0.136) 0.224 (0.110) 0.582 (0.156)
2.118 (0.091) 2.126 (0.071) 2.143 (0.078)
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
0.111 (0.099) –
0.011 0.703 (0.074) (0.217) – 0.432 (0.246) 0.035 0.619 (0.038) (0.174) 0.184 0.481 (0.077) (0.400)
0.217 (0.091) 0.008 (0.116)
0.105 (0.287) 0.284 (0.287) 0.508 (0.293) 0.977 (0.320)
4.264 (0.306) 12.881 (10184.497) 1.874 (0.157)
1.953 (0.063)
2.052 (0.098)
0.072 (0.088) 0.020 (0.113) 0.058 (0.085) 0.101 (0.107)
Administration 0.058 (0.060)
Academics 0.060 (0.069)
Others 0.232 (0.052)
Self-employed 0.422 (0.073)
Women
Variables
Table A5. Effects of Socioeconomic Characteristics on the Hazard into Residual Programs, 1995–2000.
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Log baseline hazards
0–4 weeks 4.469 (0.247)
4–8 weeks 4.664 (0.245)
8–12 weeks 4.580 (0.250)
12–16 weeks 4.280 (0.249)
16–20 weeks 4.422 (0.253)
20–25 weeks 3.991 (0.246)
25–35 weeks 3.960 (0.245)
35–52 weeks 3.749 (0.247)
52–79 weeks 3.452 (0.258)
79– weeks 2.662 (0.262)
3.444 (0.210) 3.775 (0.210) 3.677 (0.208) 3.668 (0.210) 3.684 (0.210) 3.669 (0.207) 3.810 (0.204) 3.976 (0.206) 4.048 (0.208) 3.681 (0.213)
4.807 (0.450) 5.319 (0.449) 5.044 (0.448) 4.561 (0.449) 4.981 (0.464) 4.744 (0.458) 4.903 (0.464) 4.469 (0.470) 3.808 (0.576) 1.600 (0.611)
4.697 (0.377) 4.857 (0.380) 4.701 (0.387) 4.463 (0.382) 4.624 (0.388) 4.122 (0.380) 4.052 (0.369) 4.160 (0.382) 3.694 (0.394) 2.853 (0.405)
4.773 (0.322) 4.823 (0.312) 4.682 (0.325) 4.588 (0.324) 4.527 (0.325) 4.335 (0.321) 4.280 (0.317) 4.142 (0.321) 3.790 (0.334) 2.773 (0.358)
4.453 (0.370) 4.799 (0.372) 4.843 (0.365) 4.493 (0.371) 4.644 (0.372) 4.368 (0.371) 4.521 (0.362) 4.614 (0.360) 4.717 (0.371) 3.898 (0.373)
3.905 (0.727) 3.785 (0.734) 3.567 (0.733) 3.978 (0.732) 3.442 (0.734) 3.544 (0.732) 3.834 (0.715) 3.714 (0.734) 3.640 (0.742) 3.152 (0.753)
3.652 (0.179) 3.998 (0.181) 3.755 (0.183) 3.681 (0.185) 3.734 (0.188) 3.681 (0.189) 3.931 (0.184) 4.018 (0.184) 4.099 (0.188) 3.634 (0.187)
5.336 (0.692) 5.692 (0.690) 5.718 (0.690) 5.171 (0.686) 5.390 (0.695) 5.427 (0.703) 5.012 (0.673) 4.784 (0.663) 4.448 (0.697) 3.543 (0.853)
4.844 (0.299) 4.990 (0.296) 4.947 (0.296) 4.796 (0.300) 4.870 (0.308) 4.544 (0.301) 4.621 (0.300) 4.844 (0.297) 4.780 (0.303) 4.241 (0.311)
6.819 (1.250) 6.643 (1.249) 6.901 (1.269) 6.568 (1.257) 6.685 (1.255) 6.142 (1.254) 6.179 (1.252) 5.966 (1.245) 5.918 (1.257) 5.604 (1.256)
0.358 (0.101) 0.744 (0.209)
0.412 (0.117) 0.723 (0.223)
0.182 (0.165) 0.477 (0.195)
0.737 (0.131) 0.758 (0.152)
0.601 (0.159) 0.757 (0.176)
0.636 (0.219) 0.637 (0.264)
0.506 (0.095) 0.631 (0.111)
0.218 (0.270) 0.298 (0.285)
0.582 (0.135) 0.614 (0.147)
0.630 (0.221) 0.710 (0.262)
Yearly indicator 1995 1996
0.504 (0.108) 0.595 (0.124)
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
Men
Table A5.
Table A5. (Continued)
Variables
1997 1999 2000
0.955 (0.136) 0.533 (0.135) 0.318 (0.268) –
Children aged less than 7 0.053 (0.114)
Children in household 0.151 (0.099)
Age below 25 0.625 (0.112)
Age 25–29 0.021 (0.097)
Age 30–39 0.014 (0.078)
Age above 49 0.732 (0.095)
Married 0.075 (0.067)
Non-OECD citizenship 0.157 (0.129)
Copenhagen 0.019 (0.091)
Women
Age Below 25
0.911 0.435 (0.135) (0.278) 0.235 0.242 (0.108) (0.249) 0.257 0.312 (0.255) (0.502) –
0.121 (0.131) –
0.141 (0.071) 0.009 0.150 (0.067) (0.145) 0.386 – (0.082) 0.131 – (0.086) 0.131 – (0.065) 0.175 – (0.077) 0.016 0.844 (0.050) (0.177) 0.077 0.057 (0.183) (0.376) 0.203 0.337 (0.083) (0.191)
Age 25–29
Age 30–39
Age 40–49
Age Above 49
Basic Schooling
High School
Vocational Education
College and Beyond
0.827 (0.231) 0.417 (0.227) 0.006 (0.463)
1.034 (0.166) 1.218 (0.207)
0.565 (0.172) 0.304 (0.182)
0.293 (0.358) 0.517 (0.408)
1.260 (0.326) 0.069 (0.234) 0.641 (0.627)
0.872 (0.133) 0.098 (0.111) 0.285 (0.243)
0.336 (0.399) 0.659 (0.384) 0.744 (0.773)
0.970 (0.174) 0.346 (0.162) 0.332 (0.346)
1.208 (0.289) 0.475 (0.283) 0.488 (0.619)
0.203 (0.108) –
0.203 0.192 (0.078) (0.089) 0.047 0.007 (0.097) (0.123) 0.014 0.255 (0.097) (0.090) – –
0.052 (0.138) –
0.202 (0.050) 0.265 (0.079) 0.128 (0.072) 0.844 (0.086) 0.145 (0.072) 0.024 (0.070) 0.418 (0.086) 0.011 (0.057) 0.347 (0.142) 0.254 (0.088)
0.192 (0.178) 0.725 (0.355) 0.146 (0.360) 1.154 (0.281) 0.133 (0.276) 0.089 (0.241) 0.896 (0.492) 0.008 (0.222) 0.502 (0.331) 0.033 (0.231)
0.032 (0.078) 0.210 (0.107) 0.062 (0.099) 0.531 (0.118) 0.274 (0.116) 0.341 (0.084) 0.325 (0.104) 0.171 (0.072) 0.374 (0.171) 0.024 (0.115)
0.268 (0.129) 0.107 (0.208) 0.119 (0.196) 0.040 (0.483) 0.152 (0.193) 0.088 (0.151) 0.587 (0.184) 0.007 (0.131) 0.439 (0.317) 0.184 (0.182)
0.104 (0.106) –
0.119 (0.146) –
–
–
–
–
–
–
–
–
–
–
–
–
0.232 (0.099) 0.393 (0.228) 0.386 (0.152)
0.103 (0.077) 0.255 (0.078)
0.254 (0.163) 0.416 (0.258)
0.081 (0.116) 0.152 (0.140)
0.088 (0.116) – 0.186 (0.264)
Female
Men
0.248 (0.076)
Experience/100 2.039 (1.283)
(Experience/100) squared 1.408 (4.286)
Basic schooling 0.013 (0.063)
High school 0.003 (0.117)
College and beyond 0.180 (0.120)
Not member of union 0.396 (0.072)
Replacement rate of UI benefits 0.000 (0.166)
Remaining weeks in open unemployment/1000 1.033 (0.445)
UI membership
Construction
Metal
KAD (female)
Production
Technical
0.084 (0.123) 1.073 (0.971) –
0.045 (0.049) 0.476 (0.143)
0.308 (0.087) 0.462 (0.195)
0.046 (0.099) 1.054 (0.869)
0.113 (0.050) 0.167 (0.125)
0.119 (0.165) 1.973 (0.394)
0.924 (0.404) 9.217 (1.204)
0.648 (0.158) – –
0.821 (0.257) – –
0.389 (0.116) 0.100 (0.157) 0.405 (0.158)
0.010 (0.065) 0.057 (0.136) 0.025 (0.070)
0.892 (0.386) – 0.960 (0.394) 0.754 (0.339) 0.812 (0.484) 0.349 (0.356)
0.267 (0.103) 0.174 (0.169) 0.099 (0.205) 1.269 (0.129) 0.186 (0.268) 2.645 (0.741)
0.216 0.284 (0.082) (0.101) 0.700 2.127 (0.690) (1.982) – 0.589 (7.372) 0.158 0.171 (0.083) (0.086) 0.259 0.083 (0.150) (0.149) 0.114 0.095 (0.140) (0.143) 0.163 0.350 (0.075) (0.100) 0.350 0.287 (0.217) (0.241) 0.289 2.510 (0.540) (0.682)
0.086 (0.173) 1.383 (2.010) 0.394 (6.038) 0.010 (0.110) 0.194 (0.593) 0.004 (0.188) 0.031 (0.138) 0.337 (0.346) 5.929 (0.979)
0.034 (0.371) – 0.396 (0.289) 0.382 (0.246) 0.363 (0.332) 0.860 (0.277)
0.061 0.793 (0.287) (0.393) – – 0.416 0.520 (0.217) (0.265) 0.075 0.633 (0.189) (0.226) 0.445 0.163 (0.231) (0.280) 0.370 0.031 (0.202) (0.253)
1.142 (0.725) – 0.043 (0.698) 0.198 (0.658) 0.481 (0.667) 0.198 (0.674)
0.107 (0.058) –
0.105 (0.197) –
0.136 (0.084) –
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.156 (0.138) 0.116 (0.163) 0.103 (0.355)
0.294 (0.181) 1.311 (0.512) 0.786 (1.414)
0.247 (0.073) 0.067 (0.195) 1.042 (0.592)
0.269 (0.118) 0.420 (0.322) 0.400 (0.981)
0.177 (0.201) – 0.397 (0.194) 0.105 (0.082) 0.655 (0.164) 0.663 (0.104)
1.344 (1.086) – –
0.632 (0.229) – 0.052 (0.211) 0.092 (0.181) 0.011 (0.202) 0.105 (0.178)
–
– 0.640 (0.330) 0.302 (0.370)
0.244 (0.159) –
– – 0.064 (1.314) 0.258 (1.316) 0.707 (1.321)
Business
0.111 0.179 (0.054) (0.145) 0.629 2.120 (0.435) (1.454) – –
Smaller city
Table A5. (Continued)
Variables
Administration Academics Others
Mass point na4
Women
Age Below 25
0.022 (0.177) 0.119 (0.186) 0.101 (0.131) 0.271 (0.162)
0.048 (0.105) 0.243 (0.469)
0.161 (0.171) 0.160 (1.075)
0.006 (0.064) 0.752 (0.352)
0.037 (0.108) 0.115 (0.425)
3.290 (0.092)
11.848 (871.765) 4.992 (0.383)
Note: See the footnotes of Table A1.
Age 25–29
Age 30–39
Age 40–49
0.301 (0.678) 0.961 (0.817) 0.324 (0.666) 0.646 (0.680)
Vocational Education
College and Beyond
0.280 0.005 (0.154) (0.465) – 0.214 (0.560) 0.199 0.343 (0.085) (0.351) 0.344 0.254 (0.125) (0.547)
0.082 (0.175) 0.064 (0.201)
0.267 (1.310) 0.200 (1.306) 0.380 (1.312) 0.030 (1.325)
3.346 (0.205) 8.524 (109.378) 9.229 (174.907) 14.153 (15916.145) 10.895 (588.103) 4.884 (0.617)
6.535 (9.008)
9.268 (199.353)
0.928 (0.330) 1.637 (0.371) 0.558 (0.255) 0.995 (0.368)
0.176 (0.220) 0.064 (0.277)
0.059 (0.247) 0.009 (0.300)
0.042 (0.197) 0.265 (0.241)
0.074 (0.249) 0.216 (0.267)
Age Above 49
Basic Schooling
High School
0.223 (0.207) –
Self-Employed
Men
Variables
Table A6. Effects of Socioeconomic Characteristics on the Hazard out of Employment, 1995–2000.
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Log baseline hazards
0–4 weeks 2.157 (0.043)
4–8 weeks 2.623 (0.045)
8–12 weeks 2.859 (0.045)
12–16 weeks 2.970 (0.047)
16–20 weeks 3.117 (0.048)
20–25 weeks 3.143 (0.048)
25–35 weeks 3.135 (0.046)
35–52 weeks 3.232 (0.046)
52–79 weeks 3.990 (0.048)
79–104 weeks 4.024 (0.050)
104–156 weeks 4.407 (0.051)
156– weeks 4.721 (0.053)
3.699 (0.047) 4.493 (0.049) 4.808 (0.050) 4.929 (0.051) 5.101 (0.053) 5.077 (0.052) 5.190 (0.050) 5.531 (0.050) 5.971 (0.051) 6.166 (0.055) 6.399 (0.054) 6.538 (0.056)
3.775 (0.077) 4.218 (0.079) 4.414 (0.080) 4.484 (0.081) 4.574 (0.082) 4.629 (0.082) 4.713 (0.079) 4.888 (0.080) 5.594 (0.080) 5.714 (0.084) 5.948 (0.083) 6.243 (0.085)
3.567 (0.077) 4.149 (0.078) 4.396 (0.080) 4.519 (0.082) 4.628 (0.083) 4.637 (0.082) 4.737 (0.079) 4.997 (0.078) 5.546 (0.080) 5.647 (0.085) 5.983 (0.085) 6.262 (0.089)
3.685 (0.061) 4.312 (0.063) 4.599 (0.064) 4.698 (0.065) 4.863 (0.064) 4.847 (0.062) 4.963 (0.062) 5.131 (0.064) 5.726 (0.068) 5.907 (0.067) 6.177 (0.071) 6.449 (0.067)
3.564 (0.073) 4.228 (0.075) 4.485 (0.077) 4.639 (0.078) 4.880 (0.081) 4.873 (0.081) 4.897 (0.077) 5.085 (0.076) 5.713 (0.079) 5.679 (0.082) 6.173 (0.083) 6.299 (0.085)
Age Above 49
Basic Schooling
High School
3.160 (0.082) 3.926 (0.084) 4.314 (0.086) 4.405 (0.088) 4.595 (0.091) 4.627 (0.091) 4.436 (0.085) 4.701 (0.086) 5.426 (0.091) 5.416 (0.097) 5.680 (0.097) 5.990 (0.106)
3.689 (0.064) 4.251 (0.065) 4.538 (0.066) 4.687 (0.067) 4.796 (0.068) 4.768 (0.066) 4.798 (0.067) 4.984 (0.065) 5.639 (0.067) 5.719 (0.070) 6.007 (0.070) 6.216 (0.073)
3.335 (0.218) 4.024 (0.222) 4.424 (0.224) 4.383 (0.225) 4.575 (0.225) 4.770 (0.227) 4.763 (0.223) 5.099 (0.223) 5.671 (0.225) 5.800 (0.228) 5.920 (0.227) 6.054 (0.228)
Vocational Education
College and Beyond
3.236 (0.051) 3.919 (0.053) 4.126 (0.053) 4.215 (0.054) 4.413 (0.056) 4.427 (0.055) 4.470 (0.053) 4.600 (0.053) 5.265 (0.055) 5.324 (0.057) 5.686 (0.057) 5.965 (0.060)
3.651 (0.124) 4.432 (0.125) 4.696 (0.127) 4.811 (0.128) 5.015 (0.130) 5.050 (0.130) 5.166 (0.127) 5.521 (0.127) 6.033 (0.129) 6.253 (0.133) 6.571 (0.134) 6.682 (0.135)
Men
Table A6.
Table A6. (Continued)
Variables
Yearly indicator 1995 1996 1997
2000 Female
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
0.267 (0.017) 0.102 (0.017) 0.030 (0.017) 0.053 (0.019) 0.123 (0.035)
0.166 (0.017) 0.082 (0.017) 0.009 (0.017) 0.077 (0.019) 0.182 (0.034)
0.148 (0.031) 0.037 (0.031) 0.046 (0.032) 0.050 (0.037) 0.045 (0.073)
0.178 (0.028) 0.067 (0.029) 0.002 (0.029) 0.022 (0.033) 0.046 (0.063)
0.203 (0.022) 0.093 (0.022) 0.028 (0.023) 0.062 (0.024) 0.124 (0.045)
0.184 (0.025) 0.041 (0.026) 0.038 (0.026) 0.066 (0.028) 0.127 (0.053)
0.225 (0.029) 0.145 (0.029) 0.058 (0.029) 0.004 (0.031) 0.231 (0.054)
0.161 (0.019) 0.039 (0.019) 0.012 (0.020) 0.051 (0.022) 0.209 (0.041)
0.240 (0.049) 0.122 (0.047) 0.084 (0.050) 0.039 (0.053) 0.184 (0.097)
0.218 (0.018) 0.110 (0.018) 0.021 (0.019) 0.019 (0.020) 0.032 (0.037)
0.207 (0.031) 0.129 (0.031) 0.053 (0.032) 0.064 (0.035) 0.136 (0.064)
–
0.154 (0.024) –
0.083 0.048 (0.023) (0.019) – 0.157 (0.020) 0.179 0.159 (0.023) (0.024) – –
0.111 (0.022) 0.049 (0.028) 0.164 (0.019) –
0.058 (0.025) – 0.144 (0.030) –
0.013 (0.016) 0.053 (0.022) 0.167 (0.019) 0.280 (0.027) 0.221 (0.024) 0.133 (0.019) 0.368 (0.020) 0.103 (0.015)
0.039 (0.034) 0.060 (0.065) 0.115 (0.063) 0.698 (0.070) 0.542 (0.060) 0.338 (0.053) 0.048 (0.088) 0.158 (0.040)
0.097 (0.016) 0.022 (0.021) 0.174 (0.019) 0.494 (0.029) 0.577 (0.023) 0.378 (0.020) 0.424 (0.020) 0.169 (0.015)
0.114 (0.024) 0.297 (0.035) 0.127 (0.033) 0.832 (0.070) 0.798 (0.040) 0.439 (0.030) 0.364 (0.035) 0.189 (0.024)
–
Children aged less than 7 0.034 (0.021)
Children in household 0.160 (0.018)
Age below 25 0.626 (0.024)
Age 25 to 29 0.591 (0.020)
Age 30 to 39 0.328 (0.017)
Age above 49 0.431 (0.018)
Married 0.131 (0.014)
0.095 (0.018) 0.119 (0.017) 0.277 (0.026) 0.297 (0.021) 0.215 (0.016) 0.384 (0.018) 0.129 (0.012)
0.009 (0.033) –
Age Above 49
Basic Schooling
High School
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
0.187 (0.018)
0.079 (0.022)
0.073 (0.041)
0.099 (0.025) 0.186 (0.017)
Vocational Education
College and Beyond
1999
Men
0.212 (0.042)
0.087 (0.020)
Smaller city 0.105 (0.014)
Experience/100 6.423 (0.282)
(Experience/100) squared 12.787 (0.834)
Basic schooling 0.130 (0.013)
High school 0.084 (0.024)
College and beyond 0.160 (0.021)
Not member of union 0.009 (0.017)
Replacement rate of UI benefits 0.172 (0.031)
UI membership
Construction
Metal
KAD (female)
Production
Technical
0.177 (0.020) 0.205 (0.028) 0.217 (0.029)
0.264 0.445 0.146 (0.083) (0.063) (0.050) 0.004 0.071 0.058 (0.038) (0.031) (0.027) 0.056 0.163 0.144 (0.026) (0.025) (0.020) 4.010 12.841 4.468 (0.356) (0.651) (0.1607) – 44.133 – (2.676) 0.170 0.279 0.179 (0.024) (0.025) (0.019) 0.282 0.136 0.018 (0.033) (0.035) (0.032) 0.339 0.389 0.302 (0.064) (0.037) (0.025) 0.076 0.053 0.046 (0.031) (0.028) (0.023) 0.210 0.274 0.278 (0.066) (0.060) (0.043)
0.169 (0.050) – –
0.055 (0.054) – 0.042 (0.062) 0.056 (0.046) 0.285 (0.058) 0.211 (0.050)
0.101 (0.045) – 0.291 (0.062) 0.076 0.105 (0.023) (0.039) 0.042 0.570 (0.036) (0.069) 0.113 0.279 (0.021) (0.042)
0.093 (0.049) – 0.136 (0.051) 0.097 (0.040) 0.239 (0.052) 0.069 (0.045)
0.113 (0.084) 0.073 (0.036) 0.050 (0.023) 4.395 (0.475) 5.154 (1.667) 0.063 (0.021) 0.173 (0.050) 0.096 (0.028) 0.014 (0.028) 0.167 (0.050)
– 0.058 (0.040) 0.136 (0.027) 2.157 (0.460) 3.268 (1.279) 0.017 (0.023) 0.096 (0.080) 0.116 (0.035) 0.308 (0.056) 4.930 (0.170)
0.255 (0.056) – 0.274 (0.058) 0.250 (0.044) 0.090 (0.056) 0.116 (0.050)
0.281 (0.064) – 0.310 (0.065) 0.337 (0.051) 0.048 (0.059) 0.187 (0.056)
0.050 (0.050) 0.047 (0.027) 0.038 (0.0182) 8.341 (0.315) 18.868 (1.049) –
0.483 (0.094) 0.141 (0.047) 0.055 (0.041) 7.131 (0.912) 22.440 (3.661) –
0.079 (0.065) 0.028 (0.026) 0.131 (0.016) 6.241 (0.349) 13.819 (1.043) –
0.529 (0.072) 0.116 (0.031) 0.082 (0.027) 5.733 (0.526) 15.734 (1.835) –
–
–
–
–
–
–
–
–
0.014 (0.022) 0.209 (0.040)
0.049 (0.038) 0.010 (0.090)
0.051 (0.020) 0.169 (0.037)
0.124 (0.026) 0.398 (0.055)
0.166 (0.054) – 0.193 (0.049) 0.141 (0.044) 0.086 (0.056) 0.006 (0.049)
0.102 (0.236) – –
0.152 (0.028) – 0.116 (0.042) 0.135 (0.024) 0.216 (0.033) 0.142 (0.028)
0.083 (0.140) – –
– 0.236 (0.203) 0.134 (0.197)
– 0.094 (0.110) 0.178 (0.111)
Business
0.187 (0.024) – –
0.540 (0.052) 0.099 (0.021) 0.039 (0.016) 8.122 (0.306) 23.795 (1.057) 0.160 (0.014) 0.055 (0.024) 0.286 (0.018) 0.107 (0.016) 0.004 (0.034)
Non-OECD citizenship Copenhagen
Table A6. (Continued)
Variables
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
Administration 0.065 (0.025)
Academics 0.446 (0.036)
Others 0.233 (0.025)
Self-employed 0.747 (0.038)
0.081 (0.023) 0.116 (0.035) 0.019 (0.020) 0.277 (0.039)
0.143 (0.055) 0.435 (0.104) 0.249 (0.044) 0.712 (0.145)
0.418 (0.055) 0.455 (0.063) 0.173 (0.049) 0.648 (0.090)
0.282 (0.045) 0.347 (0.053) 0.202 (0.042) 0.641 (0.061)
0.122 (0.049) 0.259 (0.063) 0.251 (0.048) 0.605 (0.067)
0.049 (0.054) 0.151 (0.080) 0.189 (0.055) 0.313 (0.065)
0.220 (0.034) 0.260 (0.030) 0.022 (0.024) 0.073 (0.034)
0.220 (0.041) 0.091 (0.025) 0.026 (0.024) 0.056 (0.037)
0.179 (0.056) 0.030 (0.058) 0.156 (0.059) 0.032 (0.043)
0.042 0.088 (0.056) (0.052) 0.273 0.220 (0.046) (0.037) 0.030 0.034 (0.045) (0.030) 0.073 0.019 (0.060) (0.050)
0.364 (0.059) 0.161 (0.043) 0.030 (0.037) 0.118 (0.058)
1.441 (0.013)
1.430 (0.014)
1.110 (0.037)
1.470 (0.023)
Treatment effects Private OJT Public OJT CT Residual Mass point ne
Note: See the footnotes of Table A1.
1.427 (0.028)
1.466 (0.018)
Age Basic High Above 49 Schooling School 0.007 (0.048) –
Vocational College and Education Beyond
0.130 (0.046) 0.529 (0.059)
0.022 (0.197) 0.187 (0.077) 0.111 (0.196) 0.479 (0.222)
0.140 (0.030) – 0.301 (0.027) 0.548 (0.045)
0.220 (0.106) 0.129 (0.106) 0.042 (0.111) 0.493 (0.123)
0.848 (0.093) 0.347 (0.049) 0.317 (0.044) 0.897 (0.075)
0.193 (0.040) 0.140 (0.028) 0.078 (0.025) 0.056 (0.036)
0.024 (0.100) 0.036 (0.085) 0.122 (0.070) 0.018 (0.076)
0.218 (0.039) 0.190 (0.033) 0.054 (0.031) 0.083 (0.045)
0.085 (0.104) 0.063 (0.061) 0.026 (0.046) 0.176 (0.072)
1.387 (0.024)
1.390 (0.016)
1.287 (0.041)
1.422 (0.015)
1.499 (0.025)
JAKOB ROLAND MUNCH AND LARS SKIPPER
Men
Variables
nw1 nw2 Yearly indicator 1995 1996 1997 1999 2000 Female
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
5.508 (0.010) 4.102 (0.012)
5.301 (0.010) 3.913 (0.011)
6.675 (0.043) 5.278 (0.023)
7.004 (0.023) 5.395 (0.016)
6.730 (0.016) 5.442 (0.012)
6.677 (0.021) 5.395 (0.017)
6.540 (0.022) 5.522 (0.020)
6.556 (0.017) 5.435 (0.015)
5.208 (0.077) 3.427 (0.079)
6.669 (0.015) 5.431 (0.012)
6.513 (0.029) 5.456 (0.025)
0.128 (0.005) 0.085 (0.005) 0.063 (0.005) 0.036 (0.005) 0.034 (0.009)
0.137 (0.004) 0.094 (0.004) 0.058 (0.004) 0.035 (0.004) 0.034 (0.008)
0.126 (0.001) 0.094 (0.001) 0.051 (0.011) 0.031 (0.011) 0.024 (0.020)
0.147 (0.007) 0.099 (0.007) 0.069 (0.007) 0.038 (0.007) 0.020 (0.013)
0.137 (0.005) 0.086 (0.005) 0.065 (0.005) 0.038 (0.005) 0.042 (0.010)
0.119 (0.007) 0.086 (0.007) 0.061 (0.007) 0.040 (0.008) 0.019 (0.013)
0.119 (0.009) 0.078 (0.010) 0.041 (0.009) 0.021 (0.009) 0.056 (0.015)
0.133 (0.005) 0.081 (0.005) 0.060 (0.005) 0.033 (0.006) 0.031 (0.010)
0.123 (0.015) 0.100 (0.015) 0.068 (0.017) 0.013 (0.016) 0.088 (0.029)
0.131 (0.005) 0.091 (0.005) 0.057 (0.005) 0.034 (0.005) 0.034 (0.009)
0.135 (0.009) 0.097 (0.008) 0.067 (0.008) 0.043 (0.008) 0.024 (0.015)
–
0.077 (0.008) –
0.103 0.104 0.102 (0.005) (0.004) (0.005) – 0.001 0.001 (0.005) (0.006) 0.010 0.001 0.016 (0.005) (0.005) (0.005) – – –
0.086 (0.006) 0.028 (0.027) 0.002 (0.008) –
0.108 (0.004) 0.007 (0.005) 0.010 (0.005) 0.049 (0.007) 0.001 (0.006) 0.000 (0.005)
0.089 (0.010) 0.002 (0.020) 0.017 (0.020) 0.185 (0.020) 0.070 (0.017) 0.064 (0.017)
0.106 (0.004) 0.004 (0.005) 0.008 (0.005) 0.015 (0.006) 0.006 (0.005) 0.002 (0.004)
0.071 (0.005) 0.004 (0.008) 0.007 (0.007) 0.101 (0.013) 0.001 (0.008) 0.007 (0.007)
–
Children aged less 0.002 than 7 (0.005) Children in 0.012 household (0.005) Age below 25 0.050 (0.006) Age 25–29 0.009 (0.005) Age 30–39 0.008 (0.004)
0.009 (0.004) 0.005 (0.004) 0.063 (0.006) 0.008 (0.005) 0.001 (0.004)
0.002 (0.012) –
Age Basic Above 49 Schooling
–
–
–
–
–
–
–
–
–
–
High School
Vocational College and Education Beyond
Men
Effects of Socioeconomic Characteristics on the Hourly Wage Rate, 1995–2000.
Table A7.
Table A7. (Continued ) Variables
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
0.009 (0.005) Married 0.002 (0.003) Non-OECD 0.018 citizenship (0.011) Copenhagen 0.027 (0.005) Smaller city 0.002 (0.003) Experience/100 0.432 (0.067) (Experience/100) 1.288 squared (0.210) Basic schooling 0.020 (0.003) High school 0.030 (0.006) College and 0.026 beyond (0.005) Not member of 0.006 union (0.004) Replacement rate 0.616 of UI benefits, (0.007) lagged
0.013 (0.005) 0.011 (0.003) 0.014 (0.014) 0.051 (0.004) 0.013 (0.003) 0.323 (0.074) 0.629 (0.272) 0.019 (0.003) 0.002 (0.004) 0.053 (0.004) 0.003 (0.003) 0.507 (0.007)
–
–
–
–
–
0.012 (0.015) 0.011 (0.031) 0.040 (0.011) 0.030 (0.008) 0.751 (0.140) – 0.059 (0.008) 0.106 (0.009) 0.034 (0.020) 0.037 (0.009) 0.387 (0.020)
0.004 (0.005) 0.006 (0.019) 0.031 (0.007) 0.010 (0.005) 0.552 (0.139) 1.938 (0.548) 0.016 (0.006) 0.001 (0.007) 0.064 (0.008) 0.005 (0.006) 0.489 (0.012)
0.005 (0.004) 0.000 (0.012) 0.045 (0.006) 0.006 (0.004) 0.462 (0.101) 1.099 (0.435) 0.013 (0.004) 0.013 (0.006) 0.049 (0.005) 0.006 (0.004) 0.590 (0.009)
0.007 (0.005) 0.060 (0.021) 0.048 (0.008) 0.009 (0.006) 1.024 (0.117) 2.688 (0.397) 0.012 (0.005) 0.053 (0.011) 0.033 (0.007) 0.008 (0.006) 0.623 (0.011)
0.013 (0.006) 0.004 (0.081) 0.006 (0.010) 0.018 (0.007) 0.632 (0.112) 1.612 (0.314) 0.005 (0.006) 0.047 (0.020) 0.040 (0.008) 0.014 (0.007) 0.712 (0.012)
0.056 (0.012)
0.016 (0.015)
0.017 (0.011)
0.013 (0.010)
0.023 (0.014)
0.009 (0.015)
Age above 49
UI membership Construction
0.014 (0.006)
Age Basic Above 49 Schooling
High School
Vocational College and Education Beyond
0.012 (0.005) 0.007 (0.004) 0.019 (0.013) 0.037 (0.006) 0.015 (0.004) 0.740 (0.081) 1.896 (0.269) –
0.044 (0.028) 0.018 (0.012) 0.036 (0.030) 0.047 (0.013) 0.007 (0.012) 1.643 (0.262) 6.709 (1.076) –
0.004 (0.006) 0.006 (0.003) 0.001 (0.015) 0.036 (0.006) 0.008 (0.004) 0.464 (0.081) 1.072 (0.254) –
0.020 (0.008) 0.003 (0.005) 0.018 (0.022) 0.038 (0.007) 0.006 (0.006) 0.282 (0.111) 0.714 (0.398) –
–
–
–
–
–
–
–
–
0.005 (0.005) 0.606 (0.009)
0.014 (0.011) 0.357 (0.024)
0.010 (0.004) 0.566 (0.008)
0.001 (0.006) 0.542 (0.012)
0.031 (0.012)
0.031 (0.081)
0.006 (0.006)
0.007 (0.027)
Men
–
KAD (female)
–
Production
0.017 (0.005) Technical 0.025 (0.007) Business 0.055 (0.008) Administration 0.004 (0.007) Academics 0.138 (0.009) Others 0.045 (0.006) Self-employed 0.044 (0.010) Treatment effects Private OJT Public OJT CT Residual Var(u)
0.072 (0.009) 0.078 (0.008) 0.048 (0.007) 0.054 (0.008) 0.116 (0.000)
0.050 (0.021) –
–
–
0.029 (0.006) 0.076 (0.008) 0.000 (0.005) 0.051 (0.005) 0.198 (0.007) 0.043 (0.005) 0.009 (0.010)
0.074 (0.020) 0.053 (0.013) 0.045 (0.019) 0.112 (0.014) 0.048 (0.020) 0.026 (0.030) 0.132 (0.014) 0.110 (0.052)
0.050 (0.014) 0.010 (0.010) 0.006 (0.013) 0.059 (0.011) 0.006 (0.012) 0.123 (0.013) 0.087 (0.010) 0.017 (0.020)
0.028 (0.010) 0.032 (0.007) 0.028 (0.006) 0.035 (0.007) 0.110 (0.000)
0.034 (0.017) 0.017 (0.017) 0.062 (0.016) 0.045 (0.011) 0.151 (0.000)
0.051 (0.014) 0.057 (0.010) 0.024 (0.010) 0.033 (0.011) 0.111 (0.000)
–
–
–
–
–
–
0.029 0.014 (0.011) (0.016) 0.002 0.018 (0.008) (0.012) 0.039 0.059 (0.011) (0.014) 0.034 0.012 (0.010) (0.014) 0.006 0.036 (0.010) (0.013) 0.162 0.151 (0.011) (0.015) 0.066 0.042 (0.009) (0.012) 0.019 0.034 (0.013) (0.016)
0.038 (0.019) 0.013 (0.013) 0.016 (0.014) 0.037 (0.015) 0.010 (0.014) 0.109 (0.016) 0.058 (0.013) 0.013 (0.016)
0.043 (0.012) 0.028 (0.010) 0.001 (0.013) 0.058 (0.012) 0.007 (0.012) 0.199 (0.037) 0.085 (0.010) 0.019 (0.015)
0.024 (0.074) 0.069 (0.072) 0.188 (0.073) 0.063 (0.072) 0.182 (0.072) 0.263 (0.073) 0.031 (0.071) 0.134 (0.078)
0.045 (0.011) 0.003 (0.005) 0.004 (0.008) 0.048 (0.006) 0.014 (0.007) 0.152 (0.046) 0.075 (0.006) 0.023 (0.010)
0.096 (0.044) 0.022 (0.025) 0.087 (0.023) 0.035 (0.024) 0.008 (0.022) 0.153 (0.022) 0.059 (0.023) 0.022 (0.025)
0.056 (0.015) 0.058 (0.008) 0.043 (0.007) 0.042 (0.010) 0.108 (0.000)
0.093 (0.022) 0.027 (0.014) 0.034 (0.013) 0.047 (0.021) 0.106 (0.000)
0.048 (0.010) 0.043 (0.007) 0.035 (0.008) 0.048 (0.007) 0.108 (0.000)
0.004 (0.031) 0.072 (0.025) 0.033 (0.018) 0.017 (0.021) 0.161 (0.000)
0.034 (0.011) 0.047 (0.008) 0.033 (0.007) 0.028 (0.009) 0.111 (0.000)
0.092 (0.023) 0.067 (0.014) 0.041 (0.012) 0.027 (0.017) 0.119 (0.000)
0.029 (0.019) 0.050 (0.012) 0.026 (0.010) 0.064 (0.013) 0.112 (0.000)
Note: See the footnotes of Table A1.
–
Metal
Variables
Mass-Point Probabilities, 1995–2000.
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
p(nu,na1,0,0,0,ne)
–
–
–
–
–
–
0.054 (0.044) –
–
p(nu,0,0,na3,0)
0.018 (0.001) –
–
–
–
0.295 (0.014) 0.144 (0.012) 0.136 (0.012) –
–
–
–
– 0.020 (0.009) –
0.095 (0.014) 0.008 (0.010) –
0.090 (0.017) –
–
0.031 (0.035) 0.039 (0.034) 0.025 (0.006) –
0.013 (0.005) 0.038 (0.006) –
p(nu,na1,na2,0,na4,ne)
–
–
p(nu,na1,0,na3,na4)
–
–
0.133 (0.017) –
0.014 (0.025) –
0.035 (0.005) 0.047 (0.014) –
0.048 (0.006) 0.073 (0.016) –
p(nu,na1,na2,na3,na4,ne)
–
–
–
–
–
p(0,na1,na2,0,0,ne)
–
–
–
–
–
p(0,na1,0,na3,0,ne)
–
0.145 (0.005) 0.035 (0.003) –
–
–
–
–
p(0,na1,0,0,na4,ne)
–
–
–
–
p(0,na1,na2,0,na4,ne)
–
0.060 (0.004) –
–
–
p(0,na1,na2,na3,0,ne)
–
–
0.011 (0.006)
0.011 (0.005)
p(nu,na1,0,0,na4,ne) p(nu,0,na2,0,na4,ne) p(nu,0,0,na3,na4,ne)
–
–
Age Basic Above 49 Schooling 0.051 (0.008) – 0.021 (0.008) –
High School 0.009 (0.030) 0.019 (0.018) 0.029 (0.028) –
Vocational College and Education Beyond – –
0.040 (0.012) –
–
0.071 (0.013) 0.048 (0.009) –
0.051 (0.024) –
–
–
0.014 (0.008) –
0.038 (0.013) –
0.038 (0.021) –
–
–
–
0.023 (0.004) –
–
–
–
–
0.159 (0.011) –
–
0.074 (0.023) 0.016 (0.009) –
–
–
–
–
–
–
–
–
–
0.030 (0.010) –
–
–
–
–
–
–
–
–
–
–
0.008 (0.003)
0.016 (0.005)
0.024 (0.012)
0.131 (0.007) 0.009 (0.003)
–
0.029 (0.004)
0.016 (0.006)
Men
p(nu,na1,na2,0,0,ne)
Table A8.
–
p(0,na1,na2,na3,na4,ne) 0.174 (0.009) p(0,0,na2,0,na4,ne) –
– –
0.028 (0.007) – –
0.012 (0.004) –
0.006 (0.003) 0.013 (0.003) –
0.025 (0.005) 0.022 (0.005) –
–
–
–
–
p(0,0,na2,na3,na4,ne)
–
0.016 (0.002) –
p(0,0,0,na3,na4,ne)
–
–
p(nu,na1,0,0,0,0)
–
–
0.336 (0.0339) –
p(nu,0,0,na3,0)
–
–
–
0.071 (0.006) 0.077 (0.006) 0.054 (0.005) –
–
–
–
–
–
0.041 (0.026) –
p(nu,na1,na2,0,0,0) p(nu,na1,0,0,na4,0) p(nu,0,na2,0,na4,0) p(nu,0,0,na3,na4,0)
–
–
–
p(nu,na1,na2,na3,na4,0)
–
p(0,na1,na2,0,0,0)
–
p(0,na1,0,na3,0,0)
–
0.374 (0.009) 0.161 (0.008) –
p(0,na1,0,0,na4,0)
–
p(0,na1,na2,0,na4,0)
–
0.164 (0.008) –
–
–
–
0.032 (0.004) –
–
–
–
–
0.039 (0.008) –
0.004 (0.002) –
–
–
0.012 (0.003) –
0.074 (0.114) –
–
–
–
–
–
–
0.181 (0.009) –
0.181 (0.091) 0.084 (0.095) –
0.184 (0.038) 0.079 (0.034) –
0.198 (0.045) –
–
0.079 (0.014) 0.140 (0.042) –
0.079 (0.010) 0.114 (0.011) 0.037 (0.005) –
–
–
0.035 (0.004) 0.018 (0.004) –
0.018 (0.004) 0.020 (0.006) –
0.024 (0.003) –
–
–
–
–
–
–
0.117 (0.037) 0.130 (0.029) –
0.248 (0.071) –
–
0.027 (0.152) 0.044 (0.046) 0.334 (0.112) 0.043 (0.108) –
–
–
–
0.056 (0.036) –
–
–
0.325 (0.016) –
–
0.037 (0.078) –
0.126 (0.037) –
0.178 (0.070) –
–
–
–
–
–
–
–
–
–
–
–
–
0.266 (0.015)
–
–
–
0.208 (0.034) –
0.207 (0.070) –
0.062 (0.017) 0.146 (0.037) –
–
–
–
–
–
–
–
–
–
–
–
–
0.144 (0.010) –
–
–
–
–
–
– –
p(nu,na1,na2,0,na4,0)
–
p(0,na1,0,na3,na4,ne)
Table A8. (Continued ) Variables
Women
Age Below 25
Age 25–29
Age 30–39
Age 40–49
p(0,na1,na2,na3,0,0)
–
–
–
–
0.049 (0.004) –
–
0.080 (0.011) 0.091 (0.009) 0.058 (0.010) –
0.138 (0.011) 0.090 (0.008) 0.048 (0.009) –
0.128 (0.013) 0.102 (0.011) 0.068 (0.013) –
–
p(0,na1,0,na3,na4,0)
0.068 (0.0135) 0.038 (0.011) 0.052 (0.010) – 0.046 (0.008)
0.040 (0.004)
0.041 (0.004)
–
p(0,na1,na2,na3,na4,0) p(0,0,na2,0,na4,0) p(0,0,na2,na3,na4,0)
–
0.045 (0.004) –
Age Basic Above 49 Schooling
High School
–
0.140 (0.009) 0.101 (0.009) –
–
–
0.095 (0.017) 0.084 (0.022) 0.138 (0.023) –
–
0.042 (0.004)
0.050 (0.011)
–
Vocational College and Education Beyond 0.130 (0.011) 0.104 (0.009) 0.078 (0.010) –
0.125 (0.016) 0.083 (0.011) 0.070 (0.015) –
0.052 (0.005)
0.046 (0.008)
Men
WHEN IS ATE ENOUGH? RISK AVERSION AND INEQUALITY AVERSION IN EVALUATING TRAINING PROGRAMS

Rajeev Dehejia

ABSTRACT

Programs are typically evaluated through the average treatment effect and its standard error. In particular, evaluators ask whether the treatment effect is positive and whether it is statistically significant. In theory, programs should be evaluated in a decision framework, using social welfare functions and posterior predictive distributions for outcomes of interest. This chapter discusses the use of stochastic dominance of predictive distributions of outcomes to rank programs and, under more restrictive parametric and functional form assumptions, develops intuitive mean-variance tests for program evaluation that are consistent with the underlying decision problem. These concepts are applied to the GAIN and JTPA datasets.
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 263–287
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00009-6

1. INTRODUCTION

Program evaluation is typically carried out by considering the average treatment effect (ATE) of a new program under consideration (called the
treatment) relative to a status quo program (called the control). This is true both in experimental settings, where the ATE can be estimated by a simple difference in means for outcomes of interest between the treatment and control groups, and in non-experimental settings, where the ATE is often a parameter in a more complicated model. Uncertainty regarding the treatment impact is summarized by the standard error of the ATE, often through the statistical significance of the point estimate.

In contrast, decision theory offers a more comprehensive method for evaluating programs. A decision-theoretic analysis leads to a choice that maximizes (minimizes) expected utility (loss) given a likelihood model of the data. This result was first established by Wald (1950) and has led many researchers to formalize and extend the decision-theoretic framework. Of course, one might argue that this formalized statement of objectives misses intangible features of the decision problem. It is clear, however, that the decision-theoretic framework, while adding a layer of complexity to simple Neyman–Fisher hypothesis testing, provides a solid foundation for inference that explicitly links empirical estimates with the broader framework of rational, economic decision-making.

At a more practical level, a decision theory approach is more general than simply looking at the ATE along two dimensions. First, it accounts for uncertainty regarding the treatment impact in a systematic way, through the decision-maker's risk attitude in an expected utility setting. To the extent that the Neumann and Morgenstern (1944) and Wald (1950) approach is widely accepted in economics, the decision-theoretic approach reflects how we should account for uncertainty. Second, the decision approach allows the decision-maker to exhibit aversion to inequality in individual outcomes, which would lead him or her to consider the treatment impact on features of the distribution other than the mean.
The aim of this chapter is to develop rules of thumb for evaluating programs that are consistent with the decision framework. We show that the traditional approach to evaluating programs, interpreted literally as adopting a program when its treatment effect is positive and statistically significant, is valid only under very strong assumptions. We then relax these assumptions to develop simple techniques (rules of thumb) for evaluating programs that are valid under more general conditions.

The chapter proceeds as follows. In Section 2, we set up the general framework for evaluation. In Section 3, we establish conditions for equivalence between the traditional approach and the decision approach. In Section 4, we develop several rules of thumb for evaluating programs. In Section 5, we apply these rules in evaluating the Greater Avenues for Independence (GAIN) and Job Training Partnership Act (JTPA) datasets, and Section 6 concludes the chapter.
2. A FRAMEWORK FOR EVALUATION

2.1. Decision Theory and Evaluation

In a world without uncertainty, the policy-maker adopts a social welfare function, $S(u_1(y_1), u_2(y_2), \ldots, u_N(y_N))$, which, for the population of interest, $n = 1, \ldots, N$, aggregates individual utility over the outcome of interest, $y_n$. In general, we think of $y_n$ as being the net outcome of interest. For example, in an evaluation context, $y_n$ could be post-program earnings or post-program earnings net of costs. Note that this is a "welfarist" social welfare function (see Sen, 1973; Heckman & Smith, 1998), because it is defined over individual utility, $u_n(\cdot)$, from the outcome of interest.

In the context of program evaluation, each outcome of interest could, potentially, be observed under treatment, $y_i^{(1)}$, and control, $y_i^{(0)}$. Assignment to treatment, $t_i = 1$ or $0$, determines which of these is observed. The utility-relevant outcomes are summarized by $W = \{y_i^{(0)}, y_i^{(1)}\}_{i=1}^{n}$. The observed data are given by $Z = \{y_i, t_i\}_{i=1}^{n}$, where $y_i = t_i y_i^{(1)} + (1 - t_i) y_i^{(0)}$. There is a distribution for $Z$ and $W$:

\[ (Z, W) \sim Q \tag{1} \]

where

\[ Q = \int_{\theta \in \Theta} P_\theta \, d\pi(\theta) \]

The outcome that will be realized depends on a decision by the policy-maker, $a = 0, 1$:

\[ h(W; a) = \{y_i^{(a)}\}_{i=1}^{n} \]

Ex post expected social welfare is then given by:

\[ \mathrm{SWF}(a) = E_Q\{U[S(\{y_i^{(a)}\}_{i=1}^{n})]\} \]

where $a = 0, 1$, $U(\cdot)$ is a Neumann–Morgenstern utility function, and $S$ is a social welfare function. A simplification that we will often adopt is to assume that $u_n(y_n) = y_n$, so that the policy-maker is concerned only with
outcomes at the individual level. The expectation is with respect to the predictive distribution, $\{y_i^{(a)}\}_{i=1}^{n} \mid Z$.

An alternative formulation considers ex ante uncertainty. Uncertainty is accounted for at the individual level. Each individual faces uncertainty regarding the outcome of interest, represented by $Y_n^{(a)}$, a distribution over $y_n^{(a)}$, and realizes some expected utility $u_n(Y_n^{(a)})$. The policy-maker's decision is based on a social welfare function defined over these individual expected utilities, $S(u_1(Y_1^{(a)}), u_2(Y_2^{(a)}), \ldots, u_n(Y_n^{(a)}))$, or their certainty equivalents:

\[ S(y_1^{(a)}, y_2^{(a)}, \ldots, y_n^{(a)}) \tag{2} \]

where $y_i^{(a)}$ is the certainty equivalent of individual $i$'s earnings (see Dehejia, 1999).

Both formulations are useful, depending on the context. We adopt the first formulation when focusing on the role that risk attitude plays in the evaluation, since it allows us to assume that the decision-maker is neutral with respect to inequality in individual outcomes, while remaining agnostic about risk attitude. We adopt the second formulation when allowing the decision-maker to be averse to inequality in outcomes across program participants, since inequality aversion is more transparent in that formulation.
2.2. Decision Theory in Practice

Three ingredients must be specified in order to implement this approach. First, given the observed data, an appropriate model must be specified for the outcome. For example, in an evaluation context, a model for earnings would be needed. If the data are drawn from a randomized trial, then the model would be straightforward; if there are issues of sample selection, then a more complicated model would be needed. From the model, we either approximate (using maximum likelihood) or simulate (using Markov Chain Monte Carlo) the posterior distribution of the parameters, $p(\theta \mid Z)$. Based on the posterior distribution of the parameters, we are interested in the predictive distribution of individual earnings, $\{y_i^{(a)}\}_{i=1}^{n} \mid Z$. This is the distribution in the space of the outcome that accounts for all parameter and intrinsic uncertainty, conditional on model choice, and is readily simulated. Second, we must specify the policy-maker's risk attitude. Third, we must specify the policy-maker's inequality attitude.
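The three ingredients can be illustrated with a minimal sketch. It is not the chapter's implementation. It assumes a normal earnings model with a flat prior on the mean and the variance fixed at its sample value, an inequality-neutral policy-maker, and a CARA utility function with a hypothetical risk-aversion coefficient of 0.1. The earnings figures and all function names are invented for illustration.

```python
import math
import random
import statistics


def predictive_mean_draws(sample, n_future, n_draws, rng):
    """Simulate the predictive distribution of average earnings.

    Assumptions (illustrative, not from the chapter): normal likelihood,
    flat prior on the mean, variance fixed at the sample variance.
    """
    ybar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    n = len(sample)
    draws = []
    for _ in range(n_draws):
        # Parameter uncertainty: draw the mean from its approximate posterior.
        mu = rng.gauss(ybar, s / math.sqrt(n))
        # Intrinsic uncertainty: draw the average outcome of n_future people.
        draws.append(rng.gauss(mu, s / math.sqrt(n_future)))
    return draws


def expected_utility(draws, utility):
    """Average utility over draws from the predictive distribution."""
    return statistics.fmean([utility(d) for d in draws])


def cara(x, gamma=0.1):
    """Hypothetical risk-averse (CARA) utility function."""
    return -math.exp(-gamma * x)


rng = random.Random(0)
treated = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0]   # made-up earnings data
control = [10.0, 11.0, 9.5, 10.5, 10.0, 9.0]    # made-up earnings data
eu_treat = expected_utility(predictive_mean_draws(treated, 100, 5000, rng), cara)
eu_control = expected_utility(predictive_mean_draws(control, 100, 5000, rng), cara)
adopt_treatment = eu_treat > eu_control
```

Swapping `cara` for a linear function gives the risk-neutral case, in which only the mean of the predictive distribution matters.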
Two simplifications that are often adopted are risk neutrality and inequality neutrality. Inequality neutrality implies that the policy-maker is concerned only with the average outcome over the population of interest. This is stated in Observation 1.

Observation 1. When $S(\cdot)$ is utilitarian (linear) and $u_n(y_n) = y_n$, the policy decision turns on comparing $U(\bar{y}_0)$ with $U(\bar{y}_1)$.

This implies that the policy-maker would examine the predictive distribution of $\bar{y}$ for both programs under consideration. The assumption of risk neutrality implies that the policy-maker is concerned only with the mean predicted value of social welfare. The two assumptions together have the strong implication that the policy-maker will compare the mean of the predictive distribution of average earnings across the two programs. Under the additional assumption of normality of the outcome, the mean of the predictive distribution of earnings will simply correspond to the sample mean. This is stated in Observation 2.

Observation 2. Assume $S(\cdot)$ is inequality neutral, $U(\cdot)$ is risk neutral, $u_i(y_i) = y_i$, $y_{i1} \sim N(\mu_1, \sigma_1)$, and $y_{i0} \sim N(\mu_0, \sigma_0)$. Then the treatment program will be adopted if $\bar{y}_1 - \bar{y}_0 > 0$.
2.3. The Traditional Approach

Rather than focusing on the predictive distribution, the traditional approach to program evaluation instead examines the ATE and its standard error. One version of this is stated below as Rule of Thumb 0.

Rule of Thumb 0 (RT0). Adopt program 1 (the treatment) if

\[ t = \frac{\bar{y}_1 - \bar{y}_0}{s_p \sqrt{1/n_1 + 1/n_0}} > t_{\alpha/2} \tag{RT0} \]

i.e., if the treatment effect is positive and statistically significant, where $s_p^2 = ((n_1 - 1)s_1^2 + (n_0 - 1)s_0^2)/(n_1 + n_0 - 2)$ and $n_i$, $\bar{y}_i$, and $s_i^2$ are the sample size, mean, and variance for group $i = 1, 0$. Essentially, the salience of the treatment effect is evaluated through its t-statistic. Though this approach might seem overly simplistic, it is the most widely used rule of thumb in evaluating programs (see, e.g., Friedlander, Greenberg, & Robins, 1997, Table 3).
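As a concrete reading of RT0, the sketch below computes the pooled-variance t-statistic and applies the adoption rule. The function names and the earnings figures are hypothetical. The critical value is supplied by the caller; 2.228 is the two-sided 5% value for 10 degrees of freedom.

```python
import math
import statistics


def rt0_t_statistic(treated, control):
    """Pooled-variance t-statistic for the difference in means (RT0)."""
    n1, n0 = len(treated), len(control)
    ybar1, ybar0 = statistics.fmean(treated), statistics.fmean(control)
    s1_sq = statistics.variance(treated)  # sample variance, divisor n1 - 1
    s0_sq = statistics.variance(control)  # sample variance, divisor n0 - 1
    sp_sq = ((n1 - 1) * s1_sq + (n0 - 1) * s0_sq) / (n1 + n0 - 2)
    return (ybar1 - ybar0) / math.sqrt(sp_sq * (1.0 / n1 + 1.0 / n0))


def rt0_adopt(treated, control, critical_value):
    """Adopt the treatment if the t-statistic exceeds the critical value."""
    return rt0_t_statistic(treated, control) > critical_value


treated = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0]  # made-up earnings data
control = [10.0, 11.0, 9.5, 10.5, 10.0, 9.0]   # made-up earnings data
t_stat = rt0_t_statistic(treated, control)
decision = rt0_adopt(treated, control, critical_value=2.228)
```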
3. LOOKING FOR EQUIVALENCE

In this section, we discuss the assumptions within a decision framework that justify the use of the ATE in evaluating programs.

3.1. There is No General Equivalence

Proposition 1 sketches the argument that RT0 is not in general justified as the outcome of an expected utility decision process. The proposition discusses the case of normality. For non-normal distributions, a similar argument can be made with respect to the asymptotic distribution of sample means.

Proposition 1. Assume $S(\cdot)$ is inequality neutral, $y_{i1} \sim N(\mu_1, \sigma^2)$, and $y_{i0} \sim N(\mu_0, \sigma^2)$. There exist some $n_1$ and $n_0$ and samples such that RT0 is not valid under a decision framework.

Sketch Proof. For simplicity, assume that the difference in program cost between the treatment and control programs is constant, $c$, so that $y_{i1}$ and $y_{i0}$ are the outcomes. Consider three cases.

(A) Positive program cost, $c$, and symmetric priors for programs 1 and 0: Consider $n_1 = n_0 = n \to \infty$. As $n$ grows large, the numerator of RT0 approaches $\mu_1 - \mu_0$, and the denominator of RT0 shrinks. For a sufficiently large $n$, any positive mean difference will be statistically significant. But then program 1 will be chosen if there is a small mean difference with program 0, even though the difference is less than $c$. This is inconsistent with $U(\cdot)$ increasing in earnings net of program costs.

(B) Zero cost and symmetric priors: Consider the case with $s_1 = s_0$ and $n_1 = n_0$. If utility is increasing in $y$, $\bar{y}_1 > \bar{y}_0$ is sufficient to choose program 1, but this condition is not sufficient under RT0.

(C) Zero cost and asymmetric prior: This case includes situations where there is a prior preference for one of the programs or extra information about one of the programs. However, even in this case, for $n_1$ and $n_0$ sufficiently large, the data will dominate the prior and we return to case (B).

3.2. A Special Case

Thus, the justification, if any, for the standard decision procedure will rely on fixed sample size arguments.
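Case (A) of Proposition 1 can be made concrete with a small numerical sketch. It evaluates the RT0 statistic at the population values, that is, with the sample mean difference and pooled standard deviation replaced by their large-sample limits, and it uses the normal critical value 1.96 as a large-sample approximation. The mean difference, standard deviation, and program cost are hypothetical.

```python
import math


def large_sample_t(mean_diff, sigma, n):
    """RT0 statistic at population values with n1 = n0 = n."""
    return mean_diff / (sigma * math.sqrt(2.0 / n))


def smallest_significant_n(mean_diff, sigma, critical_value=1.96):
    """Smallest per-group sample size at which RT0 adopts the treatment."""
    n = 2
    while large_sample_t(mean_diff, sigma, n) <= critical_value:
        n += 1
    return n


mean_diff = 0.5      # hypothetical true gain in earnings
program_cost = 1.0   # hypothetical cost difference c, larger than the gain
sigma = 10.0         # hypothetical common standard deviation
n_star = smallest_significant_n(mean_diff, sigma)
```

With these numbers RT0 adopts the treatment once each group has `n_star` observations, even though the gain of 0.5 never covers the cost of 1.0.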
One argument justifying RT0 for a fixed sample size is offered in Proposition 2.
Proposition 2. RT0 can be justified by state-dependent preferences, which are defined on the decision space rather than the outcome space.

Proof. Consider the following table:

State of the world (s)   | Choice made (c): Treatment (1) | Choice made (c): Control (0)
$\mu_1 > \mu_0$ (1)      | $a_{11}$                       | $a_{10}$
$\mu_1 < \mu_0$ (0)      | $a_{01}$                       | $a_{00}$

The elements of the table would normally correspond to the utility from the choice made, given that state of the world. The decision rule is given by:

\[ EU(c = 1 \mid \text{data}) = a_{11} \Pr(\mu_1 > \mu_0 \mid \text{data}) + a_{01} \Pr(\mu_1 < \mu_0 \mid \text{data}) \]

\[ EU(c = 0 \mid \text{data}) = a_{10} \Pr(\mu_1 > \mu_0 \mid \text{data}) + a_{00} \Pr(\mu_1 < \mu_0 \mid \text{data}) \]

and

\[ EU(c = 1 \mid \text{data}) - EU(c = 0 \mid \text{data}) = a_{01} - a_{00} + (a_{11} - a_{10} - (a_{01} - a_{00})) \Pr(\mu_1 > \mu_0 \mid \text{data}) \]

Thus $c = 1$ if

\[ \Pr(\mu_1 > \mu_0 \mid \text{data}) > \frac{a_{00} - a_{01}}{a_{11} - a_{01} - (a_{10} - a_{00})} \equiv k \tag{3} \]

For fixed $n_1$ and $n_0$ and appropriate values of $a_{01}$, $a_{11}$, $a_{10}$, and $a_{00}$, $k = t_{\alpha/2}$.

There are two limitations to this result. First, as mentioned above, it relies on a fixed sample size. Second, the preferences adopted in Proposition 2 are state dependent and defined over choices rather than outcomes. This is a problematic assumption, because the policy-maker's preferences depend on a state ($\mu_1 > \mu_0$ versus $\mu_1 < \mu_0$) which is unobserved both before and after the decision is made.¹ Normally, state-dependent preferences rely on a state of the world that is unobserved when a decision is being made, but which is observed when the outcome is experienced.²

In conclusion, the rule of thumb that is widely adopted to evaluate programs is not in general valid. In the next section, we examine rules of thumb that are derived directly from a Bayesian decision procedure.
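The threshold rule in Eq. (3) can be sketched numerically. The payoff values below are hypothetical and were chosen so that the threshold equals 0.95. The sketch assumes the denominator of Eq. (3) is positive, so that the inequality keeps its direction, and it ignores ties between the two means.

```python
def threshold_k(a11, a10, a01, a00):
    """Threshold k from Eq. (3); payoffs are indexed as a[state][choice]."""
    denom = a11 - a01 - (a10 - a00)
    if denom <= 0:
        raise ValueError("Eq. (3) as written requires a positive denominator")
    return (a00 - a01) / denom


def choose_treatment(prob_mu1_gt_mu0, a11, a10, a01, a00):
    """Compare the two expected utilities directly."""
    p = prob_mu1_gt_mu0
    eu_treat = a11 * p + a01 * (1.0 - p)
    eu_control = a10 * p + a00 * (1.0 - p)
    return eu_treat > eu_control


# Hypothetical payoffs: wrongly adopting the treatment is penalized heavily.
payoffs = dict(a11=1.0, a10=0.0, a01=-19.0, a00=0.0)
k = threshold_k(**payoffs)
```

Comparing expected utilities directly and comparing the posterior probability with `k` give the same decision, which is the content of Eq. (3).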
4. RELAXING RISK AND INEQUALITY NEUTRALITY

In this section, we develop rules of thumb that, unlike RT0, are valid within the standard decision framework. In Section 4.1, we develop these rules by making specific assumptions about the likelihood model for earnings (normal, log normal, a mixture distribution) while maintaining the assumption of inequality neutrality. In Section 4.2, while still assuming inequality neutrality, we relax the parametric assumptions by using a multinomial likelihood. In Section 4.3, we discuss the inclusion of covariates, using a maximum likelihood framework. Finally, in Section 4.4, we allow for inequality aversion.

4.1. Relaxing Risk Neutrality

A policy-maker who is inequality neutral, as we have seen in Observation 1, will consider $\bar{y}$, the average earnings in each program, but will also have to account appropriately for uncertainty. Table 1, rows 1–5, summarizes rules of thumb that allow for risk aversion, but maintain the assumption of inequality neutrality, under a range of parametric assumptions on the earnings process. The simplest assumption is that $y_{it} \sim N(\mu_t, \sigma_t)$, for $t = 1$ (treatment) and $0$ (control), in which case the predictive distribution of average earnings under treatment (control) is given by

\[ \bar{y}_t \mid \text{Data} \sim N\left(\bar{y}_{T=t}, \frac{1}{n}\left(\frac{1}{n_t} + 1\right) s_t^2\right) \equiv N(\mu_t, \sigma_t^2) \tag{4} \]

where $\bar{y}_{T=t}$ and $s_t^2$ represent the sample mean and variance of the empirical distribution and $n_t$ is the sample size, for $t = 1, 0$. Note that $\bar{y}_t$ is the average predicted earnings for the sample of interest, whereas $\bar{y}_{T=t}$ is the sample mean of earnings for the treatment (control) group.³,⁴ We derive a decision rule by using the concepts of first- and second-order stochastic dominance.
This is a useful concept because it allows us to compare distributions without making specific assumptions on the utility function.⁵ The limitation of the concept is that not all distributions can be ranked according to first- and second-order stochastic dominance (though, of course, one can appeal to higher orders of stochastic dominance). Row 1 of Table 1 states the conditions under which the treatment first-order stochastically dominates the control. The conclusion is that, if all we know about the decision-maker is that his utility function is increasing in the average earnings from the program, the programs can be ranked unambiguously only if one of the programs has both a higher mean and a lower variance.

Table 1. Rules of Thumb for Evaluating Programs.

Row | Assumption on U( ) | Assumption on S( ) | Distribution | Solution Concept (a) | Adopt Treatment If
1 | – | Linear | Normal | SD1 | $\mu_1 > \mu_0$ and $\sigma_1^2 < \sigma_0^2$ (c)
2 | – | Linear | Log normal | SD2 | $\sigma_1^2 < \sigma_0^2$ and $\mu_1 + \sigma_1^2/2 > \mu_0 + \sigma_0^2/2$ (d)
3 | – | Linear | Mixture (b) | SD1 | $p_0 > p_1$, $\mu_0 < \mu_1$, and $\sigma_0 > \sigma_1$ (e)
4 | – | Linear | Mixture | SD1 | $p_0 - p_1 > \Pr(y_1 \le c \mid \mu_1, \sigma_1) - \Pr(y_0 \le c \mid \mu_0, \sigma_0)$, $\mu_0 < \mu_1$, and $\sigma_0 < \sigma_1$ (f)
5 | – | Quadratic | – | EU | $b\mu_1 + cv_1 + c\mu_1^2 > b\mu_0 + cv_0 + c\mu_0^2$ (g)

(a) SDn = nth-order stochastic dominance; EU = expected utility.
(b) Mixture: $\Pr(y_{it} = 0) = p_t$, $\Pr(y_{it} > 0) = 1 - p_t$, $t = 1, 0$, with $y_i \sim \log N$ given $y_i > 0$.
(c) $\mu_i = \bar{y}_i$; $\sigma_i^2 = \frac{1}{n_i}\left(\frac{1}{n_i} + 1\right)s_i^2$; $i = 1$ for treatment group, $i = 0$ for control group.
(d) $\mu_i = \bar{y}_i$; $\sigma_i^2 = \frac{1}{n_i}\left(\frac{1}{n_i} + 1\right)s_i^2$, sample statistics from log earnings data; $i = 1$ for treatment group, $i = 0$ for control group.
(e) $p_i$ = proportion of zeros in earnings data; $\mu_i = \bar{y}_i$; $\sigma_i^2 = \frac{1}{n_i}\left(\frac{1}{n_i} + 1\right)s_i^2$, sample statistics from log of positive earnings data; $i = 1$ for treatment group, $i = 0$ for control group.
(f) Sample statistics as in (c), with
\[ c = \frac{(\mu_0/\sigma_0^2) - (\mu_1/\sigma_1^2) \pm \left( \left( (\mu_0/\sigma_0^2) - (\mu_1/\sigma_1^2) \right)^2 - 4 \left( (1/2\sigma_0^2) - (1/2\sigma_1^2) \right) \left( (\mu_0^2/\sigma_0^2) - (\mu_1^2/\sigma_1^2) + \log \sigma_0 - \log \sigma_1 \right) \right)^{1/2}}{(1/2\sigma_0^2) - (1/2\sigma_1^2)} \]
(g) $S(y) = a + by + cy^2$; $c_n$ = certainty equivalent of individual predictive distributions of earnings; $\mu_i = \text{mean}(c_n)$, $v_i = \text{variance}(c_n)$; $i = 1$ for treatment group, $i = 0$ for control group.

A limitation of the rule described in row 1 is the assumption that the outcome is normally distributed, which is unrealistic for earnings. A more common assumption is that earnings are log normally distributed. Eq. (4) can be reinterpreted as describing the predictive distribution of average log earnings (where the sample moments are also for the log distribution). The policy-maker compares the exponential of this distribution for each program. (Note that this is not precisely correct, since the exponential is not a linear operator. The policy-maker should compare the average of the exponential of the individual predictive distributions of earnings. Unfortunately, this distribution will not have a convenient parametric form, which is why we consider this alternative.) From row 2, we see that if a distribution is to be preferred by a decision-maker under all risk-averse preferences to another, then it must have a lower variance. However, its mean can be lower, because of the second condition. Thus, a mean-variance tradeoff exists to the extent that if the first distribution has a lower variance, its mean must be higher to a sufficient degree to obtain second-order stochastic dominance. Thus, this rule is more demanding than the previous one.

Log normality can also be a problematic assumption under some circumstances. In particular, in many labor-training programs a large proportion of observations is concentrated at zero (representing zero earnings, which is common among welfare recipients). Rows 3 and 4 generalize the likelihood to a mixture distribution, in which there is a probability $p$ of zero (and a probability $1 - p$ of positive) earnings. Conditional on earnings being positive, they are described by a log normal distribution.
Row 3 presents the obvious generalization of the rule in row 2: it requires the mass point in the control distribution to be larger, along with the conditions described in row 2. Row 4 considers an important extension. Even if the conditions from row 2 are not satisfied, if the mass point in the control distribution is sufficiently larger than the mass point in the treatment distribution, the treatment can still first-order stochastically dominate the control.
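Rows 1 to 3 of Table 1 reduce to simple comparisons of predictive moments, which the following sketch encodes. The function names and input numbers are hypothetical. The predictive variance follows footnote (c) of Table 1. Row 4 is omitted because it additionally requires the crossing point c from footnote (f).

```python
def predictive_moments(ybar, s_sq, n):
    """Footnote (c) of Table 1: mu = ybar, sigma^2 = (1/n)(1/n + 1) s^2."""
    return ybar, (1.0 / n) * (1.0 / n + 1.0) * s_sq


def rule_row1(mu1, var1, mu0, var0):
    """Normal outcomes: higher mean and lower variance."""
    return mu1 > mu0 and var1 < var0


def rule_row2(mu1, var1, mu0, var0):
    """Log normal outcomes; the moments refer to log earnings."""
    return var1 < var0 and mu1 + var1 / 2.0 > mu0 + var0 / 2.0


def rule_row3(p1, mu1, sd1, p0, mu0, sd0):
    """Mixture with a mass point at zero earnings."""
    return p0 > p1 and mu0 < mu1 and sd0 > sd1


# Hypothetical sample statistics from log earnings, 400 observations per group.
mu1, var1 = predictive_moments(ybar=9.2, s_sq=0.8, n=400)
mu0, var0 = predictive_moments(ybar=9.0, s_sq=1.0, n=400)
row1 = rule_row1(mu1, var1, mu0, var0)
row2 = rule_row2(mu1, var1, mu0, var0)
row3 = rule_row3(0.30, mu1, var1 ** 0.5, 0.40, mu0, var0 ** 0.5)
```

When a rule returns `False`, the programs are simply not ranked by that criterion; it does not imply that the control dominates.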
4.2. Non-Normal Data: A Non-Parametric Approach

In this section, we consider a non-parametric (or flexibly parametric) approach to modeling the data and ranking programs. In order to achieve this generalization, we assume that the data are discrete (or can be finely discretized to an arbitrary degree) and use a multinomial distribution. Divide the support of the earnings distribution into M bins, $b_1, b_2, \ldots, b_M$. For each bin, we observe $n_{11}, n_{21}, \ldots, n_{M1}$ individuals in the treatment group and $n_{10}, n_{20}, \ldots, n_{M0}$ units in the control group. In this setting, a multinomial likelihood with probability of a draw from each bin $(\theta_{1t}, \theta_{2t}, \ldots, \theta_{Mt})$, t=1,0, is completely unrestrictive. The advantage of this setup is that it allows us to consider risk aversion under more general parametric assumptions. Since the model does not incorporate covariates, we still cannot consider the role of inequality aversion.

A conjugate prior for this model is a Dirichlet prior of the form $D(z_{1t}, z_{2t}, \ldots, z_{Mt})$, which leads to a posterior of the form

\[
(\theta_{1t}, \theta_{2t}, \ldots, \theta_{Mt}) \mid \text{data} \sim D(z_{1t} + n_{1t},\; z_{2t} + n_{2t},\; \ldots,\; z_{Mt} + n_{Mt}),
\]

for t=1,0. Chamberlain and Imbens (1996) suggest that an improper prior where $z_{mt} \to 0$ is preferable. The predictive distribution of earnings is given by:

\[
\begin{aligned}
\Pr(Y_{it} = b_m \mid \text{data}) &= \int \Pr(Y_{it} = b_m \mid \theta_{mt})\, p(\theta_{mt} \mid \text{data})\, d\theta_{mt} \\
&= \int \theta_{mt}\, p(\theta_{mt} \mid \text{data})\, d\theta_{mt} \\
&= \frac{n_{mt}}{\sum_m n_{mt}},
\end{aligned}
\]

where the last step is from the properties of the Dirichlet distribution. Thus the predictive distribution of $Y_{it}$ is essentially the empirical distribution of the data. Since the model does not allow for covariates, each individual has the same expected utility in each program,
\[
EU_t = \sum_m p_{mt}\, U(b_m),
\]

where $N_t = \sum_m n_{mt}$ and $p_{mt} = n_{mt}/N_t$, for t=1,0, and the ranking of each program will be determined by comparing $EU_1$ with $EU_0$. The stochastic dominance concepts discussed in the previous section could also be applied in this context, but there are no additional restrictions, beyond the definition of stochastic dominance, under which a positive mean difference becomes a sufficient condition for stochastic dominance.6
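The multinomial calculations above are simple enough to sketch directly. The example below is a minimal illustration under stated assumptions: the bin values and counts are hypothetical, the square-root utility is just one concave choice, and the function names are invented for this sketch. It computes the predictive bin probabilities, the expected utility of each program, and the discrete first-order dominance condition from Note 6.

```python
import math


def predictive_probs(counts, prior=0.0):
    """Posterior-mean bin probabilities under a Dirichlet(prior, ..., prior) prior.
    With prior=0 (the improper limit) this is the empirical distribution."""
    total = sum(counts) + prior * len(counts)
    return [(n + prior) / total for n in counts]


def expected_utility(probs, bins, utility):
    """EU_t = sum over bins of p_mt * U(b_m)."""
    return sum(p * utility(b) for p, b in zip(probs, bins))


def cumulative(probs):
    """Running sums P_m = p_1 + ... + p_m."""
    out, running = [], 0.0
    for p in probs:
        running += p
        out.append(running)
    return out


def first_order_dominates(p_treat, p_control, tol=1e-12):
    """Treatment dominates if its CDF is never above the control CDF
    and is strictly below it in at least one bin."""
    gaps = [c - t for t, c in zip(cumulative(p_treat), cumulative(p_control))]
    return all(g >= -tol for g in gaps) and any(g > tol for g in gaps)


# Hypothetical data: bin values (ordered low to high) and counts per arm.
bins = [0.0, 500.0, 1500.0, 4000.0]
n_treat = [55, 15, 20, 10]
n_control = [62, 16, 15, 7]

p1 = predictive_probs(n_treat)
p0 = predictive_probs(n_control)

print(expected_utility(p1, bins, math.sqrt) > expected_utility(p0, bins, math.sqrt))
print(first_order_dominates(p1, p0))
```

Any other utility function of a single argument can be passed in place of `math.sqrt`; the small tolerance only guards against floating-point noise in the cumulative sums.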
4.3. Non-Normal Data: A Maximum Likelihood Approach

An alternative to the multinomial model discussed in the previous section is to use maximum likelihood methods. This would allow for the inclusion of covariates, and more generally allows for the estimation of relatively complicated models with ease. Once a likelihood model has been specified for the outcomes of interest, then the posterior distribution of the parameters can be approximated from the maximum likelihood estimate using the usual formula:

\[
\theta \mid z \;\overset{\text{approx}}{\sim}\; N\!\left( \hat{\theta}_{ML},\; \left[ -\frac{\partial^2 \ell(\hat{\theta}_{ML})}{\partial\theta\, \partial\theta'} \right]^{-1} \right).
\]

Then the predictive distribution of the outcome, $p(y \mid z) = \int p(y \mid \theta)\, p(\theta \mid z)\, d\theta$, is readily simulated by first drawing from the posterior distribution of the parameters, and then drawing from the outcome distribution specified in the likelihood. The predictive distributions of the outcomes for the individuals of interest can then be fed into the decision problem outlined in Section 2.
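The two-step simulation can be sketched for the simplest case. The example below assumes, purely for illustration, that positive earnings are log-normal with parameters $(\mu, \sigma^2)$ and no covariates; the data are simulated and the function names are hypothetical. At the ML estimate the Hessian of the normal log-likelihood is diagonal, so the approximate posterior variances are $\sigma^2/n$ for $\mu$ and $2\sigma^4/n$ for $\sigma^2$.

```python
import math
import random


def fit_lognormal_ml(earnings):
    """ML estimates (mu, sigma2) for log earnings, plus the approximate
    posterior variances implied by the inverse of the negative Hessian."""
    logs = [math.log(y) for y in earnings]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((x - mu) ** 2 for x in logs) / n
    # At the MLE the Hessian is diagonal: -n/sigma2 for mu, -n/(2*sigma2**2) for sigma2.
    return mu, sigma2, sigma2 / n, 2.0 * sigma2 ** 2 / n


def draw_predictive(earnings, n_draws, rng):
    """Draw parameters from the normal approximation, then an outcome given them."""
    mu, sigma2, var_mu, var_sigma2 = fit_lognormal_ml(earnings)
    draws = []
    while len(draws) < n_draws:
        mu_d = rng.gauss(mu, math.sqrt(var_mu))
        sigma2_d = rng.gauss(sigma2, math.sqrt(var_sigma2))
        if sigma2_d <= 0:  # the normal approximation can leave the parameter space
            continue
        draws.append(math.exp(rng.gauss(mu_d, math.sqrt(sigma2_d))))
    return draws


rng = random.Random(0)
sample = [math.exp(rng.gauss(6.5, 1.5)) for _ in range(500)]  # hypothetical earnings
predictive = draw_predictive(sample, 2000, rng)
print(len(predictive), min(predictive) > 0)
```

Discarding non-positive variance draws is a convenience of this sketch; reparameterizing in terms of $\log \sigma$ would be another way to keep the approximation inside the parameter space.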
4.4. Relaxing Inequality Neutrality

The final extension we consider is to relax inequality neutrality. We use the framework described in Eq. (2), Section 3, to allow for the possibility of inequality aversion. The policy-maker’s social welfare for a program is a function of each individual’s certainty equivalent of earnings under that program. The policy-maker chooses the program with the higher social welfare. In practice, in order to implement this approach we would have to make assumptions about individual preferences and would also need to adopt a likelihood for the data.7 In order to obtain rules of thumb for ranking programs, we must make functional form assumptions about S(·). If S(·) is quadratic, then the programs can be ranked by the means and variances of the certainty equivalents of earnings for each individual in each program. This is shown in row 5 of Table 1.
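A small sketch may help fix ideas about why a quadratic S(·) reduces the ranking to means and variances. It assumes CRRA individual utility and the form $S(x) = bx + cx^2$ with $c < 0$; both are assumptions of this illustration (the chapter's Eq. (2) and Table 1 give the authoritative definitions), and the draws and weights are hypothetical.

```python
import math
from statistics import fmean, pvariance


def crra_utility(y, q):
    """CRRA utility for y > 0; log utility when q == 1."""
    return math.log(y) if q == 1 else y ** (1.0 - q) / (1.0 - q)


def certainty_equivalent(draws, q):
    """Sure amount whose utility equals expected utility over the draws."""
    eu = fmean(crra_utility(y, q) for y in draws)
    return math.exp(eu) if q == 1 else ((1.0 - q) * eu) ** (1.0 / (1.0 - q))


def quadratic_welfare(ces, b, c):
    """Average of S(x) = b*x + c*x**2 over certainty equivalents, written in
    terms of their mean and variance (an assumed form for S)."""
    m, v = fmean(ces), pvariance(ces)
    return b * m + c * (v + m * m)


# Hypothetical predictive draws of earnings for three individuals per program.
treat_draws = [[400.0, 900.0], [100.0, 2500.0], [1600.0, 1600.0]]
control_draws = [[625.0, 625.0], [400.0, 900.0], [1225.0, 1225.0]]
q = 0.5             # assumed individual risk aversion
b, c = 1.0, -1e-4   # assumed quadratic welfare weights, c < 0

ce_treat = [certainty_equivalent(d, q) for d in treat_draws]
ce_control = [certainty_equivalent(d, q) for d in control_draws]
print(quadratic_welfare(ce_treat, b, c), quadratic_welfare(ce_control, b, c))
```

Because the welfare function depends on the certainty equivalents only through their mean and variance, two programs with the same pair of moments are ranked equally under this assumed form.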
5. TWO EXAMPLES

5.1. The GAIN Demonstration, Riverside County

In this sub-section, we re-analyze data from the Riverside County portion of the GAIN experiment. GAIN was a welfare-to-work demonstration in which the randomly assigned treatment consisted of a set of labor-training and job-search initiatives imposed on recipients of Aid to Families with Dependent Children. Riverside County was widely viewed as the most successful portion of the GAIN experiment, and more generally as one of the success stories of welfare reform. We demonstrate that the traditional approach of analyzing t-statistics provides a narrow and misleading view of this program.

Table 2 summarizes outcome information for Riverside County. It shows that an analysis of Riverside based on the statistical significance of the treatment impact leads to an ambiguous ranking. From panel (a), we note that the overall treatment effect is positive and significant, but if we break out positive earnings and consider them separately, panels (b) and (c) reveal that the treatment impact, though positive, is no longer statistically significant. Note that there is a substantial difference in the proportion of zeros in the treatment and control groups.

Table 2. GAIN, Riverside County.

                                            Treatment     Control
(a) Average quarterly earnings
    Sample mean                                   712         550
    Sample standard deviation                   1,351       1,199
    Proportion of zeros                        0.5485      0.6249
    Average treatment effect                        163 (36)
(b) Average quarterly positive earnings
    Sample mean                                 1,578       1,465
    Sample standard deviation                   1,637       1,579
    Average treatment effect                        112 (72)
(c) Average log positive earnings
    Sample mean                                6.6499      6.5209
    Sample standard deviation                  1.4870      1.5257
    Average treatment effect                     0.1290 (0.0660)

The rules of thumb from Table 1 give a more consistent ranking of the program. Based on positive earnings, we note that treatment earnings have both a higher mean and higher variance than control earnings; thus, rules 1–3 do not apply. But when we consider the zeros, since the mass point at zero is greater for the control distribution by a sufficient degree than the mass point for the treated distribution, rule 4 implies first-order stochastic dominance. Thus, the ranking of programs is unambiguous.

Fig. 1 confirms this conclusion in a non-parametric setting. Fig. 1 plots the predictive distribution of earnings under treatment and control from the multinomial model. We see that the CDF of treatment earnings is everywhere to the right of the CDF of control earnings, implying first-order stochastic dominance. The case of Riverside offers one illustration of the fact that t-statistics can be a very misleading means of accounting for the uncertainty in an evaluation.

Fig. 1. Cumulative Distribution Function, Average Earnings, Riverside. (Vertical axis: cumulative probability; horizontal axis: earnings, ×10^4; solid = treated, dashed = control.)

Another dimension along which we can extend the analysis beyond the traditional approach is to consider the role played by inequality aversion in evaluating the Riverside experiment. Recall that a t-statistic implicitly assumes inequality neutrality. To allow for inequality aversion, we assume that individual preferences exhibit constant relative risk aversion (CRRA), and apply rule 5 from Table 1. Table 3 presents the means and variances of the certainty equivalents. The means for the treatment program are higher than for the control program, but the variance is also higher. The final column gives the restrictions on quadratic preferences for which the treatment is preferred. Since c < 0, b ≫ 0 is required.

Table 3. Certainty Equivalents, Riverside.

Coefficient of relative      Treatment              Control          Restrictions to prefer
risk aversion (q)          Mean    Variance      Mean    Variance    treatment (b/|c| >)
2                            68      11,740        45       1,865    0.0018
3                            34         941        24         295    0.0082
4                            23         283        16         107    0.0150
5                            17         131        12          54    0.0217

Calibrating quadratic preferences is not an intuitive exercise. Table 4 instead uses CRRA preferences to illustrate how risk aversion and inequality aversion are combined into a ranking of the programs. The table depicts the ranking of the programs using CRRA preferences both for individuals and the policy-maker, with a range of coefficients of risk and
Table 4. Welfare Rankings, Allowing for Risk and Inequality Aversion, Riverside. Coefficient of Relative Risk Aversion (q) 4 3 2 1 0 1 2 3 4 5
Coefficient of Inequality Aversion (e) 4
3
2
1
0
+ + + + +
+ + + + +
+ +
+
+
+ + + +
+ + + +
+ + + +
+ + + +
+ + + +
1
2
3
4
5
+
+ + + +
+ + + +
a
Note: +, treatment preferred to control; −, control preferred to treatment.
a: q=0.4, e=0 corresponds to RT0, which accepts the treatment.
inequality aversion. Note that for a range of values of risk and inequality aversion the treatment is not preferred to the control program. Interestingly, this includes the case of e=0, inequality neutrality, which appears to contradict Fig. 1, which indicates first-order stochastic dominance when e=0. This is accounted for by the fact that Table 4 is based on a model which allows for treatment effect heterogeneity, whereas Fig. 1 assumes a constant treatment effect. This reinforces a point that has been made in the evaluation literature, namely the importance of allowing for treatment effect heterogeneity (see Dehejia, 1999; Heckman & Smith, 1998). When we increase the degree of inequality aversion, the control program is preferred for a broad range of coefficients of risk aversion.

There is no combination of the parameters q and e that directly corresponds to RT0, the rule which would normally be followed. But a partial correspondence can be established as follows. RT0 embodies inequality neutrality, since it focuses on the ATE, hence e=0. It does not also correspond to q=0, because if the policy-maker is risk neutral as well then program 1 is chosen if $\bar{y}_1 > \bar{y}_0$. For RT0, program 1 is chosen if $\bar{y}_1 - \bar{y}_0 > c$, where c is the denominator of RT0 multiplied by the right-hand side of the same expression. Given the distribution of $\bar{y}_0$, we can numerically compute q such that the treatment would be chosen if $\bar{y}_1 - \bar{y}_0 > c$. For the Riverside program, this corresponds to q=0.4. Thus RT0 corresponds to slightly risk-averse preferences.

In summary, the traditional analysis of GAIN is misleading along two dimensions. First, the t-statistic on the ATE in positive earnings is not significant, though for total earnings the difference is significant. Instead, the rules of thumb deliver a more consistent ranking, which is confirmed by a non-parametric analysis. Second, the traditional approach ignores heterogeneity in the treatment impact.
5.2. The JTPA Data

In this section, we provide an analysis of data from the National JTPA Study, and also illustrate how program costs can be incorporated into the analysis we have presented thus far. The JTPA data were obtained from a randomized evaluation implemented between 1987 and 1989 to evaluate the efficacy of the employment and training services provided through the JTPA. The outcome which we study is the sum of 30-month earnings. Costs are measured according to the type of services each individual received (classroom training, on-the-job training, or other services). We present
Table 5. JTPA Data, Earnings (in dollars).

                                            Treatment     Control
(a) Average quarterly earnings
    Sample mean                                19,520      18,404
    Sample standard deviation                  19,912      18,760
    Proportion of zeros                        0.1027      0.1039
    Average treatment effect                      1,117 (580)
(b) Average quarterly positive earnings
    Sample mean                                21,754      20,538
    Sample standard deviation                   1,983       1,868
    Average treatment effect                      1,216 (610)
(c) Average log positive earnings
    Sample mean                                9.3084      9.2782
    Sample standard deviation                  1.5375      1.4645
    Average treatment effect                     0.0302 (0.0475)
Fig. 2. CDFs of Treatment and Control Earnings, JTPA. (Vertical axis: cumulative probability; horizontal axis: earnings, ×10^4.)
results for the adult male sample. Table 5 presents average earnings and training costs from the program. The treatment program has both a higher level and greater variation in treatment earnings compared to control earnings. The average cost per trainee is $515, whereas the ATE is $1,117 (with a standard error of 580). The treatment effect is not statistically significant. The rules of thumb from Table 1 are not able to rank the programs. Likewise, Fig. 2, which depicts non-parametric estimates of the CDFs of treatment and control earnings, is inconclusive. Unlike the GAIN example, the CDFs of treatment and control earnings cross twice, and furthermore neither first- nor second-order stochastic dominance helps to rank the distributions. One important difference between the JTPA and GAIN data is that the former has a much smaller proportion of zeros (typically 0.1, compared to 0.6 for GAIN).

The cost component of the program can be taken into account by netting the training costs out of earnings under treatment. The CDFs of net earnings are presented in Fig. 3. Training costs do not alter the basic conclusion: first- and second-order stochastic dominance are, in this case, inconclusive in ranking the programs. This hints that their ranking could depend on the degree of risk aversion of the policy-maker. We consider this issue below.

Fig. 3. CDFs of Treatment and Control Earnings Net of Training Costs, JTPA. (Vertical axis: cumulative probability; horizontal axis: earnings, ×10^4.)

Before turning to specific functional forms for the social welfare functions, it is useful to consider the role that inequality aversion could play in ranking the programs. Figs. 4–6 (below) present the percentiles of the earnings distribution of each program. The first figure considers earnings, ignoring training costs. We can see that in terms of percentiles of earnings the two programs start out together at $0 until about the 10th percentile. Then for the next range (roughly from the 10th to the 15th percentiles) the treatment distribution is higher. For the 15–30th percentiles, control earnings are higher, and above the 30th percentile treatment earnings are higher. In terms of inequality the treatment program raises earnings at both ends of the distribution.
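The percentile comparison, with and without training costs, can be sketched as follows. This is an illustration only: the earnings and cost figures are hypothetical, the nearest-rank percentile rule is one of several conventions, and the function names are invented for the example.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank empirical percentile, for integer pct in 1..100."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def percentile_gaps(treat, control, costs, pcts):
    """Treatment minus control percentiles, after netting each treated
    individual's training cost out of his or her earnings."""
    net = sorted(y - c for y, c in zip(treat, costs))
    ctrl = sorted(control)
    return {p: percentile(net, p) - percentile(ctrl, p) for p in pcts}


# Hypothetical 30-month earnings and per-person costs, for illustration only.
treat = [0.0, 4000.0, 15000.0, 26000.0, 60000.0]
control = [0.0, 6000.0, 14000.0, 24000.0, 50000.0]
costs = [500.0] * len(treat)

for p, gap in percentile_gaps(treat, control, costs, [20, 40, 60, 80, 100]).items():
    print(p, gap)
```

Passing a list of zeros for `costs` gives the comparison that ignores training costs, so the two cases can be produced with the same function.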
Fig. 4. Percentiles of Earnings, JTPA. (Vertical axis: log(earnings+1031); horizontal axis: percentile; solid = treated, dashed = control.)
Fig. 5. Percentiles of Earnings Net of Training Costs, JTPA. (Vertical axis: log(earnings+1031); horizontal axis: percentile; solid = treated, dashed = control.)
When we add training costs into the analysis, the conclusions remain similar, except that the treatment distribution is pulled down by the training costs. This implies that the treatment distribution only overtakes the control distribution at the upper end of the distribution. This is depicted in Figs. 5 and 6.

The two strands of the analysis – risk aversion and inequality aversion – are combined in the tables below. They present the welfare rankings of the treatment and control programs using CRRA preferences and allowing for a range of risk and inequality attitudes. Ignoring training costs, we see that the treatment program is preferred to the control as long as the policy-maker is risk neutral or risk loving, and if the coefficient of inequality aversion is less than 5. When the coefficient of inequality aversion is in an intermediate range, between 2 and 4, the treatment is also preferred when the policy-maker is risk averse. RT0 corresponds to e=0 and q=0.1, a very low level of risk aversion. From Table 5 we know that RT0 leads to the treatment being rejected (Tables 6 and 7).
Fig. 6. Difference in Percentiles of Treatment and Control Earnings, JTPA (detail: 6–100th percentile). (Horizontal axis: percentile; solid = without costs, dashed = with costs.)
Table 6. Welfare Rankings, Earnings, JTPA.
Coefficient of Relative Risk Aversion (q) 4 3 2 1 0 1 2 3 4 5 a
Coefficient of Inequality Aversion (e) 4
3
2
1
0
+ + + + +
+ + + + +
+ + + + +
+ + + + +
+ + + + +
1
2
3
4
5
+ + + + +
+ + + + +
+ + + + +
+ + + +
+ + + +
+ + + +
+ +
a
+
+
q=0.1, e=0 corresponds to RT0, which rejects the treatment.
Table 7. Welfare Rankings, Earnings Net of Training Costs, JTPA.
Coefficient of Relative Risk Aversion (q) 4 3 2 1 0 1 2 3 4 5 a
Coefficient of Inequality Aversion (e) 4
3
2
1
0
+ + + + +
+ + + + +
+ + + + +
+ + + + +
+ + + + +
1
2
3
4
5
+ + + + +
+ + + + +
+ + + + +
a
q=0.1, e=0 corresponds to RT0 and rejects the treatment.
When we take the training costs into account, the range of risk and inequality preferences for which the treatment is preferred shrinks. In particular, now the treatment is strictly preferred if the policy-maker is risk neutral or risk loving and the coefficient of inequality aversion is less than 5. Strong inequality aversion (e=5) can lead to a preference for the control, presumably since the treatment lifts earnings at the upper end of the distribution and reduces earnings for the lower range. But for less extreme degrees of inequality aversion, the preference over programs is decided purely by the degree of risk aversion. In this case, RT0 again corresponds to e=0 and q=0.1, and rejects the treatment.
6. CONCLUSION

This chapter has developed a set of simple rules of thumb that can be used to evaluate programs. These rules of thumb account for risk aversion and for inequality aversion. By applying these rules to data from the GAIN and JTPA experiments, we demonstrated that these two factors are very important in ranking programs. For the GAIN data, which is widely cited as an example of a successful training program, the treatment is not always preferred to the control, when we allow for inequality aversion. For the JTPA, the simple rules of thumb were unable to rank the programs, but the full analysis which combines inequality aversion, risk aversion, and heterogeneous treatment impacts demonstrated that for risk or inequality averse preferences the control program is preferred.

One of the limitations of the methods discussed is that they rely on strong assumptions. The rules of thumb use assumptions such as normality and log normality, and the non-parametric analysis presented here does not account for covariates. All of these can be readily addressed by applying more general models and techniques. But at that point the analysis would cease to be through simple rules of thumb and would become a full-blown decision analysis.
NOTES

1. For example, in choosing between a picnic and an afternoon at the museum, preferences may be influenced by the weather, which is unknown when choosing, but during the afternoon the weather becomes known and of course affects the enjoyment of the activity. In an evaluation context, the state of the world, $\mu_1 > \mu_0$ or $\mu_1 < \mu_0$, is unknown both when making the decision and also after the decision has been taken. The policy-maker’s utility from the realization of each program cannot reasonably be state dependent, because the underlying states are never observed by the policy-maker. In the previous example, it would be like having utility from a picnic and an afternoon at the museum depend on the weather on Mars.

2. To adopt state-dependent preferences, we would have to assume that the policy-maker eventually knows the true state of the world. This could be justified if the program will be evaluated multiple times, so over many trials the true parameter values will become known. However, in reality after a few evaluations, either the treatment or control program is chosen, so either $\mu_1$ or $\mu_0$ will become known, but not both.

3. In principle, Eq. (4) can also be interpreted as the predictive distribution from a classical regression model. This would allow us to relax the assumption of no sample selection bias.

4. The assumption of normality is common in, and indeed one of the foundations of, mean-variance comparisons in the portfolio choice literature (see Ingersoll, 1987; Samuelson, 1970). The normality assumption can be justified asymptotically, depending on the model. If the variable of interest is defined as a parameter in a likelihood model (e.g., the mean of the earnings distribution), then we know that it will be asymptotically normal (see inter alia DeGroot, 1970; Le Cam, 1953).

5. Stochastic dominance has found increasing application in applied empirical work. See inter alia Maasoumi and Heshmati (2000) and Maasoumi and Millimet (2005).

6. A positive difference in means is a necessary, but not sufficient, condition for the treatment program to first- and second-order stochastically dominate the control. But in the case of discrete distributions, it is very easy to check for dominance. The treatment first-order stochastically dominates the control if and only if:

\[
P_{m1} \le P_{m0} \quad \forall m, \text{ strict for some } m,
\]

where $P_{mt} = \sum_{i=1}^{m} p_{it}$, t=1,0. The treatment second-order stochastically dominates the control program if:

\[
\sum_{j=1}^{m} (P_{j1} - P_{j0}) \le 0 \quad \forall m, \text{ strict for some } m.
\]

7. The likelihood must allow for heterogeneity in individual earnings. Without heterogeneity, the distribution of the certainty equivalents of earnings under treatment (control) across individuals would be degenerate. The models considered in Sections 4.1 and 4.2, for example, do not allow for treatment effect heterogeneity.
ACKNOWLEDGMENTS

The author gratefully acknowledges conversations and collaboration with Joshua Angrist that sparked and helped to refine this research and comments from Daniel Millimet, two anonymous referees, Alberto Abadie, Gary Chamberlain, Richard Ericson, Andrew Gelman, Jinyong Hahn, Gur Huberman, Charles Jones, David Kranz, Adriana Lleras-Muney, and Dale Poirier. Thanks to seminar participants at Brown, Columbia, EUI Florence, UC Irvine, and University of Wisconsin, Madison.
REFERENCES

Chamberlain, G., & Imbens, G. (1996). Hierarchical Bayes models with many instrumental variables. Harvard Institute of Economic Research, Paper no. 1781.
DeGroot, M. (1970). Optimal statistical decisions. New York: McGraw-Hill.
Dehejia, R. (1999). Program evaluation as a decision problem. NBER Working Paper no. 6954, forthcoming, Journal of Econometrics.
Friedlander, D., Greenberg, D., & Robins, P. (1997). Evaluating government training programs for the economically disadvantaged. Journal of Economic Literature, 35, 1809–1855.
Heckman, J., & Smith, J. (1998). Evaluating the welfare state. NBER Working Paper no. 6542.
Ingersoll, J. (1987). Theory of financial decision making (pp. 37–42). New York: Rowman & Littlefield.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics, 1(11), 277–330.
Maasoumi, E., & Heshmati, A. (2000). Stochastic dominance amongst Swedish income distributions. Econometric Reviews, 19, 287–320.
Maasoumi, E., & Millimet, D. L. (2005). Robust inference concerning recent trends in US environmental quality. Journal of Applied Econometrics, 20, 55–77.
Neumann, J. von, & Morgenstern, O. (1944). Theory of games and economic behaviour. Princeton: Princeton University Press.
Samuelson, P. (1970). The fundamental approximation theorem of portfolio analysis in terms of means, variances and higher moments. Review of Economic Studies, 37(4), 537–542.
Sen, A. (1973). On economic inequality. Oxford: Clarendon Press.
Wald, A. (1950). Statistical decision functions. New York: John Wiley & Sons.
MATCHING ESTIMATION OF DYNAMIC TREATMENT MODELS: SOME PRACTICAL ISSUES

Michael Lechner

ABSTRACT

Lechner and Miquel (2001) approached the causal analysis of sequences of interventions from a potential outcome perspective based on selection-on-observables-type assumptions (sequential conditional independence assumptions). Lechner (2004) proposed matching estimators for this framework. However, many practical issues that might have substantial consequences for the interpretation of the results have not been thoroughly investigated so far. This chapter discusses some of these practical issues. The discussion is related to estimates based on an artificial data set for which the true values of the parameters are known and that shares many features of data that could be used for an empirical dynamic matching analysis.
1. INTRODUCTION

This chapter addresses practical issues associated with the non- or semiparametric estimation of dynamic treatment models that are identified by sequential selection-on-observables (or conditional independence) assumptions.
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 289–333
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00010-2
While the effects of dynamic selection bias and the impact of sequential interventions have received little attention in the applied econometrics literature so far, there is a substantial literature about the estimation of average ‘‘causal effects’’ of interventions using large micro data sets based on a static causal model. Angrist and Krueger (1999) and Heckman, LaLonde, and Smith (1999) provide comprehensive overviews of this vast literature. Several authors have addressed dynamic causal issues by using ad hoc modifications of the static causal framework. For example, Bergemann, Fitzenberger, and Speckesser (2004) evaluate training program sequences, and Lechner (1999) and Sianesi (2004) propose procedures to deal with participants entering labor market programs at different points in their unemployment spell. In a related setting, Crépon and Kramarz (2002) use different ‘‘start times’’ to analyze the effects of the introduction of a policy to reduce standard working hours in France. A similar problem is the issue of program duration as analyzed by Behrman, Cheng, and Todd (2004), and Behrman, Sengupta, and Todd (2005) in the context of a school subsidy experiment. Because these papers use static models of potential outcomes, it is difficult to define the desired causal effect in such a way that the impact of the (implicit) assumptions about the dynamic selection process on the estimand becomes apparent.1

Robins (1986) first suggested an explicitly dynamic causal framework based on potential outcomes that allows the definition of causal effects of dynamic interventions and systematically addresses this type of selection problem. His approach was subsequently applied in epidemiology and biostatistics (e.g. Robins, 1989, 1997, 1999; Robins, Greenland, & Hu, 1999a, 1999b, for discrete treatments; Gill & Robins, 2001, for continuous treatments) to define the effect of treatments in discrete time.
Identification is achieved by sequential randomization assumptions (see the very comprehensible summary by Abbring, 2003). The effects are typically estimated using parametric models. Based on this framework, Murphy (2003) proposes estimators for optimal treatment rules that specify how the treatment changes over time depending on how covariates change. Recently, Lechner and Miquel (2001) extend Robins’ (1986) framework to comparisons of more general sequences, to different parameters and selection processes, and to identifying assumptions that are more relevant in typical microeconometric studies. Focusing on the case when all elements that influence selection and outcomes at each stage of the sequence are observable, Lechner and Miquel (2001) discuss different identification conditions required for particular dynamic causal effects. Since the assumptions used in Lechner
and Miquel (2001) bear some similarity to the selection on observables or conditional independence assumption (CIA) that is prominent in the static evaluation literature, Lechner (2004) proposed matching and inverse probability weighting estimators that are dynamic extensions of similar estimators used in the static model. These estimators retain most of the flexibility and convenience of the static methods that have made them the workhorse in empirical evaluation studies, particularly in Europe (see the excellent survey by Imbens, 2004; Heckman et al., 1999). Applications of the explicit dynamic causal framework based on potential outcomes are very rare in econometrics so far.2 Since there are few experiences so far with this type of estimation for these models, this chapter discusses several issues that come up when the dynamic approach is applied in practice. To begin with, three examples are chosen to show that the dynamic model can be fruitfully used to address questions that surface in applied evaluation studies and that are hard to address within a static framework, because the latter is not able to handle selection problems that occur while a particular treatment is in operation. The first example concerns the effects of sequences of programs. The breakdown of the static model occurs when selection into the second and any subsequent programs is influenced by the outcome of the previous programs, so that particular control variables become endogenous in particular ways with respect to the complete sequence. The static model provides no way to handle such intermediate outcomes. The second example concerns the effects of earlier or later program starts. 
As mentioned above, there have been attempts to estimate such effects in the evaluation literature by Sianesi (2004).3 However, her adjusted static framework does not clearly spell out the causal contrasts being estimated and does not explicitly define the exogeneity conditions required for the control variables to identify the underlying causal effect. The latter deficiency is shared by papers that try to mitigate the problem of different starting dates in evaluation studies by randomly drawing start dates (see Lechner, 1999; Gerfin & Lechner, 2002; and the critique of this procedure by Fredriksson & Johansson, 2003). The third example is the effect of the actual duration of a program. The problem with the actual duration of a program is that it could be endogenous: For example, if the effect of a program comes from the signal that participation sends, then it is very likely that people will leave the program while it is under way. This attrition is, however, an effect of beginning and staying up to that point in the program. So far, empirical evaluation studies
have circumvented this problem by considering the effects of planned duration only (e.g. Lechner, Miquel, & Wunsch, 2004). Such studies estimate a different parameter that may or may not be of interest in the particular situation. The CIA that justifies matching estimation in the static context is sometimes called a data hungry identification and estimation strategy. If static matching is data hungry, then dynamic matching is starving for data. This starvation relates to the number of observations necessary in the particular sequences to obtain precise inference, to the time-varying variables required to obtain identification, and to the heterogeneity of the characteristics observed in the particular treatments that may lead to support and over-parametrization problems. Taken together, these factors could lead to the undesirable situation that the price to pay for using the much more informative dynamic approach is that it produces very noisy estimates on a common support that has no policy relevance. Therefore, this chapter considers these issues in greater depth from several different perspectives: (i) a comparison to static matching; (ii) the relation between the common support and the length of the specified sequences; and (iii) the number of regressors included in the propensity score estimation. This chapter does not derive new analytical results. The discussion is based on known properties of the estimators, as well as on the performance of the estimation procedures in the data. These data come from a rather elaborate attempt to generate artificial data similar to that available in actual evaluation studies of European-type active labor market programs (many covariates, 40 periods with autocorrelation, 4 programs with different start dates and lock-in effects, etc.). Since in terms of computing time its generation is far too expensive for a Monte Carlo study, only one replication is used. For the limited illustrative goals of this chapter, this suffices. 
The chapter proceeds as follows: Section 2 outlines the dynamic causal framework suggested by Lechner and Miquel (2001) and Lechner (2004). The notation is introduced and the basic identification conditions are restated. The estimation problem is explained in Section 3 and sequential matching as proposed by Lechner (2004) is reviewed. Section 4 details the artificial data. Section 5 presents the three empirical examples. Section 6 covers the brief comparison to static matching. The issues of additional variables and the relation between length of sequence and common support are discussed in Section 7. Section 8 concludes and Appendix A contains some descriptive statistics concerning the distribution of the true values of the potential outcomes.
2. THE DYNAMIC CAUSAL MODEL: NOTATION, EFFECTS, AND IDENTIFICATION

This section briefly repeats the definition of the dynamic causal model as well as the identification results derived by Lechner and Miquel (2001) for the case of sequential selection on observables. To ease the notational burden, I use a three-period-two-treatments model to discuss the most relevant issues that distinguish the dynamic from the static model, although in the application more periods and more treatments are considered. As usual in the econometric evaluation literature, I use the standard statistics terminology based on treatments and potential outcomes to define causal effects.

2.1. Basic Structure of the Model

Suppose that there is an initial period in which everybody is in the same treatment, plus two subsequent periods in which different treatment states are realized. The periods are indexed by t or τ ($t, \tau \in \{0, 1, 2\}$). The treatment defined over all periods is described by a vector of Bernoulli random variables (RV), $S = (S_1, S_2)$. For notational convenience, the treatment of the initial period ($S_0 = 0$) is not always mentioned explicitly. A particular realization of $S_t$ is denoted by $s_t \in \{0, 1\}$. Denote the history of variables up to and including period t by a bar below that variable, i.e. $\underline{s}_2 = (s_1, s_2)$.4 Since we are not restricting effect heterogeneity over time, it makes sense to define potential outcomes in terms of sequences of potential states of the world. Thus, in period 1, an individual (or a firm, country, or any other unit of interest) is observed in exactly one of two treatments. In period 2, the same treatment occurring in that period will be captured by two different potential outcomes depending on the treatment in period 1. Therefore, an individual participates in one of four treatments, defined by the sequences (0,0), (1,0), (0,1), and (1,1).
Thus, every individual participates in exactly one sequence defined by $s_1$ and another sequence defined by the same value $s_1$ and a value of $S_2$. To sum up, in the two (plus one)-period-two-treatments example we consider six different overlapping potential outcomes corresponding to two mutually exclusive states defined by treatment status in period 1 only, plus four mutually exclusive states defined by treatment status in periods 1 and 2 together. Variables used to measure the effects of the treatment in period $t$, i.e. the potential outcomes, are indexed by treatments and denoted by $Y_t^{\underline{s}_1}$ ($t \ge 1$) or $Y_t^{\underline{s}_2}$ ($t \ge 2$). They are measured at the end of each period, whereas treatment
MICHAEL LECHNER
status is measured at the beginning of each period. For each sequence length (one or two periods), one of the potential outcomes is observable and denoted by $Y_t$. The resulting observation rules are defined in Eq. (1):
$$Y_1 = S_1 Y_1^{1} + (1 - S_1) Y_1^{0};$$
$$Y_2 = S_1 Y_2^{1} + (1 - S_1) Y_2^{0} = S_1 S_2 Y_2^{11} + (1 - S_1) S_2 Y_2^{01} + S_1 (1 - S_2) Y_2^{10} + (1 - S_1)(1 - S_2) Y_2^{00}. \tag{1}$$
Finally, variables that may influence treatment selection and (or) potential outcomes are denoted by $X$. The $K$-dimensional vector $X_t$ may contain functions of $Y_t$ and is observable at the same time as $Y_t$.
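The observation rules of Eq. (1) can be illustrated with a minimal Python sketch. The function name `observed_outcomes` and the numerical potential outcomes are hypothetical and serve only as an illustration; they are not taken from the chapter or its data.

```python
from itertools import product


def observed_outcomes(s1, s2, y1_pot, y2_pot):
    """Apply the observation rules of Eq. (1) for one unit.

    y1_pot maps s1 -> Y_1^{s1}; y2_pot maps (s1, s2) -> Y_2^{s1 s2}.
    """
    y1 = s1 * y1_pot[1] + (1 - s1) * y1_pot[0]
    y2 = (s1 * s2 * y2_pot[(1, 1)]
          + (1 - s1) * s2 * y2_pot[(0, 1)]
          + s1 * (1 - s2) * y2_pot[(1, 0)]
          + (1 - s1) * (1 - s2) * y2_pot[(0, 0)])
    return y1, y2


# Hypothetical potential outcomes of a single unit (illustrative numbers only).
y1_pot = {0: 10.0, 1: 12.0}
y2_pot = {(0, 0): 20.0, (0, 1): 21.0, (1, 0): 23.0, (1, 1): 26.0}

for s1, s2 in product((0, 1), repeat=2):
    y1, y2 = observed_outcomes(s1, s2, y1_pot, y2_pot)
    # Only the potential outcomes of the realized sequence are revealed.
    assert y1 == y1_pot[s1] and y2 == y2_pot[(s1, s2)]
```

Whatever sequence is realized, the remaining potential outcomes of the unit stay unobserved, which is why identifying assumptions are needed.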
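2.2. Average Causal Effects
As in the static model, the potential outcomes are used to define several average causal effects. Eq. (2) defines the causal effect (for period $t$) of a sequence of treatments up to period 1 or 2 ($\tau$, $\tau'$) compared to an alternative sequence of the same or a different length for a population defined by one of those sequences or a third sequence:
$$\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_{\tilde\tau}^j) = E(Y_t^{\underline{s}_\tau^k} \mid \underline{S}_{\tilde\tau} = \underline{s}_{\tilde\tau}^j) - E(Y_t^{\underline{s}_{\tau'}^l} \mid \underline{S}_{\tilde\tau} = \underline{s}_{\tilde\tau}^j), \tag{2}$$
$$0 \le \tilde\tau; \quad 1 \le \tau, \tau' \le 2; \quad \tilde\tau \le \tau', \tau; \quad k \ne l; \quad k \in (1, \ldots, 2^{\tau}); \quad l \in (1, \ldots, 2^{\tau'}); \quad j \in (1, \ldots, 2^{\tilde\tau}).$$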
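The treatment sequences indexed by $k$, $l$, and $j$ may correspond to the one-period sequences (0) or (1) if $\tau$ (or $\tau'$) denotes period 1, or to the longer sequences (0,0), (0,1), (1,0), or (1,1) if $\tau$ (or $\tau'$) equals 2. Lechner and Miquel (2001) call $\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}$ the dynamic average treatment effect (DATE). Accordingly, $\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_\tau^k)$, as well as $\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_{\tau'}^l)$ are termed DATE on the treated (DATET) and DATE on the nontreated. There are cases in between, like $\theta_t^{\underline{s}_2^k; \underline{s}_2^l}(\underline{s}_1^j)$, for which the conditioning set is defined by a sequence shorter than the ones defining the causal contrast. Note that the effects are symmetric for the same population ($\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_\tau^k) = -\theta_t^{\underline{s}_{\tau'}^l; \underline{s}_\tau^k}(\underline{s}_\tau^k)$). This feature, however, does not restrict effect heterogeneity across individuals ($\theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_\tau^k) \ne \theta_t^{\underline{s}_\tau^k; \underline{s}_{\tau'}^l}(\underline{s}_{\tau'}^l)$).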
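As a purely illustrative sketch of these definitions, the following Python fragment computes a DATE and the corresponding effects for the subpopulations defined by first-period treatment. It assumes a tiny hypothetical population in which all four potential outcomes of every unit are known, which is never the case with real data; the function name `date` and all numbers are assumptions made for the example.

```python
def date(units, seq_k, seq_l, s1_cond=None):
    """Average of Y_2^{seq_k} - Y_2^{seq_l}, optionally only for units with S_1 = s1_cond."""
    group = [u for u in units if s1_cond is None or u["s1"] == s1_cond]
    return sum(u["y2"][seq_k] - u["y2"][seq_l] for u in group) / len(group)


# Hypothetical units: realized first-period treatment and all potential outcomes.
units = [
    {"s1": 1, "y2": {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 6.0}},
    {"s1": 1, "y2": {(0, 0): 2.0, (0, 1): 2.0, (1, 0): 4.0, (1, 1): 6.0}},
    {"s1": 0, "y2": {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}},
    {"s1": 0, "y2": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 2.0}},
]

k, l = (1, 1), (0, 0)
date_all = date(units, k, l)                # effect in the whole population
date_s1_treated = date(units, k, l, 1)      # population with S_1 = 1
date_s1_nontreated = date(units, k, l, 0)   # population with S_1 = 0

# Symmetry holds within the same population ...
assert date(units, k, l, 1) == -date(units, l, k, 1)
# ... but the effects may differ across populations.
assert date_s1_treated != date_s1_nontreated
```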
2.3. Identification
Assume that a large sample $\{s_{1i}, s_{2i}, x_{0i}, x_{1i}, x_{2i}, y_{1i}, y_{2i}\}_{i=1:N}$ of size $N$ is available, randomly drawn from a very large population of interest. The latter is characterized by the corresponding random variables $(S_1, S_2, X_0, X_1, X_2, Y_1, Y_2)$.⁵ Furthermore, assume that all conditional expectations that are of interest in the remainder of this chapter exist. To ease notation further, from now on assume that interest centers on the effects of sequences of length two only (effects of sequences of length one are not considered separately, because their identification is discussed in the extensive static literature). Lechner and Miquel (2001) show that if we can observe the variables that jointly influence selection at each stage as well as the outcomes, some average treatment effects are identified by weak CIAs:⁶
Weak dynamic conditional independence assumption (W-DCIA)⁷
(a) $Y_2^{00}, Y_2^{10}, Y_2^{01}, Y_2^{11} \perp\!\!\!\perp S_1 \mid X_0 = x_0$;
(b) $Y_2^{00}, Y_2^{10}, Y_2^{01}, Y_2^{11} \perp\!\!\!\perp S_2 \mid \underline{X}_1 = \underline{x}_1, S_1 = s_1$;
(c) $1 > P(S_1 = 1 \mid X_0 = x_0) > 0$, $1 > P(S_2 = 1 \mid \underline{X}_1 = \underline{x}_1, S_1 = s_1) > 0$; $\forall \underline{x}_1 \in \underline{\chi}_1$, $\forall s_1: s_1 \in \{0, 1\}$.
$\underline{\chi}_1 = (\chi_0, \chi_1)$ denotes the support of $X_0$ and $X_1$. Part (a) of W-DCIA states that the potential outcomes are independent of treatment choice in period 1 ($S_1$) conditional on $X_0$. This is the standard version of the static CIA (e.g. Rubin, 1974). Part (b) states that conditional on the treatment in period 1, on observable outcomes in period 1 (which may be part of $X_1$), and on the confounding variables from periods 0 and 1 ($\underline{X}_1$), potential outcomes are independent of treatment choice in period 2 ($S_2$). To see whether such an assumption is plausible in a particular application, we have to ask which variables influence potential changes in treatment status as well as outcomes and whether they are observable.
If the answer to the latter question is yes, and if there is common support (defined in part (c) of W-DCIA), then we have identification, even if some or all of the conditioning variables in period 2 are influenced by the outcome of the treatment in period 1. Lechner and Miquel (2001) show that, for example, quantities like $E(Y_2^{11})$, $E(Y_2^{11} \mid S_1 = 0)$, $E(Y_2^{11} \mid S_1 = 1)$, or $E[Y_2^{11} \mid \underline{S}_2 = (1,0)]$ are identified, but that $E[Y_2^{11} \mid \underline{S}_2 = (0,0)]$ or $E[Y_2^{11} \mid \underline{S}_2 = (0,1)]$ are not identified. Thus, $\theta_2^{\underline{s}_2^k; \underline{s}_2^l}$ and $\theta_2^{\underline{s}_2^k; \underline{s}_2^l}(s_1^j)$ are identified $\forall s_1^k, s_2^k, s_1^l, s_2^l, s_1^j, s_2^j \in \{0, 1\}$, but $\theta_2^{\underline{s}_2^k; \underline{s}_2^l}(\underline{s}_2^j)$ is not identified if $s_1^l \ne s_1^k$, or $s_1^l \ne s_1^j$, or $s_1^k \ne s_1^j$. This result states that pair-wise comparisons of all sequences are identified, but only for groups of
individuals defined by their treatment status in periods 0 or 1. The relevant distinction between the populations defined by treatment states in the first and subsequent periods is that in the first period, treatment choice is random conditional on exogenous variables, which is the result of the initial condition stating that $S_0 = 0$ holds for everybody. However, in the second period, randomization into these treatments is conditional on variables already influenced by the first part of the treatment. W-DCIA has an appeal for applied work as a natural extension of the static framework. However, W-DCIA does not identify the classical treatment effects on the treated if the sequences of interest differ in the first period.
Lechner and Miquel (2001) show that to identify all treatment parameters, W-DCIA must be strengthened by essentially imposing that the confounding variables used to control selection into the treatment of the second period are not influenced by the selection into the first-period treatment. This can be summarized by an independence condition like $Y_2^{\underline{s}_2} \perp\!\!\!\perp \underline{S}_2 \mid \underline{X}_1$ (Lechner & Miquel, 2001, call this the strong dynamic conditional independence assumption, S-DCIA). Note that the conditioning set includes the outcome variables from the first period. This is the usual CIA used in the multiple treatment framework (with four treatments). In other words, when the control variables are not influenced by the previous treatments, the dynamic problem collapses to a static problem of four treatments with selection on observables.
Any attempt at nonparametric estimation of these effects faces the problem that adjustments based on a potentially high-dimensional vector of characteristics and intermediate outcomes ($X$) are required (details below). Therefore, in the applied static matching literature balancing scores are a popular device to reduce the dimension of the estimation problem (see Rosenbaum & Rubin, 1983).
Lechner and Miquel (2001) show that similar properties hold for the dynamic model as well.
Balancing score property for W-DCIA
If the conditions of W-DCIA hold, then
(a) $Y_2^{00}, Y_2^{10}, Y_2^{01}, Y_2^{11} \perp\!\!\!\perp S_1 \mid b_1(X_0) = b_1(x_0)$ holds for all $b_1(x_0)$ such that
$$E[p^{s_1}(x_0) \mid b_1(X_0) = b_1(x_0)] = p^{s_1}(x_0); \quad p^{s_1}(x_0) := P(S_1 = s_1 \mid X_0 = x_0);$$
(b) $Y_2^{00}, Y_2^{10}, Y_2^{01}, Y_2^{11} \perp\!\!\!\perp S_2 \mid b_2(\underline{X}_1, S_1) = b_2(\underline{x}_1, s_1)$ holds for all $b_2(\underline{x}_1, s_1)$ such that
$$E[p^{s_2|s_1}(\underline{x}_1) \mid b_2(\underline{X}_1, S_1) = b_2(\underline{x}_1, s_1)] = p^{s_2|s_1}(\underline{x}_1); \quad p^{s_2|s_1}(\underline{x}_1) := P(S_2 = s_2 \mid \underline{X}_1 = \underline{x}_1, S_1 = s_1).$$
A low-dimensional choice for balancing scores suggested by Lechner and Miquel (2001) consists of conditional transition probabilities in combination with the variable indicating the selection in the previous period (which of course can be ignored in the first period): $b_1(x_0) = p^{s_1}(x_0)$, $b_2(\underline{x}_1, s_1) = [p^{s_2|s_1}(\underline{x}_1), s_1]$.
3. ESTIMATION
3.1. Structure of Sequential Estimators
Lechner (2004) shows that these scores are convenient for constructing sequential propensity score-matching estimators to correct for selection bias under W-DCIA. I focus on this particular estimator because of its simplicity and because of its frequent use in empirical evaluation studies. Other static matching-type estimators can be adapted to the dynamic context in a similar way (see Imbens, 2004, for an overview of available estimators). I refrain from discussing estimation based on the S-DCIA explicitly, because estimation under S-DCIA is essentially a static problem with an increased number of treatments. Such estimation problems for multiple treatments are discussed by Imbens (2000) and Lechner (2001, 2002) and need not be explained here. Nevertheless, the suggested estimators are consistent under S-DCIA as well. Thus, a comparison of estimators that are consistent under both W-DCIA and S-DCIA (sequential matching) and those that are consistent under S-DCIA only (static matching) could serve as the basis for checking the plausibility of S-DCIA (or the endogeneity of some covariates). However, this point is not developed further in this chapter.
Using the balancing scores suggested above, the following estimand (quantity to be estimated by sample analogs of observables) results for quantities identified under W-DCIA:
$$E(Y_2^{\underline{s}_2^k} \mid S_1 = s_1^j) = \underset{p^{s_1^k}(X_0)}{E} \Big\{ \underset{p^{s_2^k|s_1^k, s_1^k}(\underline{X}_1)}{E} \big[ E(Y_2 \mid \underline{S}_2 = \underline{s}_2^k, p^{s_2^k|s_1^k, s_1^k}(\underline{X}_1)) \mid S_1 = s_1^k, p^{s_1^k}(X_0) \big] \mid S_1 = s_1^j \Big\}, \tag{3}$$
$$p^{s_2^k|s_1^k, s_1^k}(\underline{X}_1) := [p^{s_2^k|s_1^k}(\underline{X}_1), p^{s_1^k}(X_0)]; \quad s_1^k, s_2^k, s_1^j \in \{0, 1\}.$$
To learn the counterfactual outcome for the population participating in $s_1^j$ (the target population) had they participated in sequence $\underline{s}_2^k$, we need to reweight the participants in $\underline{s}_2^k$ to make them comparable to the target population ($s_1^j$). The dynamic, sequential structure of the causal model restricts the possible ways to do so. Intuitively, for the participants in the target population, we should reweight participants in the first element of the sequence of interest ($s_1^k$) such that they have the same distribution of $p^{s_1^k}(X_0)$ as the target distribution. Call this artificially created comparison group 1. Yet, to estimate the effect of the full sequence, the outcomes of participants in $\underline{s}_2^k$ instead of $s_1^k$ are required. Thus, an artificial subpopulation of participants in $\underline{s}_2^k$ that has the same joint distribution of $p^{s_1^k}(X_0)$ and $p^{s_2^k|s_1^k}(\underline{X}_1)$ as the artificially created comparison group 1 is required. The same principle applies for DATEs in the population. All proposed estimators in Lechner (2004) have the same structure, in the sense that they are computed as weighted means of the outcome variables observed in subsample $\underline{S}_2 = \underline{s}_2^k$. The weights depend on the specific effects of interest and are functions of the balancing scores.
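The logic of this sequential reweighting can be checked in a small population-level sketch in Python. Everything in it is a hypothetical assumption made for illustration: the binary covariates, the functional forms, and all function names. For transparency it conditions directly on the discrete covariates instead of on the balancing scores, and it works with exact population probabilities rather than a sample. The intermediate covariate X1 is influenced by the first-period treatment, so the sequential adjustment recovers E(Y_2^{11} | S_1 = 0), whereas a static adjustment that uses the target group's own distribution of (X0, X1) does not.

```python
from itertools import product


def p_x0(x0):
    return 0.5


def p_s1(s1, x0):
    p = 0.3 + 0.4 * x0                      # selection into period 1 depends on X0
    return p if s1 == 1 else 1 - p


def p_x1(x1, x0, s1):
    p = 0.2 + 0.3 * x0 + 0.4 * s1           # X1 is influenced by the first treatment
    return p if x1 == 1 else 1 - p


def p_s2(s2, x0, x1, s1):
    p = 0.2 + 0.2 * x0 + 0.4 * x1 + 0.1 * s1
    return p if s2 == 1 else 1 - p


def mean_y(s1, s2, x0, x1):
    return 1.0 + 2.0 * s1 + 3.0 * s2 + x0 + 2.0 * x1


# Observable joint distribution of (X0, S1, X1, S2) and mean observed outcome per cell.
joint = {}
for x0, s1, x1, s2 in product((0, 1), repeat=4):
    pr = p_x0(x0) * p_s1(s1, x0) * p_x1(x1, x0, s1) * p_s2(s2, x0, x1, s1)
    joint[(x0, s1, x1, s2)] = (pr, mean_y(s1, s2, x0, x1))


def cells(x0=None, s1=None, x1=None, s2=None):
    want = (x0, s1, x1, s2)
    return [(pr, y) for key, (pr, y) in joint.items()
            if all(w is None or w == v for w, v in zip(want, key))]


def prob(**fixed):
    return sum(pr for pr, _ in cells(**fixed))


def mean_outcome(**fixed):
    return sum(pr * y for pr, y in cells(**fixed)) / prob(**fixed)


def sequential_estimand(k1, k2, j):
    """Nested expectations in the spirit of Eq. (3), on covariates instead of scores."""
    total = 0.0
    for x0 in (0, 1):
        w0 = prob(x0=x0, s1=j) / prob(s1=j)                      # X0 in the target group
        inner = 0.0
        for x1 in (0, 1):
            w1 = prob(x0=x0, s1=k1, x1=x1) / prob(x0=x0, s1=k1)  # X1 among S1 = k1
            inner += w1 * mean_outcome(x0=x0, s1=k1, x1=x1, s2=k2)
        total += w0 * inner
    return total


def static_adjustment(k1, k2, j):
    """Static adjustment using the target group's own joint distribution of (X0, X1)."""
    total = 0.0
    for x0, x1 in product((0, 1), repeat=2):
        w = prob(x0=x0, x1=x1, s1=j) / prob(s1=j)
        total += w * mean_outcome(x0=x0, s1=k1, x1=x1, s2=k2)
    return total


def true_counterfactual_mean(k1, k2, j):
    """E(Y_2^{k1 k2} | S_1 = j) computed from the structural model itself."""
    total = 0.0
    for x0 in (0, 1):
        w0 = p_x0(x0) * p_s1(j, x0) / sum(p_x0(v) * p_s1(j, v) for v in (0, 1))
        for x1 in (0, 1):
            total += w0 * p_x1(x1, x0, k1) * mean_y(k1, k2, x0, x1)
    return total
```

In this hypothetical design the sequential estimand coincides with the true counterfactual mean, while the static adjustment is biased because the target group's X1 reflects the treatment it actually received in period 1.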
Note that in the case of more than two treatments, the balancing scores for Eqs. (4) and (5) will differ with respect to the probability of participating in the first period. For Eq. (4), the required quantity is $P(S_1 = s_1^k \mid X_0 = x_0, S_1 \in \{s_1^k, s_1^l\})$, whereas for Eq. (5), in which all of the population is the target, $P(S_1 = s_1^k \mid X_0 = x_0)$ is appropriate.
3.2. Sequential Matching (SM) Estimators
Lechner and Miquel (2001) propose to extend the simple pair-matching estimators that are highly popular in applied evaluation studies to the dynamic context. The idea is to perform the required adjustments by sequentially choosing close pairs of observations in the various steps, so as to mimic the sequential conditional expectations appearing in expressions (4) and (5). The first step is the same for both effects and consists in finding for every member of $S_1 = s_1^k$ a member of $\underline{S}_2 = \underline{s}_2^k$ with very similar (the same) values of $p^{s_2^k|s_1^k}(\underline{x}_{1,i})$ and $p^{s_1^k}(x_{0,i})$. Note that matching must be with replacement, because the target population may be larger than the treatment population. In the second step, every member of $S_1 = s_1^j$ (Eq. (4)) or $S_0 = 0$ (Eq. (5)) is to be paired with a member of $S_1 = s_1^k$ with very similar (same) values of $p^{s_1^k}(x_{0,i})$. The positive weights that are attached to some or all members of $\underline{S}_2 = \underline{s}_2^k$ coming from step 1 are then updated depending on how often an observation in $\underline{S}_2 = \underline{s}_2^k$ is matched to an observation of the target population via the intermediate matching step. This procedure leads to the following weights:
$$w_i^{\underline{s}_2^k, s_1^j} = \frac{1}{N^{s_1^j}} \sum_{n \in s_1^j} \sum_{m \in s_1^k} v_1[p^{s_1^k}(x_{0,n}), p^{s_1^k}(x_{0,m}); \cdot] \; v_2[p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,m}), p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,i}); s_1^k, \cdot], \quad \forall i \in \underline{S}_2 = \underline{s}_2^k; \tag{6}$$
$$w_i^{\underline{s}_2^k} = \frac{1}{N} \sum_{n=1}^{N} \sum_{m \in s_1^k} v_1[p^{s_1^k}(x_{0,n}), p^{s_1^k}(x_{0,m}); \cdot] \; v_2[p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,m}), p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,i}); s_1^k, \cdot], \quad \forall i \in \underline{S}_2 = \underline{s}_2^k. \tag{7}$$
$N^{s_1^j}$ denotes the number of observations for which $S_1 = s_1^j$. The function $v_1[p^{s_1^k}(x_{0,n}), p^{s_1^k}(x_{0,m}); \cdot]$ is defined to be 1 if $p^{s_1^k}(x_{0,m})$ is closest to $p^{s_1^k}(x_{0,n})$ among all observations belonging to the subsample defined by $S_1 = s_1^k$, and zero otherwise. Similarly, $v_2[p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,m}), p^{s_2^k|s_1^k, s_1^k}(\underline{x}_{1,i}); s_1^k, \cdot]$ is 1 if observation $i$ is closest to observation $m$ (with $s_{1,m} = s_1^k$) in terms of $p^{s_2^k|s_1^k}(\underline{x}_{1,i})$ and $p^{s_1^k}(x_{0,i})$, and zero otherwise. The Mahalanobis metric (a quadratic form of the variables defining the distance weighted by the inverse of their sample covariance matrix) is a frequently used measure for similarity. Note that the weight of observation $i$ is 0 if it is not matched to any member of the target population. On the other extreme, if observation $i$ is matched to every member of the target population its weight would be 1. A specific variant of this estimator is shown in Table 1 for the example of estimating $\theta_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1)$.
Some remarks about this protocol that are already contained in Lechner (2004) are worth repeating: first, matching is with replacement. Every step of the matching sequence is essentially the same as for matching in a static framework. However, sequential propensity score matching involves several probabilities in the second-period matching step. Second, some issues arise from the sequential nature of matching. By choosing observations as matches with similar values of the probabilities instead of the same values (because such observations may not be available), it may happen that the probabilities
Table 1. A Sequential Matching Estimator for $\theta_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1)$ Based on Propensity Scores.
Step 0: Sample reduction
  Delete all observations not belonging to the reference population ($s_1^1$) or the treatment sequences of interest ($\underline{s}_2^1$; $\underline{s}_2^0$).
Step A: Match $\underline{s}_2^0 = (s_1^0, s_2^0)$ to $s_1^1$ ($E(Y_t^{\underline{s}_2^0} \mid S_1 = s_1^1)$)
  A.1.0  Define a weight $w_i^{\underline{s}_2^0} = 0$ for every observation in $\underline{s}_2^0$.
  A.1.P  Estimate a probit for $P(S_1 = s_1^0 \mid X_0 = x_0)$ and calculate the individual first-period participation probabilities $\hat{p}^{s_1^0}(x_{0,i}) = \hat{p}_i^{s_1^0}$.
  A.1.CS  Delete all observations in $s_1^1$ with lower or higher values of $\hat{p}_i^{s_1^0}$ than the potential comparison observations in $\underline{s}_2^0$.
  A.1.M  For every observation in $s_1^1$ that has not been deleted in A.1.CS find the observation in $s_1^0$ that is closest in terms of $\hat{p}_i^{s_1^0}$ (a match).
  A.1.C  For the matched observation keep the value of $\hat{p}_i^{s_1^0}$ of the observation in $s_1^1$ to which they have been matched. Some observations in $s_1^0$ may appear many times in this matched sample.
  A.2.R  Define the sample of observations in $s_1^0$.
  A.2.P  Estimate a probit for $P(S_2 = s_2^0 \mid S_1 = s_1^0, \underline{X}_1 = \underline{x}_1)$ that leads to individual transition probabilities ($\hat{p}^{s_2^0|s_1^0}(\underline{x}_{1,i}) =: \hat{p}_i^{s_2^0|s_1^0}$).
  A.2.CS  Delete all observations of the matched comparison sample of $s_1^1$ (as computed by A.1.C), as well as the corresponding elements of the target population $s_1^1$, with lower or higher values of $\hat{p}_i^{s_1^0}$ and $\hat{p}_i^{s_2^0|s_1^0}$ than the potential comparison observations in $\underline{s}_2^0$.
  A.2.M  For every observation in the matched comparison sample of $s_1^1$ that is still in the common support (after A.2.CS and A.1.CS) find the observation in $\underline{s}_2^0$ that is closest in terms of $\hat{p}_i^{s_2^0|s_1^0}$ and $\hat{p}_i^{s_1^0}$ using the Mahalanobis metric (where the covariance is computed in the remaining target population $s_1^1$). Every time an observation in $\underline{s}_2^0$ is matched, its weight $w_i^{\underline{s}_2^0}$ is increased by 1.
Step B: Match $\underline{s}_2^1 = (s_1^1, s_2^1)$ to $s_1^1$ ($E(Y_t^{\underline{s}_2^1} \mid S_1 = s_1^1)$)
  B.1.0  Define a weight $w_i^{\underline{s}_2^1} = 0$ for every observation in $\underline{s}_2^1$.
  B.2.R  Reduce the sample to the observations in $s_1^1$ that are in the common support after the matching steps so far.
  B.2.P  Estimate a probit for $P(S_2 = s_2^1 \mid S_1 = s_1^1, \underline{X}_1 = \underline{x}_1)$ that leads to individual transition probabilities ($\hat{p}^{s_2^1|s_1^1}(\underline{x}_{1,i}) =: \hat{p}_i^{s_2^1|s_1^1}$).
  B.2.CS  Delete all observations remaining in $s_1^1$ with lower or higher values of $\hat{p}_i^{s_2^1|s_1^1}$ than observations in $\underline{s}_2^1$.
  B.2.M  For every observation in $s_1^1$ not deleted in B.2.CS find the observation in $\underline{s}_2^1$ that is closest in terms of $\hat{p}_i^{s_2^1|s_1^1}$. Every time an observation in $\underline{s}_2^1$ is matched, its weight $w_i^{\underline{s}_2^1}$ is increased by 1.
Step C: Joint common support
  C.1  Reduce $w_i^{\underline{s}_2^1}$ by 1 for every observation $i$ that is matched to an observation in $s_1^1$ deleted in A.1.CS or A.2.CS (in this case it is not required if B.2.R is strictly enforced).
  C.2  Reduce $w_i^{\underline{s}_2^0}$ by 1 for every observation $i$ that is matched to an observation in $s_1^1$ deleted in B.2.CS (in this case it is not required if B.2.R is strictly enforced).
Step D: Estimation of $\theta_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1)$
  D.1
  $$\hat{\theta}_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1) = \frac{\sum_{i \in \underline{s}_2^1} w_i^{\underline{s}_2^1} y_i}{\sum_{i \in \underline{s}_2^1} w_i^{\underline{s}_2^1}} - \frac{\sum_{i \in \underline{s}_2^0} w_i^{\underline{s}_2^0} y_i}{\sum_{i \in \underline{s}_2^0} w_i^{\underline{s}_2^0}}$$
Step E: Estimation of $Var[\hat{\theta}_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1)]$
  D.2
  $$\widehat{Var}[\hat{\theta}_t^{\underline{s}_2^1; \underline{s}_2^0}(s_1^1)] = \frac{\sum_{i \in \underline{s}_2^1} [(w_i^{\underline{s}_2^1})^2 \, \widehat{Var}(Y_t \mid \underline{S} = \underline{s}_2^1, w = w_i^{\underline{s}_2^1})]}{\big(\sum_{i \in \underline{s}_2^1} w_i^{\underline{s}_2^1}\big)^2} + \frac{\sum_{i \in \underline{s}_2^0} [(w_i^{\underline{s}_2^0})^2 \, \widehat{Var}(Y_t \mid \underline{S} = \underline{s}_2^0, w = w_i^{\underline{s}_2^0})]}{\big(\sum_{i \in \underline{s}_2^0} w_i^{\underline{s}_2^0}\big)^2}$$
  $$\widehat{Var}(Y_t \mid \underline{S} = \underline{s}_2^j, w = w_i^{\underline{s}_2^j}) = \frac{1}{N_{I(\underline{s}_2^j, w_i^{\underline{s}_2^j})} - 1} \sum_{i \in I(\underline{s}_2^j, w_i^{\underline{s}_2^j})} [y_i - \bar{y}_{I(\underline{s}_2^j, w_i^{\underline{s}_2^j})}]^2; \quad \bar{y}_{I(\underline{s}_2^j, w_i^{\underline{s}_2^j})} = \frac{1}{N_{I(\underline{s}_2^j, w_i^{\underline{s}_2^j})}} \sum_{i \in I(\underline{s}_2^j, w_i^{\underline{s}_2^j})} y_i$$
  The set $I(\underline{s}_2^j, w_i^{\underline{s}_2^j})$ is determined as the $2\sqrt{N}$ closest neighbors to observation "i" with respect to the value of $w^{\underline{s}_2^j}$ of observations with the same observed treatment as observation "i".
Note: $t > 1$. Changes required for the case $t = 1$ are obvious. The number of neighbors in the k-NN estimation of the conditional variance is the one used in Lechner (2004). In the empirical application below, the results appear not to be sensitive to small deviations from this value.
attached to observations in early matching steps (relating to transitions in early periods) change over different sequential matching steps due to imprecise matching. To prevent this from happening, every matched comparison observation in period 2 is recorded with the values $\hat{p}_i^{s_1^l}$ of the observation it was matched to in period 1, instead of its own ($\hat{p}$ denotes a consistent estimate of $p$). Hence, the "history" of the match, or, in other words, the characteristics of the reference distribution, does not change when the next match occurs in the subsequent period.
Third, to compute $E(Y^{\underline{s}_2^k} \mid S_1 = s_1^l)$ the only information that is needed for the $N^{s_1^l}$ participants in $s_1^l$ is $\hat{p}_i^{s_1^k}$. Similarly, for participants in $\underline{s}_2^k$, all probabilities of the type $\hat{p}_i^{s_2^k|s_1^k, s_1^k}$ are required. For participants in $s_1^k$ but not in $\underline{s}_2^k$ only $\hat{p}_i^{s_1^k}$ is needed, and so on. To estimate $E(Y^{\underline{s}_2^l} \mid S_1 = s_1^l)$ instead of $E(Y^{\underline{s}_2^k} \mid S_1 = s_1^l)$, the only change in the matching protocol is that the initial matching step on $\hat{p}_i^{s_1^l}$ is redundant. When interest is in the average effect in the population ($E(Y^{\underline{s}_2^k})$), then the whole population plays the role of the first reference group (instead of $s_1^l$). In this case, in the matching step based on $\hat{p}_i^{s_1^k}$, all participants in $s_1^k$ are matched to themselves. In addition, selected participants in $s_1^k$ are matched to participants in the remaining treatments in the first period.
When matching is on the propensity score instead of directly on the confounding variables, there is the issue of selecting a probability model. It seems that so far even in the static model the literature has not addressed this thoroughly. However, the consensus seems to be that a flexibly specified (and extensively tested) parametric model is sufficiently rich and that the choice of the model does not really matter (see, e.g. the Monte Carlo results by Zhao, 2004). Similarly, the suggestions in the literature to guide the specification choice by the ability to achieve balance of the respective covariates (e.g. Rosenbaum & Rubin, 1984; Rubin, 2004) can be applied here as well (in each step).
Next, there is the issue of consistent estimation of the standard errors that is not yet resolved for the static matching literature. Based on the simulation results presented in Lechner (2004), the standard errors are computed conditional on the weights. In other words, the fact that the weights are estimated quantities is ignored. Furthermore, the outcomes may be conditionally heteroscedastic. However, heteroscedasticity is only relevant in this context if related to the weights. Therefore, a simple k-nearest neighbor estimator is used as in Lechner (2004) to adjust for any such heteroscedasticity. Although such an estimator performed well in Lechner (2004), there is potential for improvement.
The final remark about the matching protocol concerns the common support. The region of common support (defined on the reference distribution for which the effect is desired) has to be adjusted period by period with respect to the conditioning variables of that period. The matching estimator makes it easy to trace back the impact of this procedure on the reference distribution.⁸
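The core of the matching protocol of this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the chapter: the propensity scores are taken as given (no probit is estimated), the common-support steps are omitted, ties go to the first candidate, at least two target observations are assumed, and the covariance for the Mahalanobis metric is computed on the matched pairs carried forward from the first step. All function and argument names are hypothetical.

```python
def nearest(value, candidates):
    """Index of the candidate closest to value (ties resolved to the first)."""
    return min(range(len(candidates)), key=lambda m: abs(candidates[m] - value))


def inverse_covariance(points):
    """Inverse 2x2 sample covariance; identity if the sample is degenerate."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    c00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    c11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    det = c00 * c11 - c01 ** 2
    if det < 1e-12:
        return ((1.0, 0.0), (0.0, 1.0))
    return ((c11 / det, -c01 / det), (-c01 / det, c00 / det))


def mahalanobis_sq(u, v, inv):
    d0, d1 = u[0] - v[0], u[1] - v[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))


def sequential_weights(target_p1, mid_p1, mid_p2, final_p1, final_p2):
    """Weights in the spirit of Eq. (6) for the members of the full sequence.

    target_p1: first-period scores of the target population
    mid_p1, mid_p2: first-period and transition scores of the members of S1 = s1^k
    final_p1, final_p2: the same scores for the members of the full sequence s2^k
    """
    # Period-1 step: match on the first-period score, but carry forward the
    # target's own score so that the reference distribution does not drift.
    matched = []
    for p1_n in target_p1:
        m = nearest(p1_n, mid_p1)
        matched.append((mid_p2[m], p1_n))
    inv = inverse_covariance(matched)
    finals = list(zip(final_p2, final_p1))
    counts = [0] * len(finals)
    # Period-2 step: match on both scores with replacement.
    for pair in matched:
        i = min(range(len(finals)),
                key=lambda c: mahalanobis_sq(pair, finals[c], inv))
        counts[i] += 1
    return [c / len(target_p1) for c in counts]


def weighted_mean(weights, outcomes):
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


def knn_variance_of_weighted_mean(weights, outcomes, k):
    """Variance of one weighted mean, conditional on the weights (k >= 2).

    The conditional variance of the outcome is estimated from the k
    observations whose weights are closest to the weight of observation i.
    """
    n = len(weights)
    total = 0.0
    for i in range(n):
        neighbours = sorted(range(n), key=lambda j: abs(weights[j] - weights[i]))[:k]
        ys = [outcomes[j] for j in neighbours]
        ybar = sum(ys) / len(ys)
        local_var = sum((y - ybar) ** 2 for y in ys) / (len(ys) - 1)
        total += weights[i] ** 2 * local_var
    return total / sum(weights) ** 2
```

An effect estimate is then the difference of two such weighted means, one per sequence, and its variance is the sum of the two variance terms; for the number of neighbours, Table 1 uses $2\sqrt{N}$.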
3.3. Multiple Treatments and Many Periods
The main issue concerns the specification of the propensity scores: For example, when specifying the probability of participating in $\underline{s}_2^k$ conditional on participating in $s_1^k$, is it necessary to take account of the fact that not participating in $\underline{s}_2^k$ implies a range of possible other states in period 2? The answer is no, because in each step the independence assumption relates only to a binary comparison, e.g. $Y_2^{\underline{s}_2^k} \perp\!\!\!\perp 1(S_2 = s_2^k) \mid S_1 = s_1^k, \underline{X}_1 = \underline{x}_1$, and $Y_2^{\underline{s}_1^k} \perp\!\!\!\perp 1(S_1 = s_1^k) \mid S_1 \in \{s_1^j, s_1^k\}, X_0 = x_0$ ($s_1^j$ being the target population as before). Therefore, the conditional probabilities of not participating in the event of interest conditional on the history are sufficient.⁹ Hence, as already noted, $P(S_2 = s_2^k \mid S_1 = s_1^k, \underline{X}_1 = \underline{x}_1)$ and $P(S_1 = s_1^k \mid X_0 = x_0, S_1 \in \{s_1^l, s_1^k\})$ may be used in the matching step in period 1. The multiple treatment feature of the problem does not add to the dimension of the propensity scores.
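The binary probability needed for a pairwise comparison follows from the multinomial choice probabilities by restricting attention to the two treatments involved. The short Python sketch below only illustrates this population identity; the function name and the probabilities are hypothetical, and in an application one would typically estimate a binary model in the subsample with $S_1 \in \{s_1^l, s_1^k\}$ instead.

```python
def pairwise_probability(choice_probs, k, l):
    """P(S1 = k | X0 = x0, S1 in {k, l}) from choice probabilities P(S1 = s | X0 = x0)."""
    return choice_probs[k] / (choice_probs[k] + choice_probs[l])


# Hypothetical first-period choice probabilities of one individual for
# three treatments (0 = no program, 1 and 2 = two programs).
probs = {0: 0.5, 1: 0.2, 2: 0.3}
p_1_vs_0 = pairwise_probability(probs, 1, 0)
p_0_vs_1 = pairwise_probability(probs, 0, 1)
```

The resulting score remains one-dimensional per comparison, regardless of the number of treatments.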
4. DATA
The artificial data are generated to look similar to individual panel data found in actual evaluation studies of European-type active labor market programs (such as Gerfin & Lechner, 2002; Lechner et al., 2004), although the exact properties of those data sets used for static evaluation studies are not reflected in the artificial data. In these data, a period is rationalized as a quarter. In that sense, there is detailed information about many employment-related variables on a quarterly basis for 9 years (quarter 1(1) to quarter 9(4)). In addition, there are summary measures that aim to capture the events before the data are recorded in quarterly intervals.¹⁰ The sample is selected to contain all "individuals" who are unemployed in the last quarter of year 2, denoted as 2(4). Starting in the first quarter in year 3, individuals may participate in active labor market programs. If not having done so before, they may start a program every quarter up to and including 4(2). In addition, in 4(2) they may start a new program, even if they already participated in a program completed before 4(2). There are two employment programs (treatments "3" and "4") and two training programs (treatments "1" and "2"). The main difference between the two types is that employment programs have a smaller lock-in effect and that the positive medium-run effects that all programs have deteriorate at a faster rate than for the training programs.
We consider (potential) outcome processes for all sequences of different programs that relate to employment status (employed, unemployed, out of the labor force) and earnings. All processes show considerable state dependence, embody time trends and are influenced by several covariates. Shorter programs with a mean of about 5 months for training and 6 months for the employment programs (standard deviation (SD) 2 and 4 months) have much shorter lock-in effects than the longer programs of the same category (mean duration 20 and 18 months; SD 2 and 4 months), but the (positive) effects also depreciate faster. However, for the sake of brevity the longer program types are not contained in the descriptive statistics given in Table 2. Table 2 contains descriptive statistics as well as a characterization of the variable type for the most important time-dependent and time-independent variables in the specific subsamples defined by treatment status. They are the usual types of variables with typical codings, means, and standard deviations. A full set of statistics for all variables is available on request from the author. Note that the treatment sequences that define the appropriate subsamples are abbreviated by strings that contain a treatment state for every period. For example, 222222 means six periods for treatment 2, whereas 010000 would mean one period of treatment 0, followed by a period of treatment 1, followed again by four periods of treatment 0. Selection is based on an index model (multinomial probit) with five alternatives. Choices depend on observables that also appear in the outcome processes as well as normally distributed unobservables that are mutually dependent but independent of the observables and unobservables appearing in the outcome equations. All selection processes fulfill W-DCIA.
With the exception of nationality, all variables appearing in Table 2 influence selection in each period, but only earnings, the employment states, and the assessment by the caseworker are influenced by the treatment and are thus considered outcome variables, or intermediate outcomes for those variables that relate to periods in which the sequences are not yet completed. Comparing the value of the covariates and intermediate outcomes across sequences reveals considerable selection effects as well as considerable differences in the moments of the outcome variables. The actual probit models underlying the results presented in the following sections are misspecified, but in ways that remain largely undetected by conventional specification tests. The misspecification relates to the functional form (only single models instead of the underlying multiple index models are used) as well as to omission of some covariates that are highly correlated with covariates included in the sample. In this respect as well, the artificial
Table 2. Descriptive Statistics for Selected Variables and Selected Subsamples.
Columns give the mean and SD in the subsample defined by treatment status in periods 3(1) up to 4(2).

| Variable | Type | 0: Mean | 0: SD | 000000: Mean | 000000: SD | 1: Mean | 1: SD | 1111: Mean | 1111: SD | 2: Mean | 2: SD | 222222: Mean | 222222: SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monthly earnings in EUR | | | | | | | | | | | | | |
| Last 10 years (average) | C | 1,349 | 1,789 | 1,030 | 1,089 | 4,673 | 4,071 | 4,851 | 4,074 | 5,449 | 3,883 | 5,432 | 3,895 |
| 1(1) | C | 1,095 | 1,732 | 786 | 1,028 | 4,323 | 4,137 | 4,386 | 4,156 | 5,009 | 4,040 | 4,953 | 4,067 |
| 2(4) | C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4(1) | C | 358 | 1,034 | 525 | 832 | 1,337 | 3,107 | 722 | 2,246 | 1,278 | 3,093 | 444 | 1,943 |
| 9(4) | C | 1,241 | 1,933 | 768 | 1,038 | 3,964 | 4,270 | 4,040 | 4,345 | 5,229 | 4,533 | 5,091 | 4,512 |
| Unemployment in % | | | | | | | | | | | | | |
| Total months in last 10 years | D | 12 | 7 | 11 | 7 | 11 | 7 | 12 | 7 | 12 | 7 | 13 | 7 |
| 1(1) | I | 10 | – | 9 | – | 6 | – | 7 | – | 5 | – | 6 | – |
| 2(4) | I | 100 | – | 100 | – | 100 | – | 100 | – | 100 | – | 100 | – |
| 4(1) | I | 54 | – | 23 | – | 60 | – | 65 | – | 60 | – | 78 | – |
| 9(4) | I | 10 | – | 21 | – | 5 | – | 5 | – | 7 | – | 9 | – |
| Employment in % | | | | | | | | | | | | | |
| Total months in last 10 years | D | 85 | 20 | 86 | 22 | 89 | 19 | 87 | 20 | 89 | 18 | 88 | 18 |
| 1(1) | I | 78 | – | 76 | – | 90 | – | 87 | – | 90 | – | 89 | – |
| 2(4) | I | 0 | – | 0 | – | 0 | – | 0 | – | 0 | – | 0 | – |
| 4(1) | I | 31 | – | 49 | – | 32 | – | 27 | – | 35 | – | 17 | – |
| 9(4) | I | 68 | – | 58 | – | 86 | – | 87 | – | 83 | – | 80 | – |
| Labor market prospects as assessed by caseworker (1, 2, 3, 4) | | | | | | | | | | | | | |
| 2(4) | D | 2.8 | 1.1 | 3.1 | 1.0 | 2.7 | 1.1 | 2.6 | 1.1 | 2.4 | 1.2 | 2.5 | 1.2 |
| 3(1) | D | 2.3 | 1.1 | 2.7 | 1.1 | 2.0 | 1.1 | 1.9 | 1.1 | 1.9 | 1.0 | 1.9 | 1.0 |
| 4(1) | D | 2.6 | 1.1 | 3.4 | 0.8 | 2.5 | 1.1 | 2.3 | 1.1 | 2.4 | 1.2 | 2.1 | 1.1 |
| Regional unemployment rate in %-points (85 regions) | | | | | | | | | | | | | |
| 2(4) | D | 12 | 5 | 12 | 5 | 12 | 5 | 12 | 5 | 12 | 5 | 12 | 5 |
| 3(1) | D | 13 | 5 | 13 | 5 | 13 | 5 | 13 | 5 | 13 | 5 | 13 | 5 |
| 4(1) | D | 15 | 27 | 14 | 5 | 14 | 5 | 15 | 5 | 14 | 5 | 14 | 5 |
| Other variables | | | | | | | | | | | | | |
| Age in 1(1), years | D | 40 | 6 | 39 | 6 | 41 | 6 | 41 | 6 | 40 | 6 | 40 | 6 |
| Female | I | 41 | – | 45 | – | 36 | – | 39 | – | 27 | – | 28 | – |
| Schooling (8–12) | D | 10 | 1.3 | 10 | 1.2 | 11 | 1.3 | 11 | 1.3 | 12 | 0.9 | 12 | 0.9 |
| Vocational degree (0,1,2) | D | 0.8 | 0.5 | 0.8 | 0.5 | 1.3 | 0.7 | 1.4 | 0.7 | 1.6 | 0.6 | 1.6 | 0.6 |
| Nationality (1–5) | D | 1.6 | 1.1 | 1.6 | 1.1 | 1.6 | 1.1 | 1.6 | 1.1 | 1.6 | 1.1 | 1.6 | 1.1 |
| Regional share of service sector | C | 58 | 13 | 58 | 14 | 60 | 13 | 59 | 13 | 63 | 12 | 62 | 12 |
| Regional share of production sector | C | 30 | 11 | 29 | 11 | 29 | 11 | 30 | 11 | 28 | 10 | 28 | 10 |
| Sectoral unemployment rate | C | 12 | 4 | 12 | 18 | 12 | 19 | 12 | 4 | 13 | 4 | 13 | 4 |
| Occupational unemployment rate | C | 12 | 21 | 11 | 5 | 11 | 21 | 12 | 5 | 13 | 4 | 13 | 4 |
| Observations | | 69,951 | – | 16,871 | – | 8,997 | – | 888 | – | 8,665 | – | 3,404 | – |

Note: I, binary indicator variable (0, 1); D, discrete variable; C, continuous variable; Mean, mean in subsample; SD, standard deviation in subsample. For indicator variables the share of ones in % is given. The number of observations in the complete sample is 100,000. For treatment 1 we show a subsample based on a sequence of only 4 periods instead of a sequence of 6 periods as for treatments 0 and 2, because for this short program there would be only 34 observations in the latter group. Descriptive statistics for the variables and subsamples not included are available on request from the author.
data seem to exhibit the same problems and questions as real data sets usually do. The effects of the programs are heterogeneous depending on the type and duration of program, as well as on several covariates. They show lock-in effects that depend on program duration. The effects of all programs depreciate, but with different speeds. The autocorrelation in the outcome process may increase the lock-in effect and dampen the depreciation. Tables A1a and A1b in Appendix A, as well as Tables 6–8, show means and standard deviations for some outcomes of selected sequences and subsamples.
5. INTERESTING CAUSAL EFFECTS AND ESTIMATION RESULTS
The static model of potential outcomes may also be used to define potential states of the world for dynamic phenomena and to estimate the effects by the usual econometric methods. Using a static model, each possible sequence of treatments corresponds to one "static" treatment. The limitation of the static models relates to how restrictions on the joint distribution of selection variables and potential outcome variables have to be formulated in order to identify some average causal effects. By the nature of the static model, those restrictions cannot take into account selection effects based on intermediate outcomes. For example, if a CIA (selection on observables) is deemed plausible, then variables that are determined by the treatment (a particular type of "endogenous" variables) should not appear in the conditioning set, thus ruling out intermediate outcomes.¹¹ Thus, it would be straightforward to accommodate dynamic phenomena based on the S-DCIA. This is not the case for the W-DCIA, which allows some specific forms of endogeneity of the covariates. The following sections give some examples (sequences of programs, waiting for the start of a program, and duration of a program) when such considerations appear to be particularly relevant, discuss potential ways of setting up the estimation problems, and present some results. A nice byproduct of the dynamic approach compared to the static approach is that one has to be explicit about what the alternative treatment state is, i.e. whether we compare two periods of treatment to one period of no treatments, or two periods of no treatment, or any other treatment-by-period combination.
MICHAEL LECHNER
In many empirical evaluation papers, the no-treatment state is not clearly defined.
5.1. Program Sequences
Consider an example coming from the literature on evaluating training programs in which interest is not in the effect of one particular program, but of a sequence of programs. However, if the first course of such a sequence is very effective, many unemployed individuals may find that their employment chances have drastically increased afterward and may not want to attend the next course as originally intended. If interest is not in the first course, but in the sequence of courses, such behavior creates a selection problem that cannot be addressed in the static model. For example, controlling for pre-training variables does not work for obvious reasons in the static model, because an important selection variable, i.e. the outcome of the first participation, is missing. However, controlling for variables realized after the first training course that influence selection into the second course entails the potential problem that they may be influenced by the first part of the training sequence. Thus, they are “endogenous” in the static model and, therefore, ruled out as control variables. If it is true that selection in each period is based on what is known about the unemployed so far, and that this information is observed, then W-DCIA holds.
Table 3 shows some comparisons of the (monthly) earnings effects of different sequences of programs (each sequence starts in period 3(1) and contains the treatment states for the following periods). To differentiate between short-run and medium- to long-run effects, the columns of Table 3 (as well as Tables 4 and 5) provide estimates for the periods 4(4), 7(4), and 9(4) for the different sequences under investigation. Whereas in period 4(4) some of the programs of interest may still be running, in period 9(4) the different lengths of the sequences and programs are not important anymore, because all potential outcomes are close to their long-run levels.
The rows of these tables contain further information about the sample sizes of the target populations after imposing the common support condition, and on the two counterfactual mean outcomes and the resulting treatment effects. The cells that contain the latter are shaded. Estimated standard errors appear in parentheses below the corresponding estimates of outcomes and effects. For the interpretation of the estimated effects and counterfactual outcomes, it is important to note that whenever the potential state ‘‘0’’ appears, it does not necessarily mean that the individual is unemployed, but
Table 3. Earnings Effects of Program Sequences Estimated by Sequential Matching.
Table 4. The Earnings Effects of Waiting Estimated by Sequential Matching.
Table 5. The Earnings Effects of the Duration of the Program Estimated by Sequential Matching.
Matching Estimation of Dynamic Treatment Models
merely that she is not participating in some program in that period. The first practical problem that appears is that there are many possible comparisons. Here, we focus on different programs of a length of three periods that may or may not be followed by the start of a further program in period 4(2). The duration of the latter program is not part of the explicit definition of the effect. To capture the effects of having or not having a second program in period 4(2), all sequences have to be fully specified for at least six periods. Thus, a sequence like 111002 should be interpreted as participating in program 1 for three quarters and then starting program 2 in 4(2). The dynamic causal model allows the researcher to fix the duration of the first program, and thus isolate the effects of different program durations (see Section 5.3) from sequences of programs. Of course, the duration of the second program could be fixed as well, but that would lead to very small sample sizes in this example. Table 3 shows a variety of potentially interesting comparisons for different target populations, namely those who participated in a program in the first period and the nonparticipants in that period. Since W-DCIA has only limited identifying power, the reference populations are based on the treatment status in the first period. Compared to nonparticipation, the results show lock-in effects of different sizes, and (with the exception of program 4) considerable positive effects thereafter, although there is sometimes a lack of precision. This finding is in line with the true values that can be found in Appendix A. There is considerable effect heterogeneity across target populations (see Note 12). Furthermore, the estimates from the comparisons of the different programs to each other are too noisy to pin down any medium- to long-run effects in a precise way. The precision of the estimates depends on the number of “useful” observations (i.e. 
observations that are comparable to those in the target populations) in the sequences, which is of course related to the length of the specified sequence. The longer the sequence, the more precise the meaning of the causal contrast, but the smaller the remaining number of observations available to estimate it. Furthermore, an increased number of observations in the target population increases precision as well, as the estimates that compare the same sequences for different target populations in Table 3 illustrate.
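The sequence notation used in this section can be made concrete with a small sketch. The following Python fragment is an illustration only and not the chapter's implementation: it maps the positions of a sequence such as 111002 to period labels, assuming that the first position is period 3(1) and that periods are quarters, and it counts how many observations follow progressively longer sequences. The function names and the toy treatment histories are hypothetical.

```python
def period_label(offset, start_year=3, start_quarter=1):
    """Label of the period `offset` quarters after the start, e.g. '4(2)'."""
    q_index = (start_quarter - 1) + offset
    return f"{start_year + q_index // 4}({q_index % 4 + 1})"


def describe(spec):
    """Pair every treatment state in a sequence with its period label."""
    return [(period_label(i), state) for i, state in enumerate(spec)]


def follows(observed, spec):
    """True if an observed treatment history starts with the specified sequence."""
    return observed.startswith(spec)


def count_following(histories, spec):
    """Number of observed histories that are compatible with the sequence."""
    return sum(follows(h, spec) for h in histories)


# Invented six-period histories, for illustration only (not the chapter's data).
histories = ["111002", "111000", "111000", "110000",
             "100000", "000000", "000001", "111002"]

counts = {spec: count_following(histories, spec)
          for spec in ["1", "11", "111", "111002"]}
last_period, last_state = describe("111002")[-1]
```

With these invented histories the count falls from 6 for sequence 1 to 2 for sequence 111002, which mirrors the trade-off between a sharper causal contrast and fewer usable observations.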
5.2. Delayed Program Start: The Effects of Waiting
In the previous section, the specification of the target quantity for which the causal contrast is desired is relatively straightforward. In contrast, when
interest lies in the effect of waiting for the start (or delaying the start) of a program, there are different ways to state the causal parameter. The first possibility is to just concentrate on the beginning of the program and to take no account of the fact that programs that may start at different points in time may also differ in other ways such as their duration. Such a comparison is displayed in the top panel of Table 4. Note that the required sequences have different lengths. It appears that for program 1 the long-run effects of delaying are small in this setup, which is again in line with the true values. The alternative (or the complement) to this approach is to require some minimum program duration, as is shown in the bottom panel for program 3. For that program the estimation results do not change much when different minimum lengths are considered. The price to pay for specifying longer sequences is of course a reduction in the precision of the estimator. Although program 3 is a longer program, there is not very much attrition in the first three periods considered in Table 4 (which is also the reason why the results do not change much when increasing the length of the sequences). Note that the comparisons of the different effects for the same target population are sometimes hampered by the fact that the common support may shrink drastically when the length of the sequence increases, because the participants tend to become more homogenous the longer the sequences. This issue is taken up again below. Of course, many more interesting contrasts could be considered, such as requiring exactly the same length for both programs. However, they are beyond the illustrative nature and the space constraints of this chapter.
5.3. The Effects of the Duration of a Program
In this section, I take up the issue of how to measure the effects of different program durations. The comparison of interest is the effect of an extension of a program. Table 5 shows the results for extensions of one, two, or three periods. The results differ according to whether the extensions concern only the minimum duration (upper panel) or the actual duration (lower panel). In the long run, the estimates reflect positive effects of program duration, although, as observed before, precision becomes an issue when the specified sequence gets longer. Note that the static evaluation literature typically refrains from estimating the effects of actual duration, because the programs themselves can cause attrition and thus actual duration cannot be used to differentiate between shorter and longer programs. As long as the relevant factors are observable,
so that the W-DCIA holds, this issue is not a problem here and the effects can be estimated directly.
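The contrasts of Sections 5.2 and 5.3 can be written down as sequence strings. The sketch below is a hedged illustration in Python with hypothetical helper names. It assumes that waiting is encoded as leading periods of nonparticipation, that a minimum duration leaves all later periods unspecified, and that an actual duration is closed by one period of nonparticipation, which is consistent with sequences such as 000001, 111 and 1110 listed in Appendix A.

```python
def waiting_sequence(program, wait):
    """Program start delayed by `wait` periods of nonparticipation ('0')."""
    return "0" * wait + program


def minimum_duration_sequence(program, periods):
    """At least `periods` periods in the program; later periods unspecified."""
    return program * periods


def actual_duration_sequence(program, periods):
    """Exactly `periods` periods in the program, then nonparticipation.

    Assumption: the spell is closed by a '0'; leaving into another
    program would need a different final state.
    """
    return program * periods + "0"


# Contrasts of the kind discussed in the text, for program 1.
waiting_contrast = (waiting_sequence("1", 0), waiting_sequence("1", 5))
extension_minimum = (minimum_duration_sequence("1", 3),
                     minimum_duration_sequence("1", 4))
extension_actual = (actual_duration_sequence("1", 3),
                    actual_duration_sequence("1", 4))
```

Writing the sequences out in this way makes the alternative treatment state of every comparison explicit.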
6. A COMPARISON TO STATIC MATCHING
Having discussed some of the questions that can be fruitfully addressed using the dynamic treatment approach, this section is concerned with a comparison of the sequential estimators and static approximations of the dynamic problem. For the latter, there are (at least) two possible implementations: the first one ignores the role of the intermediate outcomes as selection variables for the reason that they are endogenous relative to the beginning of the sequence. The second approximation includes those variables in the list of covariates. There are two a priori reasons why static and dynamic matching estimators may deviate. First, the static version is biased if S-DCIA fails. Second, the estimates may be based on different populations because of different common supports. Different common supports become an issue particularly when the static approach does not use the intermediate outcomes. As the bias of the static estimator is extensively documented in Lechner (2004), I refrain from searching for an example for which the bias is sufficiently substantial on the same common support. Instead, I employ the same example that will appear below when discussing the common support problem.
In Table 6, column (1) defines the relevant comparisons and the target population. Column (2) gives the sample sizes for the treatment groups as well as for the target populations after having imposed common support. The latter depends on the particular sequences under consideration. Column (3) shows the (unadjusted) sample means of the outcome variables in the various subpopulations. Comparing these sample means to the true mean values of the potential outcomes given in column (5) gives some indication of the selection bias that should be corrected for by matching. Columns (7)–(10) contain the results for the static matching, whereas columns (11) and (12) contain the corresponding results for sequential matching. 
The first observation is that the results for static matching may differ substantially depending on whether the intermediate outcomes are included, or not. In the latter case, the results are typically closer to those of the dynamic matching, which in turn are not too far away from the true values (although due to the sampling uncertainty it is hard to pin down differences precisely with the sample sizes used). The second observation is that the common support issue does play a major role, as can be seen by the changing
Table 6. The Comparison with Static Matching – Earnings and Employment in Period 9(4).
number of observations, by the sample means of the outcome variable for the target population as well as by the different estimates for the shortest sequence ‘‘1’’ that is not subject to the bias due to ignored dynamics. Since the common support issue resurfaces in almost every discussion of the results so far, it is discussed explicitly in Section 7.
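The support rule used in this chapter (the region between the second smallest and second largest value of the respective propensity scores in the reference distribution, or the 10th smallest and largest in the more conservative variant) can be sketched as follows. This is a minimal Python illustration under assumptions: the function names and scores are invented, the bounds are treated as inclusive, and the choice of which group supplies the reference scores is left to the caller.

```python
def support_bounds(reference_scores, k=2):
    """Return the k-th smallest and k-th largest reference score."""
    ordered = sorted(reference_scores)
    if len(ordered) < 2 * k:
        raise ValueError("need at least 2*k reference scores")
    return ordered[k - 1], ordered[-k]


def on_support(scores, bounds):
    """Flag the scores that lie inside the (inclusive) support region."""
    low, high = bounds
    return [low <= s <= high for s in scores]


# Invented propensity scores, for illustration only.
reference = [0.05, 0.10, 0.20, 0.40, 0.60, 0.90, 0.95]
targets = [0.02, 0.10, 0.50, 0.92]

baseline = support_bounds(reference, k=2)   # second smallest / second largest
stricter = support_bounds(reference, k=3)   # a more conservative region
kept_baseline = on_support(targets, baseline)
kept_stricter = on_support(targets, stricter)
```

In a sequential setting a rule of this kind would be applied at every node of the sequence, so the retained target population can only shrink as the sequence gets longer.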
7. SOME POTENTIAL PROBLEMS
7.1. Common Support
It has already been pointed out that the comparisons of the different effects for the same target population may be hampered by the fact that the common support may shrink drastically when the length of the sequences increases, because the participants tend to become more homogenous the longer the sequences. Table 7, which is organized in the same way as Table 6, addresses this issue explicitly for two comparisons. First, the effect of waiting defined by explicitly increasing waiting time (and thus the length of the comparison sequence) from one period to five periods is considered. It appears that the sample size of the target population (participants in the one-period treatment) drops from 8,800 to only 3,400 for the sequence defined over six periods. Comparing the changing values of the true potential outcomes in the target populations before and after imposing common support shows that the populations are indeed very different. For example, in the initial target population mean potential earnings for at least one period of program 1 were 4,022 EUR, but within the common support for the comparison of 1 to 000001 the mean potential outcome is less than half that value at 1,590 EUR. A similar picture arises in the bottom panel of Table 7, where different minimum durations of participation in program 2 are compared to spells of nonparticipation of the same minimum duration, although the drop is much less dramatic. Again, these examples point out that there is a price to pay for getting a more precise definition of treatment and the selection effects within a nonparametric framework, because only relatively small parts of the sample may contain useful information for particular dynamic sequences. The problem of small sample size can be remedied in the three usual ways: The first option is to change the parameter of interest by specifying shorter sequences. This corresponds to aggregating over some longer sequences. 
The second option is to use parametric assumptions to add additional smoothness to the
Table 7. The Issue of Common Support – Earnings in 9(4).
estimation problem. This is, however, not the subject of this chapter. The third option is to collect more data. Note that these findings are fairly robust to a more conservative definition of the common support, such as the region between the 10th largest/smallest value of the respective scores as opposed to the second largest/smallest value as in the baseline estimation.
7.2. Too Many Covariates May be Needed to Control Selection
A potential drawback of the sequential matching approach is that the number of theoretically required covariates increases with the length of the sequence specified, because all the past intermediate outcomes observable at each node of a sequence should be included. When matching is on the propensity score, this implies that the dimension of the matching problem is increasing even if a parametric propensity score is used. The reason is that all past scores should be included as well. Furthermore, estimating a propensity score becomes more demanding when moving along the sequences, as the number of variables increases. Both potential problems could become more severe due to the fact that the number of observations decreases when moving along the sequence. Finally, increasing the number of control variables could also decrease the number of observations that remain in the target population after imposing common support and thus change the estimand considerably. Table 8 documents some of these problems by comparing two sequential matching estimators based on different specifications of the covariates applied to both a long and a short sequence. 
“All X” defines a specification in which all variables that are theoretically required are included (up to 28 for sequence 01 and up to 39 for 000001; see Note 13). “Few X” defines a specification that includes only the most recent information about the employment history and intermediate outcomes, leading to only 22 variables in both cases (see Note 14). When considering the results for employment rates and earnings, there appear to be no significant differences for the employment rates. For the longer sequences and their effect on earnings, the target populations change to some extent in this example (compare the true values of the target population adjusted for the common support given by the two different estimators). However, the results for both specifications are still close to the true values. The fact that both specifications are close is probably because the observed intermediate outcomes are highly correlated and including only the last one is (almost) sufficient. On the other hand, the specification with all variables included does not seem to be plagued by small sample problems due to
Table 8. The Importance and Danger of Including a Rich Set of Covariates – Earnings and Employment in 9(4).
potential over-parametrization (indeed it seems to be preferable to the one that leaves out some variables), so that, in this example, it seems safe to include all of the variables in the specification that are required by theory.
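The difference between the two covariate specifications can be illustrated with a short sketch. The Python fragment below is an assumption-laden illustration, not the specification used in the chapter: the variable names are hypothetical, each period is assumed to contribute one earnings and one employment variable, and the baseline covariates are placeholders. It only shows why the full list grows with the node of the sequence while the reduced list does not.

```python
def covariate_names(node, spec, baseline=("x0_a", "x0_b")):
    """Covariates for the selection step at `node` (1 = first period).

    spec='all' keeps the intermediate outcomes of all earlier periods,
    spec='few' keeps only those of the most recent earlier period.
    All names are hypothetical placeholders.
    """
    if node < 1:
        raise ValueError("node must be at least 1")
    if spec not in ("all", "few"):
        raise ValueError("spec must be 'all' or 'few'")
    past = list(range(1, node))      # periods completed before this node
    if spec == "few":
        past = past[-1:]             # most recent period only (if any)
    names = list(baseline)
    for t in past:
        names += [f"earnings_t{t}", f"employed_t{t}"]
    return names


all_x_sizes = [len(covariate_names(n, "all")) for n in range(1, 7)]
few_x_sizes = [len(covariate_names(n, "few")) for n in range(1, 7)]
```

Under these assumptions the full specification adds two variables per node, whereas the reduced specification has a constant size from the second node onward.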
8. CONCLUSIONS
This chapter attempted to show that dynamic matching estimation can be a useful additional tool for empirical researchers because in many important cases it allows the researcher to define the causal parameter of interest more precisely and to address selection problems that occur while the treatment under consideration is in progress. However, there is a price for getting a more precise definition of treatment and the selection effects within this nonparametric framework, because only relatively small parts of the sample may contain useful information for the particular dynamic sequences of interest (dynamic matching is an identification and estimation strategy starving for data). This problem of small sample sizes can be addressed by aggregating over some longer sequences. This, however, changes the parameter of interest and brings the dynamic model closer to the static one. Alternatively, one may consider parametric assumptions to add additional smoothness to the estimation problem. In the future, this option may be explored in more depth.
NOTES
1. There are further connections to other strands of econometrics: For example, the literature on dynamic panel data models identified by sequential moment conditions (e.g. Chamberlain, 1987, 1992) and this approach are related. Another connection is with the literature on social learning. In particular, Manski (2004) is concerned with dynamic selection problems from one cohort to the next. However, he assumes that the outcome distribution is stationary over time, which is in sharp contrast to our modeling of the outcomes. Therefore, in his framework, as time goes by more information is revealed about the same counterfactual outcome distribution and social learning can be regarded as a process of reducing ambiguity resulting from the selection process. In the framework by Lechner and Miquel (2001) used here, stationarity is not required and the uncertainty does not necessarily decrease over time. Finally, the work by Abbring and van den Berg (2003) addresses dynamic issues by using variation in the start time of treatment spells to identify the effects within a duration framework. Abbring and Heckman (forthcoming) provide an overview of the different dynamic approaches.
2. An exception is Ding and Lehrer (2003), who use this framework and related work by Miquel (2002, 2003) to evaluate a sequentially randomized class size study using difference-in-difference-type estimation methods.
3. See also the related approaches by Li, Propert, and Rosenbaum (2001) that appeared in the statistics literature.
4. To differentiate between different sequences, sometimes a letter (e.g. j) is used to index a sequence, as in $s_t^j$. As a further convention, capital letters usually denote random variables, whereas small letters denote specific values of a random variable. When we deviate from this convention, the intended meaning will be obvious.
5. To simplify the notation further, we consider period 2 as the only period relevant for the outcome of interest. However, for all that follows $X_2$ and $Y_2$ should be considered as measured at some point in time after treatment 2 occurred. The exact timing is determined by the substantive interest of the researcher conducting the empirical study.
6. The following assumptions relate to identification of all treatment effects that could possibly be defined by the notation in Section 2. If the desired comparison involves fewer periods, the required changes are obvious.
7. $A \coprod B \mid C = c$ means that each element of the vector of random variables B is independent of the random variable A conditional on the random variable C taking a value of c in the sense of Dawid (1979).
8. In the application the support is defined slightly more conservatively than given in Table 1. It is the region between the second largest and second smallest values of the respective propensity scores in the reference distribution.
9. Imbens (2000) and Lechner (2001) develop the same argument to show that in static multiple treatment models conditioning on appropriate one-dimensional scores is sufficient.
10. Since space constraints do not allow reproducing the data generating process exactly, a Gauss 6.0 program is available from the author on request.
11. See the papers by Frölich (2006), Rosenbaum (1984), and Lechner (forthcoming) on the type of endogenous variables that can be allowed for in the static causal model under selection on observables and on the consequences if variables that exhibit other types of endogeneity are included.
12. Note that their effective sample sizes and composition after imposing common support depend on the sequences under investigation. Thus, computing the effects of 222001 compared to 222000 for participants in 2 by using the results of 222000 compared to 000000 for participants in 2 and 222001 compared to 000000 for participants in 2 is not strictly valid because those comparisons may be based on different common supports. The direct comparison of 222000 to 222001 could be based on yet another common support.
13. For the longer sequence the intermediate outcomes of only every second period are included in the probits, because otherwise it would not converge for the last node in the sequence.
14. For the comparison of 1 to 01, the specification with fewer covariates captures the employment history before 2(4) only with variables from period 2(2) as well as with the usual time-constant variables, whereas the full specification uses a much richer set of covariates.
ACKNOWLEDGMENTS
This chapter benefited considerably from previous work with Ruth Miquel about sequential treatment models, as well as from the consistency checks she performed with the artificial data. I thank Conny Wunsch for careful proofreading of a previous version of this chapter. Furthermore, I thank Jeff Smith and two anonymous referees for very helpful comments and suggestions. I revised this chapter while visiting the University of Michigan. The hospitality of the Department of Economics at the University of Michigan is appreciated.
REFERENCES
Abbring, J. H. (2003). Dynamic econometric program evaluation. IZA, Discussion Paper, 804.
Abbring, J. H., & Heckman, J. J. (forthcoming). Dynamic policy analysis. Mimeo [Forthcoming as Chapter 24. In: L. Matyas & P. Sevestre (Eds), The econometrics of panel data (3rd ed.). Dordrecht: Kluwer].
Abbring, J. H., & van den Berg, G. (2003). The nonparametric identification of treatment effects in duration models. Econometrica, 71, 1491–1517.
Angrist, J. D., & Krueger, A. B. (1999). Empirical strategies in labor economics. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. III A, pp. 1277–1366). Amsterdam: North-Holland.
Behrman, J., Cheng, Y., & Todd, P. (2004). Evaluating preschool programs when length of exposure to the program varies: A nonparametric approach. Review of Economics and Statistics, 86, 108–132.
Behrman, J., Sengupta, P., & Todd, P. (2005). Progressing through PROGRESA: An impact assessment of a school subsidy experiment in Mexico. Economic Development and Cultural Change, 54/1, 237–275.
Bergemann, A., Fitzenberger, B., & Speckesser, S. (2004). Evaluating the dynamic employment effects of training programs in East Germany using conditional difference-in-differences. ZEW, Discussion Paper 04-41.
Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34, 305–334.
Chamberlain, G. (1992). Comment: Sequential moment restrictions in panel data. Journal of Business & Economic Statistics, 10, 20–26.
Crépon, B., & Kramarz, F. (2002). Employed 40 hours or not-employed 39: Lessons from the 1982 workweek reduction in France. Journal of Political Economy, 110, 1355–1389.
Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41, 1–31.
Ding, W., & Lehrer, S. F. (2003). Estimating dynamic treatment effects from Project STAR. Mimeo.
Fredriksson, P., & Johansson, P. (2003). Program evaluation and random program starts. IFAU Discussion Paper, 2003(1), Uppsala.
Frölich, M. (2006). A note on parametric and nonparametric regression in the presence of endogenous control variables. Discussion Paper. Department of Economics, University of St. Gallen.
Gerfin, M., & Lechner, M. (2002). Microeconometric evaluation of the active labour market policy in Switzerland. The Economic Journal, 112, 854–893.
Gill, R. D., & Robins, J. M. (2001). Causal inference for complex longitudinal data: The continuous case. The Annals of Statistics, 29(6), 1–27.
Heckman, J. J., LaLonde, R. J., & Smith, J. A. (1999). The economics and econometrics of active labor market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. III A, pp. 1865–2097). Amsterdam: North-Holland.
Imbens, G. W. (2000). The role of the propensity score in estimating dose–response functions. Biometrika, 87, 706–710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29.
Lechner, M. (1999). Earnings and employment effects of continuous off-the-job training in East Germany after unification. Journal of Business & Economic Statistics, 17, 74–90.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In: M. Lechner & F. Pfeiffer (Eds), Econometric evaluation of active labour market policies (pp. 43–58). Heidelberg: Physica.
Lechner, M. (2002). Programme heterogeneity and propensity score matching: An application to the evaluation of active labour market policies. Review of Economics and Statistics, 84, 205–220.
Lechner, M. (2004). Sequential matching estimation of dynamic causal models. Discussion Paper. University of St. Gallen (revised 2005).
Lechner, M. (forthcoming). A note on endogenous control variables in causal studies. Statistics and Probability Letters.
Lechner, M., & Miquel, R. (2001). A potential outcome approach to dynamic programme evaluation – Part I: Identification. Discussion Paper 2001-07. Department of Economics, University of St. Gallen (revised 2005).
Lechner, M., Miquel, R., & Wunsch, C. (2004). Long-run effects of public sector sponsored training in West Germany. Discussion Paper 2004-19. Department of Economics, University of St. Gallen.
Li, Y. P., Propert, K. J., & Rosenbaum, P. (2001). Balanced risk set matching. Journal of the American Statistical Association, 96, 870–882.
Manski, C. F. (2004). Social learning from private experiences: The dynamics of the selection problem. Review of Economic Studies, 71, 443–558.
Miquel, R. (2002). Identification of dynamic treatments effects by instrumental variables. Discussion Paper 2002-11. University of St. Gallen.
Miquel, R. (2003). Identification of effects of dynamic treatments with a difference-in-differences approach. Discussion Paper 2003-06. University of St. Gallen.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B, 65, 331–366.
Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods – Application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393–1512, with 1987 Errata to: A new approach to causal inference in mortality studies with sustained exposure periods – Application to control of the healthy worker survivor effect. Computers and Mathematics with Applications, 14, 917–921; 1987 Addendum to: A new approach to causal inference in mortality studies with sustained exposure periods – Application to control of the healthy worker survivor effect. Computers and Mathematics with Applications, 14, 923–945; and 1987 Errata to: Addendum to ‘A new approach to causal inference in mortality studies with sustained exposure periods – Application to control of the healthy worker survivor effect’. Computers and Mathematics with Applications, 18, 477.
Robins, J. M. (1989). The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: L. Sechrest, H. Freeman & A. Mulley (Eds), Health service research methodology: A focus on AIDS (pp. 113–159). Washington, DC: Public Health Service, National Centre for Health Services Research.
Robins, J. M. (1997). Causal inference from complex longitudinal data. Latent variable modelling and applications to causality. In: M. Berkane (Ed.), Lecture notes in statistics (Vol. 120, pp. 69–117). New York: Springer.
Robins, J. M. (1999). Association, causation, and marginal structural models. Synthese, 121, 151–179.
Robins, J. M., Greenland, S., & Hu, F. (1999a). Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. Journal of the American Statistical Association, 94, 687–700.
Robins, J. M., Greenland, S., & Hu, F. (1999b). Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome: Rejoinder. Journal of the American Statistical Association, 94, 708–712.
Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society, Series A, 147, 656–666.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–50.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (2004). On principles for modeling propensity scores in medical research. Pharmacoepidemiology and Drug Safety, 13, 855–857.
Sianesi, B. (2004). An evaluation of the Swedish system of active labour market programmes in the 1990s. The Review of Economics and Statistics, 86, 133–155.
Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metric and a Monte Carlo study. The Review of Economics and Statistics, 86, 91–107.
APPENDIX A. DESCRIPTION OF OUTCOME HETEROGENEITY
Tables A1a, A1b, A2a and A2b contain some descriptive statistics for the true potential outcome variables that can be used to deduce the true effects. To avoid flooding the reader with numbers that are not helpful in interpreting the results that appear in the main body of the paper, we give means and standard deviations for earnings (Tables A1a and A1b) and employment (Tables A2a and A2b) only for sequences and subsamples that have some relation to the effects considered above.
Table A1a. Mean and Standard Deviation of Potential Earnings in 7(4) in Selected Subsamples.
Columns: subsample defined by treatment state in period 3(1).

Sequence     Subsample 0   Subsample 1   Subsample 2   Subsample 3   Subsample 4
             Mean     SD   Mean     SD   Mean     SD   Mean     SD   Mean     SD
000000        933  1,587  3,613  4,108  4,378  4,169    624    796    542    782
01          1,432  1,588  4,055  3,975  4,815  4,023  1,121    877  1,026    838
001         1,506  1,735  4,209  4,321  5,000  4,316  1,172    959  1,079    941
0001        1,342  1,722  4,089  4,320  4,830  4,384  1,013    924    923    911
000001      1,283  1,720  3,936  4,239  4,724  4,309    955    935    858    889
1           1,470  1,581  4,073  3,968  4,822  3,920  1,157    858  1,062    828
11          1,481  1,557  4,105  3,933  4,858  3,967  1,164    867  1,070    838
111         1,500  1,588  4,166  4,012  4,924  4,042  1,179    885  1,084    857
1111        1,510  1,617  4,216  4,097  4,977  4,128  1,186    902  1,089    872
10          1,432  1,488  4,005  3,781  4,683  3,778  1,129    832  1,035    797
110         1,498  1,570  4,094  3,905  4,784  3,906  1,148    853  1,054    822
1110        1,497  1,583  4,150  3,995  4,908  4,024  1,177    832  1,089    872
11110       1,517  1,615  4,297  4,176  4,968  4,120  1,194    903  1,097    872
111000      1,337  1,661  4,102  4,208  4,885  4,236  1,009    880    913    869
111002      1,440  1,685  3,781  4,157  4,406  4,329  1,173  1,074  1,104  1,069
2           1,793  1,777  4,338  4,214  4,973  4,311  1,505  1,152  1,456  1,173
22          1,795  1,783  4,351  4,232  4,989  4,328  2,263  2,625  1,457  1,176
222         1,800  1,796  4,377  4,266  5,017  4,359  1,510  1,162  1,459  1,183
2222        1,802  1,830  4,437  4,361  5,086  4,445  1,507  1,179  1,454  1,200
222000      1,657  1,723  4,197  4,152  4,841  4,252  1,369  1,080  1,326  1,107
222001      2,041  1,867  4,567  4,304  5,254  4,424  1,726  1,186  1,669  1,199
3           1,056  1,655  3,829  4,295  4,593  4,356    754    870    655    847
33          1,057  1,661  3,838  4,310  4,603  4,371  1,570  2,624    655    850
333         1,065  1,676  3,867  4,354  4,635  4,412  1,581  2,648    660    857
0003        1,067  1,707  3,802  4,398  4,589  4,511  1,574  2,671    686    873
00033       1,068  1,713  3,812  4,419  4,601  4,534    755    878    686    877
000333      1,066  1,731  3,830  4,469  4,622  4,586    751    885    682    886
4           1,120  1,630  3,853  4,292  4,611  4,349    824    850    748    838
000004      1,124  1,673  3,668  4,275  4,418  4,381    835    874    763    837
444000      1,033  1,642  3,757  4,320  4,523  4,388    735    832    662    833
444003      1,168  1,658  3,850  4,348  4,581  4,430    879    865    812    844

Number of observations: 69,951 (subsample 0); 8,997 (subsample 1); 8,665 (subsample 2); 4,964 (subsample 3); 7,423 (subsample 4).
Note: The first element of each sequence refers to period 3(1). Mean, mean in subsample; SD, standard deviation in subsample.
Table A1b. Mean and Standard Deviation of Potential Earnings in 9(4) in Selected Subsamples.
Columns: subsample defined by treatment state in period 3(1).

Sequence     Subsample 0   Subsample 1   Subsample 2   Subsample 3   Subsample 4
             Mean     SD   Mean     SD   Mean     SD   Mean     SD   Mean     SD
000000        990  1,638  3,696  4,109  4,487  4,173    670    863    587    850
01          1,275  1,707  3,975  4,169  4,757  4,219    938    981    848    961
001         1,290  1,832  4,065  4,382  4,837  4,486    951  1,062    851  1,043
0001        1,234  1,819  3,982  4,397  4,779  4,512    896  1,021    800  1,011
000001      1,242  1,785  3,923  4,263  4,720  4,391    923  1,019    814    977
1           1,337  1,690  4,022  4,087  4,786  4,131  1,007    967    916    955
11          1,348  1,707  4,054  4,133  4,821  4,175  1,015    975    924    963
111         1,362  1,738  4,109  4,216  4,883  4,251  1,026    993    933    978
1111        1,398  1,775  4,192  4,314  4,968  4,339  1,057  1,022    963  1,009
10          1,297  1,635  3,896  3,940  4,646  3,996    980    938    886    929
110         1,331  1,680  3,989  4,063  4,752  4,113  1,003    960    912    952
1110        1,351  1,730  4,083  4,194  4,855  4,233  1,016    987    923    971
11110       1,405  1,771  4,186  4,300  4,959  4,328  1,065  1,021    971  1,008
111000      1,275  1,707  3,959  4,321  4,762  4,367    816    954    729    940
111002      1,866  1,856  4,372  4,240  5,037  4,355  1,173  1,074  1,500  1,215
2           1,991  1,025  4,502  4,258  5,151  4,348  1,661  1,181  1,456  1,173
22          1,990  1,843  4,511  4,274  5,162  4,363  2,449  2,664  1,636  1,222
222         1,889  1,853  4,528  4,307  5,182  4,392  1,656  1,186  1,633  1,225
2222        1,977  1,878  4,567  4,388  5,232  4,469  1,639  1,190  1,613  1,232
222000      1,791  1,751  4,274  4,147  4,934  4,242  1,462  1,073  1,444  1,118
222001      2,060  1,922  4,559  4,323  5,233  4,438  1,734  1,250  1,686  1,259
3           1,025  1,712  3,705  4,341  4,467  4,436    697    902    617    888
33          1,026  1,717  3,713  4,355  4,477  4,450  1,520  2,653    617    890
333         1,034  1,734  3,743  4,399  4,508  4,492  1,532  2,677    622    899
0003        1,002  1,752  3,685  4,446  4,479  4,589  1,499  2,706    606    931
00033       1,003  1,759  3,696  4,466  4,491  4,611    755    878    607    934
000333      1,004  1,774  3,716  4,508  4,514  4,653    751    885    606    939
4           1,116  1,715  3,913  4,352  4,695  4,396    788    912    704    899
000004      1,041  1,743  4,502  4,258  4,464  4,461    742    958    886    926
444000      1,056  1,716  3,830  4,344  4,620  4,405    724    897    649    886
444003      1,047  1,734  3,685  4,394  4,413  4,474    744    922    669    917

Number of observations: 69,951 (subsample 0); 8,997 (subsample 1); 8,665 (subsample 2); 4,964 (subsample 3); 7,423 (subsample 4).
Note: The first element of each sequence refers to period 3(1). Mean, mean in subsample; SD, standard deviation in subsample.
MICHAEL LECHNER
Table A2a. Percentage Employed in 7(4) in Selected Subsamples.
Columns: subsample defined by treatment state in period 3(1).

Sequence       0     1     2     3     4
000000        58    71    74    56    50
01            96    97    97    95    95
001           95    96    95    95    95
0001          92    94    96    91    91
000001        91    93    94    89    89
1             97    98    98    97    97
11            97    98    98    97    97
111           97    98    98    97    97
1111          97    98    98    97    97
10            97    98    98    97    97
110           97    98    98    97    97
1110          97    98    98    97    97
11110         97    98    98    97    97
111000        96    98    98    95    95
111002        90    90    89    90    91
2             81    82    82    82    80
22            81    82    82    81    80
222           81    83    82    82    80
2222          80    82    81    81    79
222000        79    81    80    80    78
222001        96    97    96    97    96
3             71    81    82    67    65
33            70    80    82    72    65
333           71    80    82    72    65
0003          76    83    84    77    73
00033         76    83    84    74    73
000333        76    83    84    74    73
4             75    85    86    78    75
000004        81    85    86    80    78
444000        76    82    85    68    72
444003        90    92    92    65    89

Number of observations: 69,951 (subsample 0); 8,997 (subsample 1); 8,665 (subsample 2); 4,964 (subsample 3); 7,423 (subsample 4).
Note: The first element of each sequence refers to period 3(1).
Matching Estimation of Dynamic Treatment Models
Table A2b. Percentage Employed in 9(4) in Selected Subsamples.
Columns: subsample defined by treatment state in period 3(1). Entries are employment in 9(4) in % points.

Sequence       0     1     2     3     4
000000        58    72    75    56    50
01            78    85    87    76    74
001           79    85    86    77    76
0001          78    85    86    77    75
000001        84    89    90    83    81
1             80    87    89    81    78
11            81    87    89    81    78
111           82    87    89    82    79
1111          82    87    89    82    79
10            80    86    88    80    76
110           81    87    87    81    78
1110          81    87    89    82    79
11110         82    88    90    83    80
111000        77    85    87    77    74
111002        88    88    89    89    87
2             86    84    83    87    86
22            86    84    83    85    86
222           86    84    83    87    86
2222          86    84    83    86    85
222000        85    83    82    86    84
222001        95    94    94    94    94
3             58    70    72    55    50
33            58    70    72    59    50
333           59    70    72    59    50
0003          56    69    72    58    49
00033         56    69    72    75    49
000333        56    69    72    75    49
4             71    80    82    70    65
000004        69    78    80    67    64
444000        69    79    81    68    63
444003        66    75    77    65    61

Number of observations: 69,951 (subsample 0); 8,997 (subsample 1); 8,665 (subsample 2); 4,964 (subsample 3); 7,423 (subsample 4).
Note: The first element of each sequence refers to period 3(1).
PANEL DATA MODELS AND TRANSITORY FLUCTUATIONS IN THE EXPLANATORY VARIABLE
Terra McKinnish
ABSTRACT
This chapter demonstrates that fixed-effects and first-differences models often understate the effect of interest because of the variation used to identify the model. In particular, the within-unit time-series variation often reflects transitory fluctuations that have little effect on behavioral outcomes. The data in effect suffer from measurement error, as a portion of the variation in the independent variable has no effect on the dependent variable. Two empirical examples are presented: one on the relationship between AFDC and fertility and the other on the relationship between local economic conditions and AFDC expenditures.
Modelling and Evaluating Treatment Effects in Econometrics Advances in Econometrics, Volume 21, 335–358 Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00011-4
1. INTRODUCTION
This chapter explores the use of panel data models in cases in which the effect of interest may not be well estimated using the within-unit time-series variation in conditions. While it is well known that panel data models can exacerbate attenuation bias due to measurement error, it is not widely appreciated that
these results apply to a broader set of circumstances. Even though the independent variable may not be measured with error, per se, it still may mismeasure the factor that truly affects the outcome of interest. For example, we may know the exact value of state-welfare benefits from administrative records, but not all of the variation in these benefit levels will necessarily influence behavior. In particular, we expect many outcomes to respond differently to short- and long-term variation in conditions. This differential effect of long- and short-term variation can generate the same bias as “true” measurement error.
This chapter investigates two empirical examples. The first studies the effect of benefit levels in the Aid to Families with Dependent Children (AFDC) program on birth rates. The second studies the effect of local economic conditions on AFDC expenditures. In both of these cases, the outcome of interest is likely more responsive to sustained changes in the explanatory variable than transitory fluctuations. For both examples, we first compare estimates from fixed-effects, first-differences, and long-differences models to confirm the pattern that is consistent with measurement error and, therefore, these differential effects.
The estimates of the effect of local economic conditions on AFDC expenditures are also compared to results using unemployment insurance (UI) expenditures as the dependent variable. Both AFDC and UI are income maintenance programs. As a result, we expect expenditures in both programs to increase during economic downturns. Because, however, AFDC is a program with relatively long-term receipt and UI is a program with short-term receipt, we expect UI expenditures to be more sensitive to short-term variation in economic conditions than AFDC participation.
In this case, analytic results show that the patterns of coefficient estimates from differences and lagged effects models should differ systematically between the AFDC regressions and the UI regressions. These differential patterns are confirmed in the data.
For both examples, we attempt to obtain consistent estimates of the effect of interest. In the birth rate example, in the absence of external instruments, lagged internal instruments are considered. We demonstrate that lagged instruments will tend to be “weak” and those that are sufficiently correlated with the independent variable will be most sensitive to correlation in the measurement error. In the welfare expenditures example, the decline of the steel industry is used as an instrument that predicts long-term changes in local economic conditions. A comparison of OLS and IV estimates from differences models and lagged effects models illustrates the differential effect of transitory and
sustained changes in economic conditions. This exercise also clarifies the differences in estimates from models with and without lagged effects that have been highlighted in many recent studies.
2. MEASUREMENT ERROR IN PANEL DATA
This section reviews the basic analytical measurement error results for panel data from Griliches and Hausman (1986). Consider the following model:
$$Y_{it} = \alpha_i + \beta Z_{it} + \varepsilon_{it} \tag{1}$$
$$X_{it} = Z_{it} + v_{it} \tag{2}$$
For example, in the first empirical application, $Y_{it}$ represents the birth rate in state $i$ at time $t$, $X_{it}$ the welfare benefit, $Z_{it}$ the sustained component of the welfare benefit, and $v_{it}$ the transitory component of the welfare benefit. The term “measurement error” is not being used in the conventional manner, because we know the precise value of the AFDC benefit. But it is still the case that the benefit level ($X$) may mismeasure the factor that truly affects fertility, the sustained component of AFDC benefits ($Z$).
The results that follow assume classical measurement error, in which $v_{it}$ and $\varepsilon_{it}$ are white noise, both uncorrelated with $Z_{it}$. It is also assumed that a stationary process generates $Z_{it}$. If we consider a differences model regressing $Y$ on the observed variable, $X$:
$$Y_{it} - Y_{it-j} = \beta_d (X_{it} - X_{it-j}) + (\varepsilon_{it} - \varepsilon_{it-j}) \tag{3}$$
the asymptotic result is:
$$\operatorname{plim}(\hat{\beta}_d) = \beta \left( \frac{\sigma_Z^2 (1 - \rho_j)}{\sigma_Z^2 (1 - \rho_j) + \sigma_v^2} \right) \tag{4}$$
where $\rho_j$ is the correlation between $Z_{it}$ and $Z_{it-j}$. Consider the first-differences model in which $j = 1$. If $Z$ is highly correlated over time and two observations of $X$ from adjoining time periods are differenced, most of the information about $Z$ will be eliminated, leaving primarily variation due to the noise component, $v$. If $Z_{it}$ has a declining correlogram then $\rho_j < \rho_1$, and estimates obtained using longer differences will be less inconsistent than estimates using shorter differences. By taking differences of observations that are less correlated with each other, the variance of the signal is increased
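relative to the noise. If $Z_{it}$ has a declining correlogram, $\operatorname{plim}(\hat{\beta}_d)$ converges to the well-known cross-sectional result $\beta \sigma_Z^2 / (\sigma_Z^2 + \sigma_v^2)$ as $j$ becomes large, since $\rho_j$ then goes to zero in Eq. (4). If the fixed-effects model is used:
$$Y_{it} - \bar{Y}_i = \beta_{fe} (X_{it} - \bar{X}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i) \tag{5}$$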
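then:
$$\operatorname{plim}(\hat{\beta}_{fe}) = \beta \left( \frac{\sigma_Z^2 \left[ \frac{T-1}{T} - \frac{2}{T^2} \sum_{k=1}^{T-1} (T-k) \rho_k \right]}{\sigma_Z^2 \left[ \frac{T-1}{T} - \frac{2}{T^2} \sum_{k=1}^{T-1} (T-k) \rho_k \right] + \frac{T-1}{T} \sigma_v^2} \right) \tag{6}$$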
Griliches and Hausman (1986) summarize a number of general results comparing the fixed-effects, first-differences, and long-differences models under the assumption that $Z_{it}$ is independent of $v_{it}$ and stationary with a declining correlogram. They show that for $T > 2$:
(a) The inconsistency of fixed-effects and long-differences is less than that of first-differences.
(b) The inconsistency of the $(T-1)$ long-differences estimator is less than that of the fixed-effects estimator.
(c) The relative inconsistency of differences shorter than $(T-1)$ and fixed-effects depends on more specific characteristics of the correlation structure.
These analytical results suggest that if the dependent variable is more responsive to long-term changes in the independent variable than transitory fluctuations, we should observe a distinct pattern when we compare first-differences, fixed-effects, and long-differences estimates. First, all estimates should be attenuated toward zero. The attenuation should be greatest in the first-differences estimates and reduced as longer differences are used. The fixed-effects estimates should be larger in magnitude than the first-differences estimates but there is no prediction about the relative magnitude of the fixed-effects and long-differences results.1
It is important to note that other forms of misspecification can also produce a similar pattern. If there is a time-varying unobserved confounder and this unobserved confounder is more highly correlated over time than the observed independent variable, this generates the same pattern in the fixed-effects, first-differences, and long-differences models.2 In this case, the bias will be smallest for first-differences, since the unobserved characteristic
will largely be differenced out, but the bias will be larger for differences over longer periods that allow more change in the unobserved characteristic.
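As a numerical illustration of results (a) and (b), the following sketch evaluates the attenuation factor of Eq. (4) for first-differences and for the (T-1) long-differences, together with the corresponding fixed-effects factor built from the within-unit variance of Z. It assumes an AR(1) correlogram, so that the correlation at lag k is rho to the power k, and the parameter values (rho = 0.9, T = 10, signal variance 1, noise variance 0.1) are chosen purely for illustration. The function names are hypothetical and nothing here uses the chapter's data.

```python
def difference_factor(rho_j, var_z, var_v):
    """plim(beta_d_hat) / beta for a j-period difference, as in Eq. (4)."""
    signal = var_z * (1.0 - rho_j)
    return signal / (signal + var_v)


def fixed_effects_factor(rhos, var_z, var_v):
    """plim(beta_fe_hat) / beta for T = len(rhos) + 1 periods.

    rhos[k - 1] is the correlation between Z_it and Z_it-k.
    The noise v is assumed to be white, as in Section 2.
    """
    T = len(rhos) + 1
    # Share of Var(Z) that survives the within (demeaning) transformation.
    within_share = (T - 1) / T - (2.0 / T**2) * sum(
        (T - k) * rhos[k - 1] for k in range(1, T)
    )
    signal = var_z * within_share
    noise = var_v * (T - 1) / T
    return signal / (signal + noise)


# Illustrative (assumed) values, not estimates from the chapter.
rho, T, var_z, var_v = 0.9, 10, 1.0, 0.1
rhos = [rho**k for k in range(1, T)]

fd = difference_factor(rhos[0], var_z, var_v)    # first-differences
ld = difference_factor(rhos[-1], var_z, var_v)   # (T-1)-period long-differences
fe = fixed_effects_factor(rhos, var_z, var_v)    # fixed-effects

print(f"FD {fd:.3f}  FE {fe:.3f}  LD {ld:.3f}")
```

Under these assumed values the factors are ordered first-differences, then fixed-effects, then long-differences, which is the ranking stated in (a) and (b). Other correlograms can be explored by passing a different `rhos` list.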
3. DESCRIPTION OF EMPIRICAL EXAMPLES
3.1. AFDC Benefits and Birth Rates
A simple economic model suggests that any government transfer program that lowers the cost of supporting a child should increase birth rates. This section examines the effect of the generosity of the AFDC program on birth rates of young women, using a panel of state-level data on age-specific birth rates and AFDC benefit levels from 1973 to 1992.
A number of studies have used panel data models to study the relationship between welfare and fertility (Jackson & Klerman, 1994; Matthews, Ribar, & Wilhelm, 1997; Clark & Strauss, 1998; Rosenzweig, 1999; Hoffman & Foster, 2000; Argys, Averett, & Rees, 2000). Similar studies have examined female headship, an outcome which combines fertility and marital decisions (Moffitt, 1994; Hoynes, 1997). Panel data studies typically have found positive, but modest, effects of benefits on fertility. The two studies that found large effects (Clark & Strauss, 1998; Rosenzweig, 1999) arguably used specifications that identify the effect from longer-term variation in benefits.3
As childbearing is a commitment to consumption not only in the period of birth but in future periods as well, we expect fertility decisions to be relatively non-responsive to transitory changes in welfare generosity. This raises the question of what variation will be available to identify the effect of interest in panel data models.
Fig. 1 graphs the nominal AFDC benefit for a family of four over time in three states: Massachusetts, Montana, and Mississippi. This benefit level is guaranteed to a family of four with no additional income, a legislated parameter of the program that is known with certainty. There are clearly permanent cross-sectional differences between the three states in their general benefit level. Looking within each state over time, the graph shows that Massachusetts’s time trend contains frequent small fluctuations.
Mississippi’s time trend, on the other hand, contains no such fluctuations. The state changes benefits only twice in the period from 1970 to 1994. Montana’s time trend shows both transitory and sustained variation.
Fig. 1. Nominal AFDC Benefits, Family of Four, 1970–1994. (Plotted by year for Massachusetts, Montana, and Mississippi.)
Fig. 1 suggests that the cross-sectional differences in benefit level eliminated by differencing are fairly permanent, but much of the time-series variation
remaining once the cross-sectional effects are eliminated reflects transitory fluctuations that likely have little bearing on fertility rates.4
Data used in this example are counts of births by state, year, age, and race of mother for 1973–1992 obtained using the Detailed Natality Files from the National Center for Health Statistics. This chapter reports the results for white women ages 20–24.5 Denominators for the birth rates are obtained from various US Census files. Births that occur in the first 9 months of a year are attributed to the previous year. The independent variable of interest is the AFDC benefit for a family of four with no additional income.6 Measures of real earnings per capita are obtained from the Bureau of Economic Analysis’ (BEA) Regional Economic Information System (REIS).7
The differences regression equation is:
$$\Delta_j \text{BirthRate}_{st} = \beta_1 \Delta_j \text{AFDC}_{st} + \beta_2 \Delta_j \text{Earnpc}_{st} + \text{Year}_t \beta_3 + \mu_{st} \tag{7}$$
where $\Delta_j$ indicates a $j$-year difference, $\text{BirthRate}_{st}$ is the logged birth rate for white women of ages 20–24 in state $s$ at year $t$, $\text{AFDC}_{st}$ the logged real
AFDC benefit, and $\text{Earnpc}_{st}$ the logged real earnings per capita. $\text{Year}_t$ is a vector of year indicators. The fixed-effects model is therefore:
$$\text{BirthRate}_{st} = \beta_0 + \beta_1 \text{AFDC}_{st} + \beta_2 \text{Earnpc}_{st} + \text{Year}_t \beta_3 + \text{State}_s \beta_4 + \varepsilon_{st} \tag{8}$$
where $\text{State}_s$ is a vector of state indicator variables.
3.2. Local Economic Conditions and Welfare Expenditures
The second empirical example examines the relationship between AFDC program participation and local labor market conditions, using county-level data on earnings and AFDC expenditures from 1969 to 1993. The earlier research on this topic (Fitzgerald, 1995; Miller & Sanders, 1997; Hoynes, 2000) estimated the relationship between contemporaneous economic conditions and AFDC participation using micro-level data. The more recent studies (Bartik & Eberts, 1999; CEA, 1999; Figlio & Ziliak, 1999; Wallace & Blank, 1999; Ziliak, Figlio, Davis, & Connolly, 2000; Mueser, Hotchkiss, King, Rokicki, & Stevens, 2000; Blank, 2001) have used aggregate caseload data and include lagged values of economic conditions in their specification. The general finding in this literature is that panel data models that only use contemporaneous economic conditions find a very modest relationship, but those models that add up multiple lagged effects find much larger effects. The relationship between the lagged effects specifications and the specifications used in this chapter will be pursued in a later section of the chapter.
The basic regression model is:
$$\Delta_j \text{AFDC}_{ist} = \beta_0 + \beta_1 \Delta_j \text{Earnings}_{ist} + \beta_2 \Delta_j \text{Pop}_{ist} + (\text{State}_s \times \text{Year}_t) \beta_3 + \varepsilon_{ist} \tag{9}$$
where the $\Delta_j$ operator again indicates a $j$-year difference, $\text{AFDC}_{ist}$ is the logarithm of real AFDC expenditures for county $i$ in state $s$ and time $t$, $\text{Earnings}_{ist}$ the logarithm of aggregate real earnings, $\text{Pop}_{ist}$ the logarithm of county population, and $(\text{State}_s \times \text{Year}_t)$ the interaction of state and year dummy variables. The fixed-effects model is:
$$\text{AFDC}_{ist} = \beta_0 + \beta_1 \text{Earnings}_{ist} + \beta_2 \text{Pop}_{ist} + (\text{State}_s \times \text{Year}_t) \beta_3 + \text{County}_i \beta_4 + \varepsilon_{ist} \tag{10}$$
Annual measures of county earnings, AFDC expenditures, and county population are obtained from the BEA REIS data from 1969 to 1993. AFDC expenditures are a function of benefit levels and caseload. Because, however,
benefit levels vary at the state level, the state-year effects net out the effect of benefits. The within-state variation in expenditures at the county level should largely reflect changes in caseloads.
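The transformations behind the differences and fixed-effects specifications above can be sketched on a toy balanced panel. The example below is a simplified illustration only: it uses NumPy, a made-up noiseless outcome with a true slope of 2, and hypothetical helper names, and it omits the year and state-year effects and the additional controls that appear in Eqs. (7)–(10). With 51 units and 20 periods (the dimensions of the state panel), a j-year difference leaves 51 × (20 − j) observations.

```python
import numpy as np


def long_difference(panel, j):
    """Return panel[:, t] - panel[:, t - j] for every t >= j."""
    return panel[:, j:] - panel[:, :-j]


def within_transform(panel):
    """Subtract each unit's own time mean (the fixed-effects transformation)."""
    return panel - panel.mean(axis=1, keepdims=True)


def ols_slope(y, x):
    """Bivariate OLS slope with an intercept, pooling all observations."""
    x = x.ravel() - x.mean()
    y = y.ravel() - y.mean()
    return float(x @ y / (x @ x))


rng = np.random.default_rng(0)
units, periods = 51, 20                 # 50 states plus DC, 1973-1992
x = rng.normal(size=(units, periods))   # made-up regressor
alpha = rng.normal(size=(units, 1))     # hypothetical unit effects
y = alpha + 2.0 * x                     # hypothetical noiseless outcome

lengths = (1, 3, 5, 7)
sample_sizes = {j: long_difference(x, j).size for j in lengths}
slopes = {j: ols_slope(long_difference(y, j), long_difference(x, j)) for j in lengths}
fe_slope = ols_slope(within_transform(y), within_transform(x))

print(sample_sizes)
print(slopes, fe_slope)
```

Because the toy outcome contains no noise and no transitory component, every estimator recovers the same slope; the point of the sketch is only to show how the unit effects drop out and how the usable sample shrinks as the difference length grows.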
4. FIXED-EFFECTS AND DIFFERENCES RESULTS
Results from the fixed-effects and differences models specified in Eqs. (7)–(10) are reported in Table 1. The first column reports results from the regression of birth rates on state AFDC benefit levels. The second column reports results from the regression of county AFDC expenditures on county earnings. Results are reported for both models for fixed-effects, first-differences, 3-year long-differences, 5-year long-differences, and 7-year long-differences. For the AFDC expenditures regressions, the larger sample size and longer time series also allow the estimation of a 20-year long-differences model.

Table 1. Fixed-effects and Differences Estimates.

                       AFDC Benefits and Fertility     County Earnings and AFDC Expenditures
Estimation Method      All States 1973–1992       N    All Counties 1969–1993             N
Fixed-effects          0.123 (0.022)          1,020    0.244 (0.006)                 70,621
First-differences      0.017 (0.023)            969    0.038 (0.006)                 67,302
3-year differences     0.074 (0.051)            867    0.165 (0.011)                 61,279
5-year differences     0.094 (0.059)            765    0.219 (0.014)                 55,391
7-year differences     0.116 (0.059)            663    0.246 (0.017)                 49,518
20-year differences                                    0.268 (0.038)                 12,884

Notes: Column 1: Dependent variable is the logarithm of state birth rate for white women, ages 20–24. Table reports coefficient on logarithm of state AFDC benefit level. Per capita income and year effects are included. Standard errors, in parentheses, are clustered at the state level. Data include all 50 states and the District of Columbia. Column 2: Dependent variable is the logarithm of county AFDC expenditures. Table reports the coefficient on the logarithm of total county earnings. All regressions include controls for county population and state-year effects. Standard errors, in parentheses, are clustered at the county level. There are 3,182 counties in the United States. Missing observations are due to suppression of the AFDC expenditures variable in small counties (suppression rate is 9.7%).
For both empirical examples, the results display the expected pattern. Consistent with the analytical predictions from Section 2, the first-differences coefficient estimates are smaller in magnitude than the fixed-effects estimates. Also consistent with the analytical predictions, the coefficient estimates increase in magnitude with the length of the differences.8 The change in the coefficients is substantial, in both cases increasing almost an order of magnitude between the first-differences and the 7-year long-differences.9 In both columns of Table 1, the fixed-effects estimate is larger in magnitude than the 5-year differences estimate, and in the fertility example, is larger than even the 7-year differences estimate.
These results in Table 1 provide some guidance about selection of panel data models. If the dependent variable is relatively unresponsive to transitory fluctuations in the explanatory variable, then first-differences models perform poorly. In these examples, the fixed-effects estimates and the long-differences estimates are similar in magnitude, but because much of the data is eliminated with long-differences models, their precision is considerably less than that of the fixed-effects estimates. This suggests that in most cases fixed-effects is the preferred estimator.
There are some important caveats to the above conclusion. First, the analytical results discussed in Section 2 indicate that we can only be certain that the fixed-effects estimate will be less inconsistent than the first-differences estimate. In other applications, long-differences could be sufficiently less inconsistent than fixed-effects to warrant the loss of precision. Second, it must be remembered that all of the estimates in Table 1, including the fixed-effects estimates, suffer from attenuation bias. The bias is simply less severe than for first-differences. Finally, the performance of the fixed-effects estimator will depend in part on the length of the available time series.
5. A COMPARISON OF AFDC AND UI RESULTS This chapter argues that the results in Table 1 are due to the differential effect of long-term and short-term changes in conditions, rather than just omitted variable bias. A useful exercise to support this claim is to compare the results from the analysis of AFDC expenditures with results from an analysis of UI expenditures. Like AFDC, UI is an income maintenance program, but one that is specifically designed to act as a buffer to business cycle fluctuations. The program typically provides qualified recipients 50–70% of their previous wages for up to 26 weeks. As the UI program is designed to sustain workers through temporary job losses, UI expenditures should be quite sensitive to transitory fluctuations in economic conditions. Because of limits on the
duration of benefits, it is unlikely that UI expenditures are as responsive to long-term changes in economic conditions as AFDC expenditures.10 As a result, there should be a very different pattern predicted for the empirical results.
5.1. Analytical Results
This section reports analytical results for a model in which the dependent variable is most responsive to long-term changes in the explanatory variable and for a model in which the dependent variable is most responsive to short-term changes in the explanatory variable. Analytical results for differences models and for differences models with a lagged effect will be discussed.11 Consider a slight variation of the measurement error model described in Eqs. (1) and (2):
$$Y_{it} = \alpha_i + \beta Z_{it} + \varepsilon_{it} \tag{11}$$
$$X_{it} = Z_{it} + v_{it} \tag{12}$$
$$Z_{it} = \rho Z_{it-1} + \theta_{it} \tag{13}$$
and
$$v_{it} = \delta v_{it-1} + \mu_{it} \tag{14}$$
where $0 < \delta < 1$ and $0 < \rho < 1$. If $\delta < \rho$, then $Y$ is a function of the more highly correlated component of the observed variable. If $\delta > \rho$, then $Y$ is a function of the less correlated component of the observed variable.12 For the differences model described in Eq. (3), the asymptotic result is now:
$$\operatorname{plim}(\hat{\beta}_d) = \beta \, \frac{\sigma_z^2 (1 - \rho^j)}{\sigma_z^2 (1 - \rho^j) + \sigma_v^2 (1 - \delta^j)} \tag{15}$$
If $\delta < \rho$, the anticipated pattern in the differences estimates remains the same as described in Section 2. The first-differences will be the most attenuated, as much of the signal is differenced out. Longer differences will capture proportionally more signal and be less attenuated. If, instead, $\delta > \rho$, so that the dependent variable responds to the more transitory fluctuations in $X$, the reverse is true. The first-differences estimates will be least attenuated, as the signal-to-noise ratio will be greatest. The estimates should decrease in magnitude as longer differences are used and the signal-to-noise ratio decreases.
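The two cases can be made concrete by evaluating the ratio of the probability limit to the true coefficient implied by Eq. (15) at several difference lengths. The sketch below is illustrative only: the parameter values (persistence of 0.9 and 0.2, equal variances of 1) and the function name are assumptions, not quantities estimated in the chapter.

```python
def difference_factor_ar1(j, rho, delta, var_z, var_v):
    """plim(beta_d_hat) / beta from Eq. (15) for a j-period difference."""
    signal = var_z * (1.0 - rho**j)      # differenced variance of Z (up to a factor 2)
    noise = var_v * (1.0 - delta**j)     # differenced variance of v (up to a factor 2)
    return signal / (signal + noise)


lengths = (1, 3, 5, 7, 20)

# Case delta < rho: Y responds to the more persistent component of X.
sustained = [
    difference_factor_ar1(j, rho=0.9, delta=0.2, var_z=1.0, var_v=1.0)
    for j in lengths
]

# Case delta > rho: Y responds to the more transitory component of X.
transitory = [
    difference_factor_ar1(j, rho=0.2, delta=0.9, var_z=1.0, var_v=1.0)
    for j in lengths
]

for j, s, t in zip(lengths, sustained, transitory):
    print(f"j={j:2d}  delta<rho: {s:.3f}  delta>rho: {t:.3f}")
```

With these assumed values the first list rises with the difference length and the second falls, which is the contrast between the two cases described above.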
Now consider a differences regression model that includes a lagged effect of the independent variable:
$$Y_{it} - Y_{it-j} = \beta_d (X_{it} - X_{it-j}) + \beta_l (X_{it-1} - X_{it-j-1}) + (\varepsilon_{it} - \varepsilon_{it-j}) \tag{16}$$
For $j = 1$,
$$\operatorname{plim}(\hat{\beta}_l) = \frac{2 \beta \sigma_v^2 \sigma_z^2 \left[ (1-\rho)(1-\delta)^2 - (1-\delta)(1-\rho)^2 \right]}{4 \left[ \sigma_z^2 (1-\rho) + \sigma_v^2 (1-\delta) \right]^2 - \left[ \sigma_z^2 (1-\rho)^2 + \sigma_v^2 (1-\delta)^2 \right]^2} \tag{17}$$
and, for $j > 1$,
$$\operatorname{plim}(\hat{\beta}_l) = \frac{2 \beta \sigma_v^2 \sigma_z^2 \left[ (1-\delta^j)(2\rho - \rho^{j+1} - \rho^{j-1}) - (1-\rho^j)(2\delta - \delta^{j+1} - \delta^{j-1}) \right]}{4 \left[ \sigma_z^2 (1-\rho^j) + \sigma_v^2 (1-\delta^j) \right]^2 - \left[ \sigma_z^2 (2\rho - \rho^{j+1} - \rho^{j-1}) + \sigma_v^2 (2\delta - \delta^{j+1} - \delta^{j-1}) \right]^2} \tag{18}$$
It can quickly be shown that, in both cases, $\operatorname{plim}(\hat{\beta}_l)$ has the same sign as $\beta$ if $\delta < \rho$ and that $\operatorname{plim}(\hat{\beta}_l)$ has the sign of $-\beta$ if $\delta > \rho$. Furthermore, it will generally be the case that $|\operatorname{plim}(\hat{\beta}_{l,j+1})| > |\operatorname{plim}(\hat{\beta}_{l,j})|$, regardless of the relative magnitude of $\delta$ and $\rho$.13
The above results make some important predictions for differentiating between a dependent variable that responds to transitory variation and a dependent variable that responds to sustained changes. In the case of the dependent variable that responds to sustained changes, the estimates from the differences model and the estimates of the lagged effect should be of the same sign. Furthermore, both sets of estimates should increase in magnitude with longer differences. In the case of the dependent variable that responds to transitory variation in the explanatory variable, the coefficient estimates on the lagged effect should be the opposite sign of the coefficient estimates from the differences model. Furthermore, the estimates from the differences model should decrease in magnitude with the length of the differences, while the estimates of the lagged effect should increase in magnitude with the length of the differences.14
5.2. Empirical Results
The empirical results for this exercise are reported in Table 2. The first two columns report results from the AFDC expenditures regressions. The first column reports the results from the differences model described in Eq. (3). As was the case in Table 1, we see that the magnitudes of the estimates
Table 2. Comparison of Estimation Results: County AFDC versus UI Expenditures, 1969–1993.

          AFDC Expenditures                        UI Expenditures
          Differences Estimate   Lagged Effect     Differences Estimate   Lagged Effect          N
j=1       0.047 (0.006)          0.074 (0.006)     0.261 (0.018)          0.182 (0.015)     63,709
j=3       0.175 (0.013)          0.116 (0.009)     0.280 (0.020)          0.353 (0.024)     57,781
j=5       0.231 (0.015)          0.123 (0.011)     0.201 (0.019)          0.378 (0.025)     52,023
j=7       0.269 (0.018)          0.104 (0.013)     0.160 (0.020)          0.431 (0.029)     46,193
j=20      0.299 (0.040)          0.132 (0.055)     0.043 (0.037)          0.557 (0.065)      9,930

Notes: Dependent variable is the logarithm of county AFDC expenditures or UI expenditures. Table reports the coefficient on the logarithm of county earnings or one-year lag of county earnings. All regressions include controls for county population and state-year effects. Standard errors, in parentheses, are clustered at the county level. There are 3,182 counties in the United States. Missing observations are due to suppression of the AFDC expenditures and UI expenditures variables in small counties (combined suppression rate is 11.5%).
increase with the length of the differences.15 The second column reports the results from adding a single lagged effect to the differences model, as described in Eq. (16). We observe the same pattern as in column 1: the estimates increase in magnitude with the length of the differences.
The results for UI expenditures are reported in columns 3 and 4. As was the case with AFDC expenditures, the differences estimates in the third column suggest that UI expenditures decrease when economic conditions improve. In this case, however, the estimates decrease in magnitude with the length of the differences, as implied by the analytic results in Eq. (15). In column 4, the coefficient on the lagged effect is now positive, as implied by the analytic results in Eq. (16). Both of the analytical predictions are borne out in the data. When a dependent variable is responsive to shorter-term changes in the explanatory variable, the effects are largest for short rather than long-differences, and the sign of the lagged effect is the opposite of the contemporaneous effect.
The results in this section emphasize that the performance of different panel data estimators depends crucially on the type of variation in the explanatory variable that generates responses in the dependent variable.
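The sign prediction for the lagged effect can be checked numerically by solving the population normal equations of the lagged-effect regression in Eq. (16) when the sustained and transitory components each follow a stationary AR(1) process. The sketch below is an illustration under assumed parameter values; the function name, the variances, the persistence values, and the choice of a true coefficient of -1 (higher earnings lowering expenditures) are all hypothetical.

```python
import numpy as np


def lagged_model_plims(j, rho, delta, var_z, var_v, beta):
    """Population coefficients (beta_d, beta_l) of the regression in Eq. (16)."""

    def var_diff(var, a):
        # Variance of a j-period difference of a stationary AR(1) series.
        return 2.0 * var * (1.0 - a**j)

    def cov_diff(var, a):
        # Covariance between the j-period difference and its one-period lag.
        return var * (2.0 * a - a ** (j + 1) - a ** (j - 1))

    A = var_diff(var_z, rho) + var_diff(var_v, delta)
    C = cov_diff(var_z, rho) + cov_diff(var_v, delta)
    xx = np.array([[A, C], [C, A]])
    # Only the Z component of X is correlated with Y.
    xy = beta * np.array([var_diff(var_z, rho), cov_diff(var_z, rho)])
    b_d, b_l = np.linalg.solve(xx, xy)
    return float(b_d), float(b_l)


beta = -1.0  # assumed true effect
lengths = (1, 3, 5, 7)

# delta < rho: outcome responds to the persistent component.
sustained = {j: lagged_model_plims(j, 0.9, 0.2, 1.0, 1.0, beta) for j in lengths}
# delta > rho: outcome responds to the transitory component.
transitory = {j: lagged_model_plims(j, 0.2, 0.9, 1.0, 1.0, beta) for j in lengths}

for j in lengths:
    print(j, sustained[j], transitory[j])
```

Under these assumptions the lagged coefficient shares the sign of the true effect in the first case and takes the opposite sign in the second, matching the contrast between the AFDC and UI columns described above.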
6. LAGGED INTERNAL INSTRUMENTS
The results in Table 1 are consistent with attenuation bias, which suggests that even the largest estimates in Table 1 understate the magnitudes of the effects of interest. The most common correction for measurement error is to use IV estimation with an instrument that is correlated with the signal but independent of the measurement error. Unfortunately, it is difficult to find such an instrument for AFDC benefits.
Griliches and Hausman (1986) point out that in the absence of an appropriate external instrument, it is possible to use lagged values of the independent variable as instruments to correct the measurement error bias. Consider first the basic differences model from Eq. (3):
$$Y_{it} - Y_{it-j} = \beta (X_{it} - X_{it-j}) + (\varepsilon_{it} - \varepsilon_{it-j})$$
If the measurement error, $v$, is uncorrelated over time, then any value of $X$ other than $X_{it}$ and $X_{it-j}$, or any function of these values, is a valid instrument for $X_{it} - X_{it-j}$.16
This section provides evidence that these instruments have not worked well in practice because many of the theoretically valid instruments are in fact weak instruments. The lagged values of $X$ produce consistent estimates under the appropriate assumptions about the structure of the measurement error. But if the lagged values of $X$ are only mildly correlated with $X_{it} - X_{it-j}$, the IV estimator can have large bias in small samples and can be extremely imprecise (Bound, Jaeger, & Baker, 1995).
The intuition of the problem is fairly simple. If the signal is highly correlated over time, differencing two adjacent observations will leave almost no signal. Therefore, when instrumenting the first-differences, one is trying to instrument an observation that is almost entirely white noise. Instruments that are lagged several periods behind the independent variable and instruments that are differences of lagged observations will tend to be particularly weak.
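The weakness of lagged instruments can be illustrated with population correlations between candidate instruments and the differenced regressor. The sketch below assumes an AR(1) signal with persistence 0.95, white-noise measurement error, and variances of 1 and 0.1; these values, the dictionary representation of each variable as weights on lags of X, and the helper names are all hypothetical choices made for illustration.

```python
import math


def autocov(lag, rho, var_z, var_v):
    """Autocovariance of X = Z + v with AR(1) signal Z and white-noise v."""
    return var_z * rho ** abs(lag) + (var_v if lag == 0 else 0.0)


def cov(a, b, **params):
    """Covariance of two linear combinations of X, given as {lag: weight}."""
    return sum(
        wa * wb * autocov(la - lb, **params)
        for la, wa in a.items()
        for lb, wb in b.items()
    )


def corr(a, b, **params):
    return cov(a, b, **params) / math.sqrt(cov(a, a, **params) * cov(b, b, **params))


params = dict(rho=0.95, var_z=1.0, var_v=0.1)   # assumed values

first_diff = {0: 1, 1: -1}          # X_t - X_{t-1}
three_year_diff = {0: 1, 3: -1}     # X_t - X_{t-3}
lagged_level = {2: 1}               # X_{t-2}
lagged_diff = {2: 1, 3: -1}         # X_{t-2} - X_{t-3}
overlapping_diff = {1: 1, 2: -1}    # X_{t-1} - X_{t-2}

corr_level = corr(first_diff, lagged_level, **params)
corr_lagged_diff = corr(first_diff, lagged_diff, **params)
corr_overlap = corr(three_year_diff, overlapping_diff, **params)

print(corr_level, corr_lagged_diff, corr_overlap)
```

Under these assumed values the lagged difference is almost uncorrelated with the first-differenced regressor, the lagged level is only slightly better, and the overlapping difference paired with a longer difference is noticeably stronger. The overlapping instrument remains valid here only because the measurement error is assumed to be uncorrelated over time.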
In order to increase the correlation between the instrument and the explanatory variable, one might instrument long-differences rather than first-differences, and use instruments that overlap the independent variable rather than ones that lag behind it. Table 3 reports the results obtained using lagged internal instruments to estimate the relationship between AFDC and fertility. The basic differences model from Eq. (7), which includes controls for per capita earnings and year effects, is used. Lagged AFDC benefits are used as instruments for first-differences, 2-year differences, and 3-year differences. The first column
TERRA MCKINNISH

Table 3. AFDC Benefits and Fertility: Lagged Internal Instruments.

Instrument [first-stage partial F]          Coefficient on AFDC Benefit

Instrument X_{it} - X_{it-1} with:
  X_{it-2} [0.70]                           0.349 (0.823)
  X_{it-3} [0.95]                           0.272 (0.645)
  X_{it-4} [1.25]                           0.272 (0.655)
  X_{it-2} - X_{it-3} [0.41]                0.120 (0.440)
  X_{it-3} - X_{it-4} [0.42]                0.273 (0.995)
  X_{it-2} - X_{it-4} [0.02]                1.48 (12.41)

Instrument X_{it} - X_{it-2} with:
  X_{it-3} [4.31]                           0.775 (0.470)
  X_{it-4} [4.90]                           0.776 (0.445)
  X_{it-3} - X_{it-4} [0.04]                0.715 (4.40)
  X_{it-1} - X_{it-3} [135.0]               0.137 (0.064)
  X_{it-1} - X_{it-4} [105.0]               0.129 (0.060)

Instrument X_{it} - X_{it-3} with:
  X_{it-2} [0.26]                           2.49 (5.00)
  X_{it-4} [3.79]                           0.762 (0.498)
  X_{it-1} - X_{it-4} [360.8]               0.118 (0.056)
  X_{it-1} - X_{it-2} [106.4]               0.116 (0.062)
  X_{it-2} - X_{it-4} [61.3]                0.121 (0.061)

Notes: First-stage partial F-statistic is reported in square brackets. Standard errors, in parentheses, are clustered at the state level.
describes the instrument and reports the first-stage partial F-statistic on the instrument. There are three types of instruments used in Table 3: lagged differences, lagged levels, and overlapping differences. The lagged differences are conceptually the most attractive, but, as predicted, tend to be very weak instruments. Lagged levels have less theoretical justification unless one has a conceptual model that suggests that the growth rates should depend on the previous levels. Overlapping differences are the most likely to be strong instruments, but will be most vulnerable to violations of the assumption of uncorrelated errors (or in this case, correlation in the transitory component of the benefit level). Many of the instruments in Table 3 have an F-statistic less than one. Only 5 instruments can be classified as ‘‘strong’’ instruments, all of which have F-statistics larger than 40. Consistent with our expectations, none of the instruments for first-differences are strong, and all of the strong instruments overlap the explanatory variable.
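The contrast between weak lagged instruments and strong overlapping instruments can be reproduced in a small simulation. The sketch below is illustrative only and is not the chapter's estimation code. It assumes a stationary AR(1) signal observed with white-noise measurement error, omits the earnings controls, year effects, and clustered standard errors used in Table 3, and reports a simple homoskedastic first-stage F-statistic. The function names and all parameter values are hypothetical.

```python
import numpy as np


def simulate_panel(n_units=200, n_periods=25, rho=0.98, var_v=0.1, seed=0):
    """X = Z + v, with Z a stationary AR(1) signal (unit variance) and v white noise."""
    rng = np.random.default_rng(seed)
    z = np.empty((n_units, n_periods))
    z[:, 0] = rng.normal(0.0, 1.0, n_units)
    innov_sd = np.sqrt(1.0 - rho ** 2)
    for t in range(1, n_periods):
        z[:, t] = rho * z[:, t - 1] + rng.normal(0.0, innov_sd, n_units)
    v = rng.normal(0.0, np.sqrt(var_v), (n_units, n_periods))
    return z + v


def first_stage_f(endog, instrument):
    """F-statistic from regressing endog on a constant and a single instrument."""
    design = np.column_stack([np.ones_like(instrument), instrument])
    coef, *_ = np.linalg.lstsq(design, endog, rcond=None)
    resid = endog - design @ coef
    sigma2 = resid @ resid / (endog.size - 2)
    var_slope = sigma2 * np.linalg.inv(design.T @ design)[1, 1]
    return coef[1] ** 2 / var_slope


x = simulate_panel()
t = np.arange(4, x.shape[1])  # periods for which four lags are available
first_diff = (x[:, t] - x[:, t - 1]).ravel()
three_year_diff = (x[:, t] - x[:, t - 3]).ravel()

f_lagged_level = first_stage_f(first_diff, x[:, t - 2].ravel())
f_lagged_diff = first_stage_f(first_diff, (x[:, t - 2] - x[:, t - 3]).ravel())
f_overlap = first_stage_f(three_year_diff, (x[:, t - 1] - x[:, t - 4]).ravel())

print(f"lagged level: {f_lagged_level:.1f}")
print(f"lagged difference: {f_lagged_diff:.1f}")
print(f"overlapping difference: {f_overlap:.1f}")
```

The overlapping difference shares part of the persistent signal with the differenced regressor, whereas the lagged level and lagged difference are nearly uncorrelated with a first-difference that is mostly noise. The magnitudes depend entirely on the assumed persistence and noise variance.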
The second column in Table 3 reports the IV estimate of the coefficient on AFDC benefits. There is clearly tremendous variance in the coefficient estimates. If, however, attention is focused on the results in boldface, which are obtained using the strong instruments, the results are all positive and of reasonable and similar magnitude, in the range of 0.12–0.13. The results obtained with the ‘‘strong’’ instruments are very similar in magnitude to those obtained with the 7-year long-differences and fixed-effects results. On one hand, this is not surprising, as the results using the overlapping instruments are identified from the same persistent year-to-year changes that identify the fixed-effects estimates. The fixed-effects and long-differences estimates, however, still suffer from attenuation bias that the IV estimates should not if the assumption of uncorrelated measurement errors holds. This suggests that either the attenuation bias in these estimates is very slight, or that there is sufficient correlation in the transitory component of benefits to generate inconsistent estimates when using the overlapping instruments.
7. STEEL SHOCK INSTRUMENTAL VARIABLE RESULTS

This section uses the structural decline of the steel industry between 1970 and 1987 as an instrument that is correlated with long-term changes in local economic conditions.17 The steel industry analysis focuses on eight states: Alabama, California, Illinois, Indiana, Michigan, New York, Ohio, and Pennsylvania. These eight states had the largest employment in primary metals in 1970, accounting for nearly 69% of total primary metals employment in the United States. Fig. 2 graphs the fraction of total earnings attributed to primary metals manufacturing over time for both the entire United States and the eight steel-producing states used in this analysis. The graph shows that the fraction of total earnings attributed to primary metals manufacturing more than halved between 1979 and 1987. Because the steel industry is geographically concentrated, the collapse of steel manufacturing was not felt equally across all counties. In the eight-state region in 1969, 60% of counties had less than 1% of total employment in primary metals, while 8% of counties had more than 10% of employment in primary metals. Those counties with little steel employment were left relatively untouched by the decline in steel manufacturing, while those with high concentrations were very hard hit by the decline.
Fig. 2. Fraction of Earnings from Steel. [Figure: percent of earnings from steel (vertical axis, 0 to 3) by year (horizontal axis, 1970 to 1990), plotted for the USA and for the Steel Region.]
The instrument used in this analysis is the fraction of men in the county employed in primary metals manufacturing in 1969 and the interaction of that variable with the fraction of total earnings in the United States in that year attributable to the primary metals industry. The employment concentration measure, calculated from 4th County Population File C from the 1970 Census, provides a measure of each county’s vulnerability to the decline of steel. The earnings share, calculated annually from BEA data, measures changes in demand for domestic steel at the national level. This instrument interacts a measure of a county’s dependence on steel with a measure of aggregate demand for steel.18 Because of the large structural decline in the steel industry during the period under study, the instrument should be highly correlated with the long-term component of earnings, but not the transitory component. Coefficient estimates from the first-stage regressions are reported in Appendix B. Table 4 reports OLS and IV results for the steel states. The first column reports OLS results for four specifications: fixed-effects, first-differences, first-differences with two lagged effects, and 5-year long-differences.19 The first-differences coefficient is 0.131, but the sum of the contemporaneous effect and two lags is 0.441.20 As documented by a number of recent studies, the estimated effect of local economic conditions is considerably larger if lagged
Table 4. Instrumental Variables and Lagged Effects, Steel States Sample, 1970–1987.

                                                       OLS              N         IV               N
Fixed-effects                                          0.598 (0.067)    11,044    0.858 (0.277)    11,044
First-differences                                      0.131 (0.029)    10,949    0.696 (0.201)    10,949
Sum of first-differences and two lagged differences    0.441 (0.047)    9,836     0.728 (0.221)    9,836
5-year differences                                     0.544 (0.054)    8,478     0.982 (0.233)    8,478
Notes: Dependent variable is the logarithm of county AFDC expenditures. Table reports the coefficient on the logarithm of total county earnings. All regressions include controls for county population and state-year effects. Instrument is the county’s fraction of male employment in primary metals in 1969 interacted with that year’s fraction of total earnings in the United States from primary metals manufacturing. For the fixed-effects model, the employment variable is interacted with period indicators for 1974–1976, 1977–1981, and 1982–1987 rather than the fraction earnings variable. Standard errors, in parentheses, are clustered at the county level. First-stage F-statistic is 29.9 for the fixed-effects model, 36.6 for first-differences model, and 45.0 for 5-year differences model. There are 619 counties in the 8-state steel region of Alabama, California, Indiana, Illinois, Michigan, New York, Ohio, and Pennsylvania.
effects are included. The fixed-effects and 5-year differences coefficients are even larger at 0.598 and 0.544. The second column reports the IV results for all four models. The IV results are substantially larger in magnitude than the OLS results. For the first-differences model, the coefficient increases in magnitude from 0.131 to 0.696. Two lags of the instruments are used to instrument the lagged effects model. It is particularly interesting that the IV estimate obtained from the first-differences model without lags is very similar in magnitude to the IV estimate of the first-differences model with lags; the estimates are 0.696 and 0.728, respectively. This suggests an explanation for the difference between the OLS first-differences estimate and the OLS lagged effects estimate. Given contemporaneous changes in economic conditions, lagged changes in economic conditions provide a measure of the duration of the change. Therefore, in OLS, the lagged model better estimates the effect of long-term changes in conditions. When the models with and without lagged effects are instrumented with a long-term local shock, the difference between the models disappears. The IV estimates for the fixed-effects and the 5-year differences model are 0.858 and 0.982. There are two reasons that the IV results should be
greater in magnitude than the OLS results, even the fixed-effects and long-differences ones. One is that all of the OLS models suffer attenuation bias due to transitory fluctuations in economic conditions. This bias is, of course, more extreme for the first-differences model, but still exists for the fixed-effects and long-differences models. Additionally, the IV estimates are identified using variation in economic conditions resulting from changes in industrial job opportunities for low-skilled workers, which presumably should be more directly related to welfare participation than changes in general economic conditions. It is not possible, therefore, to say what fraction of the difference between the OLS and IV estimates is due to the purging from the data of the transitory variation in economic conditions.
8. CONCLUSIONS

Fixed-effects and first-differences models are extremely popular because the relationship of interest is often confounded by unobserved heterogeneity in the cross section. Unfortunately, if the independent variable is an imprecise measure of the relevant factor, coefficient estimates from these models can be severely attenuated toward zero. The time-series variation that remains after removing fixed-effects often largely reflects idiosyncratic changes in the independent variable that have little influence on the decision of interest. This is problematic anytime the outcome of interest is one that likely responds to sustained changes in conditions, such as fertility, welfare program participation, educational attainment, migration, and marital status. These findings suggest that studies using fixed-effects or first-differences models can understate the magnitude of the effect of interest. Specifically, the results suggest that when the outcome does not respond to transitory variation in the explanatory variable, first-differences will produce particularly attenuated estimates. Attenuation still exists in the fixed-effects estimates, but should be less severe. Alternatively, if the outcome of interest primarily responds to transitory fluctuations in the explanatory variable, first-differences estimates are preferred.

For researchers, comparing fixed-effects, first-differences, and long-differences models is a simple way to check for misspecification. If these estimates differ, the researcher must then consider what forms of omitted variable bias and what assumptions regarding how shocks in the explanatory variable affect the outcome are consistent with the pattern of the estimates.
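The comparison of estimators recommended above can be sketched on simulated data. The example below is an illustration under stated assumptions, not a reproduction of the chapter's analysis: the explanatory variable is the sum of a persistent AR(1) component, a white-noise transitory component, and a unit fixed effect, and the outcome responds only to the persistent component. All names and parameter values are hypothetical, and year effects and covariates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods, beta, rho = 200, 20, 1.0, 0.9

# Persistent component Z: stationary AR(1) with unit variance.
z = np.empty((n_units, n_periods))
z[:, 0] = rng.normal(size=n_units)
for t in range(1, n_periods):
    z[:, t] = rho * z[:, t - 1] + rng.normal(scale=np.sqrt(1 - rho ** 2), size=n_units)

v = rng.normal(scale=np.sqrt(0.5), size=(n_units, n_periods))  # transitory component
alpha = rng.normal(size=(n_units, 1))                          # unit fixed effect
x = z + v + alpha
y = beta * z + alpha + rng.normal(scale=0.5, size=(n_units, n_periods))


def slope(xv, yv):
    """Bivariate OLS slope after removing the overall means."""
    xv = xv - xv.mean()
    yv = yv - yv.mean()
    return (xv @ yv) / (xv @ xv)


def diff_estimate(j):
    """OLS on j-period differences, pooling all available periods."""
    return slope((x[:, j:] - x[:, :-j]).ravel(), (y[:, j:] - y[:, :-j]).ravel())


b_fe = slope((x - x.mean(axis=1, keepdims=True)).ravel(),
             (y - y.mean(axis=1, keepdims=True)).ravel())
b_fd = diff_estimate(1)
b_ld = diff_estimate(7)

print(f"fixed-effects: {b_fe:.3f}")
print(f"first-differences: {b_fd:.3f}")
print(f"7-year differences: {b_ld:.3f}")
```

With these assumed parameters all three estimates fall short of the true coefficient of 1, and the first-differences estimate is the most attenuated. A gap of this kind between the estimators is the signal that the specification check is meant to pick up.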
NOTES

1. This pattern is predicted even without all of the assumptions of the classical error model. The predictions hold even if the measurement error is correlated over time, as long as the correlation of the measurement error is smaller than the correlation of the signal. They can also hold for the case in which Z is not stationary, if v follows a stationary process. See McKinnish (1999) for details.
2. Heckman and Hotz (1989) proposed a popular test of the fixed-effects assumption in differences models, that of including a lagged value of the dependent variable in the model. If unobserved confounders have been differenced out, the coefficient on the lagged dependent variable should be zero. McKinnish (2005) shows that measurement error generates the same failure of the Heckman–Hotz test as omitted variables, so the Heckman–Hotz test cannot be used to distinguish between these forms of misspecification.
3. Clark and Strauss used an IV strategy and Rosenzweig averaged benefits across years, both of which act to ‘‘smooth’’ out transitory changes in benefits. Hoffman and Foster (2000) show that the large effects estimated by Rosenzweig are sensitive to specification.
4. Moffitt (1994) and Rosenzweig (1999) have also noted that using year-to-year variation in AFDC benefits is likely to understate the effect of interest. Keane and Wolpin (2002) estimate a structural model of welfare effects that assumes women are forward-looking and therefore distinguishes between transitory and permanent benefit changes. They, however, are not able to control for state fixed-effects in the model.
5. The results for white women of ages 15–19 and 25–29, as well as black women of ages 20–24 and 25–29, all display the same patterns shown here for white women of ages 20–24. The sole demographic group that does not conform to the expected pattern is black teens. See McKinnish (1999) for estimates.
6. Obtained from various years of The Green Book publication of the U.S. House Ways and Means Committee and provided in electronic form by Robert Moffitt.
7. Earnings are divided by the population of age 10 and older in order to avoid endogenous effects of changes in the birth rate. AFDC and earnings variables are deflated using the July CPI-U, base year 1982–1984.
8. Baker, Benjamin, and Stanger (1999) document a similar pattern in analysis of minimum wage laws. They also consider differences of varying length and find that the magnitude of the effect is larger for longer differences.
9. One concern might be that the sample is shrinking as longer differences are used. These same patterns are observed even if the sample is restricted to be the same across all models. It is, however, unclear that the restricted sample estimates are preferred. To illustrate this, consider the fact that data from 1980 to 1990 on a cross section of N observations can be used to estimate a 10-year long-difference model with sample size N or a first-differences model with sample size 10N, but both estimates will use the variation from 1980 to 1990. If the sample is restricted to be the ‘‘same’’, so that both models are estimated using a sample of size N, the first-differences model will be estimated using the change from 1989 to 1990, and the pre-1989 variation will be removed from the model.
10. Making a similar argument, Black, Daniel, and Sanders (2002) show that UI expenditures were relatively non-responsive to the large shocks to the coal economy during the 1970s and 1980s.
11. Proofs of all asymptotic results appear in Appendix A.
12. Likewise, we could state that $\delta < \rho$ and compare the case in which Y is a function of Z to the case in which Y is a function of v. It will simply be easier to present the analytical results that follow using this representation of the model.
13. See Appendix A for further details. The result concerning the relative magnitudes of estimates from the j and j+1 models was generated by grid search.
14. Klerman and Haider (2004) demonstrate that caseload models, even those including multiple lags of economic conditions, are also likely biased because they fail to take into account the fact that caseloads are the net result of flows on and off the program. They show that a simple stock-flow model of welfare caseloads indicates that any regression analysis of the stock requires a full lag structure, and interactions of those lags, that is equal in length to the longest time people spend on the welfare program. Dealing with such a complicated lag structure is beyond the scope of this chapter. One way to interpret their analysis of welfare case flows is that lagged values of economic conditions matter because they affect entry and exit to and from welfare in previous periods. Lagged economic conditions therefore affect the composition of the current welfare caseload and the extent to which the current caseload will respond to changes in economic conditions. This suggests that detailed information on caseload composition could potentially substitute for the lagged effects.
15. The small differences in estimates between the first column of Table 2 and the second column of Table 1 are due to small reductions in sample so that each estimate in a row of Table 2 is estimated using the same sample.
16. This assumes, of course, that the explanatory variables are strictly exogenous.
17. Results in Table 3 and the discussion in this section appear in similar form in Black, McKinnish, and Sanders (2003).
18. For the fixed-effects model, it was determined that the first-stage performance of the instrument was substantially better if the fraction earnings in primary metals was replaced with period indicators for 1974–1976, 1977–1981, and 1982–1987. Results are reported using this preferred instrument.
19. See Laporte and Windmeijer (2005) for a discussion of the difference between fixed-effects and first-differences estimators when the explanatory variable is a binary indicator and there are omitted lags and/or leads of the binary treatment.
20. Adding additional lags does not increase the size of the estimated effect.
ACKNOWLEDGMENTS

Generous comments by Dan Black, Seth Sanders, and Jeffrey Smith are gratefully acknowledged, as is financial support from the National Science Foundation under grant SES-0196133.
REFERENCES Argys, L., Averett, S., & Rees, D. (2000). Welfare generosity, pregnancies and abortions among unmarried recipients. Journal of Population Economics, 13, 569–594. Baker, M., Benjamin, D., & Stanger, S. (1999). The highs and lows of the minimum wage effect: A time-series cross-section study of the Canadian law. Journal of Labor Economics, 17, 318–350. Bartik, T. J., & Eberts, R. W. (1999). Examining the effect of industry trends and structure on welfare caseloads. In: S. Danziger (Ed.), Welfare reform and the economy: What will happen when a recession comes? (pp. 119–157). Kalamazoo, MI: Upjohn Institute for Employment Research. Black, D. A., Daniel, K., & Sanders, S. G. (2002). The impact of economic conditions on participation in disability programs: Evidence from the coal boom and bust. American Economic Review, 92, 27–50. Black, D. A., McKinnish, T., & Sanders, S. G. (2003). How the availability of high-wage jobs to low-skilled men affects AFDC expenditures: Evidence from shocks to the steel and coal economies. Journal of Public Economics, 87, 1919–1940. Blank, R. M. (2001). What causes public assistance caseloads to grow? Journal of Human Resources, 36, 85–118. Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90, 443–450. Clark, G., & Strauss, R. (1998). Children as income-producing assets: The case of teen illegitimacy and government transfers. Southern Economic Journal, 64, 827–856. Council of Economic Advisers (CEA). 1999. Economic expansion, welfare reform, and the decline in welfare caseloads: An update. Technical Report, Executive Office of the President, Washington, DC. Figlio, D., & Ziliak, J. (1999). Welfare reform, the business cycle and the decline in AFDC caseloads. In: S. 
Danziger (Ed.), Welfare reform and the economy: What will happen when a recession comes? (pp. 19–48). Kalamazoo, MI: Upjohn Institute for Employment Research. Fitzgerald, J. (1995). Local labor markets and local area effects on welfare duration. Journal of Applied Policy and Management, 14, 43–67. Griliches, Z., & Hausman, J. (1986). Errors in variables in panel data. Journal of Econometrics, 31, 93–118. Heckman, J., & Hotz, V. J. (1989). Choosing among alternative nonexperimental methods for estimating the impact of social programs. Journal of the American Statistical Association, 84, 862–874. Hoffman, S., & Foster, E. M. (2000). AFDC benefits and non-marital births to young women. Journal of Human Resources, 35, 376–391. Hoynes, H. (1997). Does welfare play any role in female headship decisions? Journal of Public Economics, 65, 89–117. Hoynes, H. (2000). Local labor markets and welfare spells: Do demand conditions matter? Review of Economics and Statistics, 82, 351–368. Jackson, C. & Klerman, J. (1994). Welfare, abortion and teenage fertility. Unpublished manuscript. RAND Corporation.
Keane, M., & Wolpin, K. (2002). Estimating welfare effects consistent with forward-looking behavior. Journal of Human Resources, 37, 570–622. Klerman, J., & Haider, S. (2004). A stock-flow analysis of the welfare caseload. Journal of Human Resources, 39, 865–886. LaPorte, A., & Windmeijer, F. (2005). Estimation of panel data models with binary indicators when treatment effects are not constant over time. Economics Letters, 88, 389–396. Matthews, S., Ribar, D., & Wilhelm, M. O. (1997). The effects of economic conditions and reproductive health services on state abortion and birth rates. Family Planning Perspectives, 29, 52–60. McKinnish, T. (1999). Model sensitivity in research on welfare and fertility. Ph.D. Dissertation, Carnegie Mellon University. McKinnish, T. (2005). Lagged dependent variables and specification bias. Economics Letters, 88, 55–59. Miller, R., & Sanders, S. (1997). Human capital development and welfare participation. Carnegie-Rochester Conference Series on Public Policy, 46, 1–47. Moffitt, R. (1994). Welfare effects on female headship with area effects. Journal of Human Resources, 29, 621–636. Mueser, P., Hotchkiss, J., King, C., Rokicki P. & Stevens, D. (2000). The welfare caseload, economic growth and welfare-to-work policies. Unpublished manuscript. Rosenzweig, M. (1999). Welfare, marital prospects and non-marital childbearing. Journal of Political Economy, 107, S3–S32. Wallace, G., & Blank, R. (1999). What goes up must come down? Explaining recent changes in public assistance caseloads. In: S. Danziger (Ed.), Welfare reform and the economy: What will happen when a recession comes? (pp. 49–90). Kalamazoo, MI: Upjohn Institute for Employment Research. Ziliak, J., Figlio, D., Davis, E., & Connolly, L. (2000). Accounting for the decline in AFDC caseloads. Journal of Human Resources, 35, 570–586.
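APPENDIX A

Proof of result in Eq. (15):

$$\operatorname{plim}(\hat{\beta}_d) = \frac{\operatorname{Cov}(Y_{it} - Y_{it-j},\; X_{it} - X_{it-j})}{\operatorname{Var}(X_{it} - X_{it-j})} = \beta\,\frac{\operatorname{Var}(Z_{it} - Z_{it-j})}{\operatorname{Var}(Z_{it} - Z_{it-j}) + \operatorname{Var}(v_{it} - v_{it-j})} = \beta\,\frac{\sigma_z^2 (1 - \rho^j)}{\sigma_z^2 (1 - \rho^j) + \sigma_v^2 (1 - \delta^j)}$$

$|\operatorname{plim}(\hat{\beta}_{j+1})| > |\operatorname{plim}(\hat{\beta}_j)|$ for all $j$ iff $(1 - \rho^{j+1})(1 - \delta^j) > (1 - \rho^j)(1 - \delta^{j+1})$, which is true iff $\delta < \rho$.

Proof of result in Eq. (17):

$$\operatorname{plim}(\hat{\beta}_l) = \frac{\operatorname{Var}(X_{it} - X_{it-1})\operatorname{Cov}(Y_{it} - Y_{it-1},\; X_{it-1} - X_{it-2}) - \operatorname{Cov}(Y_{it} - Y_{it-1},\; X_{it} - X_{it-1})\operatorname{Cov}(X_{it} - X_{it-1},\; X_{it-1} - X_{it-2})}{\operatorname{Var}(X_{it} - X_{it-1})\operatorname{Var}(X_{it-1} - X_{it-2}) - \operatorname{Cov}(X_{it} - X_{it-1},\; X_{it-1} - X_{it-2})^2}$$

$$= \frac{2\beta\sigma_v^2\sigma_z^2\left[(1 - \rho)(1 - \delta)^2 - (1 - \delta)(1 - \rho)^2\right]}{4\left[\sigma_z^2 (1 - \rho) + \sigma_v^2 (1 - \delta)\right]^2 - \left[\sigma_z^2 (1 - \rho)^2 + \sigma_v^2 (1 - \delta)^2\right]^2}$$

Proof of result in Eq. (18):

$$\operatorname{plim}(\hat{\beta}_l) = \frac{\operatorname{Var}(X_{it} - X_{it-j})\operatorname{Cov}(Y_{it} - Y_{it-j},\; X_{it-1} - X_{it-j-1}) - \operatorname{Cov}(Y_{it} - Y_{it-j},\; X_{it} - X_{it-j})\operatorname{Cov}(X_{it} - X_{it-j},\; X_{it-1} - X_{it-j-1})}{\operatorname{Var}(X_{it} - X_{it-j})\operatorname{Var}(X_{it-1} - X_{it-j-1}) - \operatorname{Cov}(X_{it} - X_{it-j},\; X_{it-1} - X_{it-j-1})^2}$$

$$= \frac{2\beta\sigma_v^2\sigma_z^2\left[(1 - \delta^j)(2\rho - \rho^{j+1} - \rho^{j-1}) - (1 - \rho^j)(2\delta - \delta^{j+1} - \delta^{j-1})\right]}{4\left[\sigma_z^2 (1 - \rho^j) + \sigma_v^2 (1 - \delta^j)\right]^2 - \left[\sigma_z^2 (2\rho - \rho^{j+1} - \rho^{j-1}) + \sigma_v^2 (2\delta - \delta^{j+1} - \delta^{j-1})\right]^2}$$

For both results, $\operatorname{plim}(\hat{\beta}_l)$ is the sign of $\beta$ iff $(1 - \rho)(1 - \delta)^2 - (1 - \delta)(1 - \rho)^2 > 0$, which is true iff $\delta < \rho$. The property $|\operatorname{plim}(\hat{\beta}_{l,j+1})| > |\operatorname{plim}(\hat{\beta}_{l,j})|$ was investigated by grid search. A large grid search failed to turn up any violations of the property for cases in which $j \geq 2$. Violations at $j = 1$ tended to occur when either $\rho$ or $\sigma_m^2 / \sigma_y^2$ was high.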
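A grid search of the kind described above can be sketched directly from the closed-form expression for Eq. (18). The code below is an assumption-laden illustration, not the search actually used for the chapter: the function name, the grid of correlations, and the noise-to-signal ratios are all hypothetical, and the coefficient is expressed as a ratio to beta.

```python
import itertools


def lag_coef_ratio(j, rho, delta, var_z=1.0, var_v=1.0):
    """plim(beta_hat_l) / beta from the Eq. (18) expression in Appendix A."""
    r_term = 2 * rho - rho ** (j + 1) - rho ** (j - 1)
    d_term = 2 * delta - delta ** (j + 1) - delta ** (j - 1)
    numerator = 2 * var_v * var_z * (
        (1 - delta ** j) * r_term - (1 - rho ** j) * d_term
    )
    denominator = (
        4 * (var_z * (1 - rho ** j) + var_v * (1 - delta ** j)) ** 2
        - (var_z * r_term + var_v * d_term) ** 2
    )
    return numerator / denominator


# Hypothetical grid: correlations in steps of 0.05 and a few noise-to-signal ratios.
corrs = [k / 20 for k in range(1, 20)]
noise_ratios = [0.1, 0.5, 1.0, 2.0, 5.0]
violations = {}  # differencing length j -> number of grid points violating the property
for j, rho, delta, ratio in itertools.product(range(1, 10), corrs, corrs, noise_ratios):
    if delta >= rho:
        continue  # restrict attention to delta < rho
    shorter = abs(lag_coef_ratio(j, rho, delta, var_v=ratio))
    longer = abs(lag_coef_ratio(j + 1, rho, delta, var_v=ratio))
    if longer <= shorter:
        violations[j] = violations.get(j, 0) + 1

print(violations)
```

Setting j = 1 in the function reproduces the Eq. (17) expression, which gives a quick consistency check on the algebra. The violation counts depend on the grid chosen here and are not claimed to match the search reported in the text.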
APPENDIX B

This appendix reports first-stage estimates for the IV analysis in Table 4. For the first-differences and long-differences models, the instruments are the fraction of men in the county employed in primary metals manufacturing in 1969 and the interaction of that variable with the fraction of total earnings in the United States in that year attributable to the primary metals industry. For the fixed-effects models, it was found that the instruments performed better in the first stage if the fraction of employment in primary metals was interacted with period indicators for 1974–1976, 1977–1981, and 1982–1987, instead of with the fraction of earnings from primary metals. The main effect of the fraction employment variable is absorbed into the county fixed-effects.
Table B1. First-stage Estimates from Steel-state IV Analysis.

                                      First-Differences    5-Year Differences
Fraction steel                        0.379 (0.049)        1.76 (0.185)
Fraction steel × Fraction earnings    19.29 (2.71)         99.89 (10.79)
Partial F-statistic                   36.6                 45.0
N                                     10,949               8,478

                                      Fixed-effects
Fraction Steel (1974–1976)            0.235 (0.055)
Fraction Steel (1977–1981)            0.420 (4.20)
Fraction Steel (1982–1987)            0.175 (0.102)
Partial F-statistic                   29.9
N                                     11,044
AN EMPIRICAL ASSESSMENT OF THE EFFECTS OF PARENTHOOD ON WAGES

Marianne Simonsen and Lars Skipper

ABSTRACT

In this chapter, we characterise the selection into parenthood for men and women separately and estimate the effects of motherhood and fatherhood on wages. We apply propensity score matching exploiting an extensive high-quality register-based data set augmented with family background information. We estimate the net effects of parenthood and find that mothers receive 7.4% lower average wages compared to non-mothers, whereas fathers gain 6.0% in terms of average wages from fatherhood.
1. INTRODUCTION

This chapter investigates the effects of parenthood on wages. Evaluating the effect of motherhood on wages, the family gap, is a classical problem and studies typically find that having children is costly for women in terms of wages (see, e.g. Budig & England, 2001; Millimet, 2000; Phipps, Burton, & Lethbridge, 2001; Simonsen & Skipper, 2006; Waldfogel, 1998a, 1998b). This result is often attributed to mothers’ different investments in household
Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 359–380
Copyright © 2008 by Elsevier Ltd.
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00012-6
MARIANNE SIMONSEN AND LARS SKIPPER
production resulting in, for example, career interruptions, different preferences for working conditions such as unplanned overtime, potentially different bargaining power because of stronger geographic ties due to costs of moving children, or possible discrimination. The effects of fatherhood on wages, on the other hand, are less well known, yet studies suggest that fathers, more often than mothers, are likely to observe an increase in earnings from parenthood (Browning, 1992; Pencavel, 1986; Millimet, 2000). Again, this is likely caused by fathers’ different labour-market investments. Pencavel (1986), for example, shows ordinary least squares (OLS) results for the US, indicating that fathers work more hours than non-fathers. Clearly, estimates for both mothers and fathers are of crucial importance for understanding individual costs and gains related to childbearing.

Uncovering the effects of parenthood on wages is not a simple task; parenthood is potentially endogenous to labour-market outcomes, but it has proven difficult to find plausible exogenous variation. Further, in the light of the recent literature on heterogeneous effects (see, e.g. Heckman, LaLonde, & Smith, 1999), the task of finding instruments identifying meaningful local average treatment effects (see Imbens & Angrist, 1994) is even more cumbersome. In addition, recent studies in economics have acknowledged that treatments are not always a matter of either/or (see Black & Smith, 2004). Often, effects of treatments work through several economic channels. That is, the treatment may have a direct effect on a potential outcome, but may also have indirect effects working through concomitant variables. Uncovering direct effects in the presence of the latter variables is far from trivial (see Rosenbaum, 1984).
As opposed to most of the existing literature, we directly address the point that wage costs of parenthood are potentially heterogeneous by implementing propensity score matching using an extensive high-quality register-based data set augmented with parental background information. We estimate the net effects of parenthood; we do not attempt to clear out the effect of variables such as experience and type of occupation because they are highly likely to be affected by parenthood (see Simonsen & Skipper, 2006). The estimated parameters are, therefore, informative about the wage costs and gains of parenthood including those accruing from individual choices caused by childbearing. Clearly, the identifying assumptions underlying propensity score matching rule out selection on unobservables; that is, they impose unconfoundedness. Realising this, we investigate the sensitivity of our results to this assumption by implementing a sensitivity analysis following Ichino, Mealli, and Nannicini (forthcoming). Our empirical analyses provide important new insights into the processes governing the selection into motherhood versus fatherhood. Importantly,
Empirical Assessment of the Effects of Parenthood on Wages
educational attainment information about the parents of the individuals in our estimation sample seems to be important for the selection into both motherhood and fatherhood. In particular, maternal educational attainment is found to affect both daughters’ and sons’ fertility decisions. Furthermore, we add to the scarce literature on wage effects of fatherhood, which we contrast with the results of motherhood on wages. Our results confirm the established pattern in the literature: we find that mothers receive around 7.4% lower wages compared to non-mothers, while fathers gain around 6.0% in terms of wages compared to non-fathers. For men, the difference in wages between a parent (a father) and a non-parent (a non-father) seems to vary with the propensity of fatherhood. The size of the estimated effects, however, is small in an international comparison.

The chapter is organised as follows. In Section 2, we discuss our parameters of interest. In Section 3, we present our data and discuss the institutional settings. Section 4 presents the results from propensity score matching, while Section 5 discusses the results from sensitivity analysis. Section 6 presents the final discussion.
2. THE PARAMETERS OF INTEREST

The goal of the evaluation is to measure the effect or impact of a given treatment, C, on an outcome variable, YC. Here, the treatment is having children, C=1, as opposed to the ‘untreated’ state of ‘non-parenthood’, C=0, and the outcome of interest is log hourly wages. We limit our analysis to the population of employed individuals. Our estimated parameters, therefore, are not valid for the entire population, which includes the group of non-participants.

Let Y1 be the potential outcome in the presence of children and Y0 be the potential outcome in the absence of children. We are now faced with what is known as the Fundamental Evaluation Problem: we do not observe the same individual both with and without children at the same point in time. Because it is not possible to identify person-specific impacts, attention in the literature usually shifts to constructing (conditional) means. Here, we estimate the mean effect of treatment on the treated for women, M=0, and men, M=1, defined as

\[ \theta_{women} \equiv E[Y_1 - Y_0 \mid C=1, M=0] = E[Y_1 \mid C=1, M=0] - E[Y_0 \mid C=1, M=0] \qquad (1) \]
MARIANNE SIMONSEN AND LARS SKIPPER
\[ \theta_{men} \equiv E[Y_1 - Y_0 \mid C=1, M=1] = E[Y_1 \mid C=1, M=1] - E[Y_0 \mid C=1, M=1] \qquad (2) \]
Hence, the problem turns to that of finding the counterfactuals E[Y0|C=1, M=0] and E[Y0|C=1, M=1] in Eqs. (1) and (2), which are, of course, unobserved; some assumptions are therefore needed to obtain identification. We estimate the mean effect of having children for those (employed) women who have children (Eq. (1)), and not the mean effect for the population of women. With heterogeneous impacts these parameters will differ.1 Similarly, we focus on the mean effect of having children for the group of employed fathers (Eq. (2)). Our focus is thus on whether women and men who choose to have children are punished or gain in terms of wages, not on whether individuals who do not have children would be punished or gain had they chosen to have children.

In this chapter, we apply the method of matching, which has recently received much attention in applied econometrics in general and in programme evaluation in particular. Matching is based on the assumption that conditioning on attributes, X, eliminates the selective differences between those with and without children. More precisely, the method of matching assumes that the econometrician has access to conditioning variables sufficiently rich that the counterfactual outcome distribution of those having children is the same as the observed outcome distribution of those without. By conditioning on the covariates at our disposal, we will thus be capable of balancing the bias coming from the self-selection into parenthood. Using Eqs. (1) and (2), we make the following conditional independence assumption (CIA; Lechner, 1999):2

\[ Y_0 \perp C \mid X, M=j, \quad j \in \{0,1\} \qquad (3) \]

In other words, we assume that the underlying distribution of the outcome for an individual had he or she not had children is the same as that of a non-parent with the same observed characteristics. In particular, this means that individuals must not take into account wages in the non-parenthood state when deciding whether to become a parent or not.
They may, however, consider wages in the parenthood state. This is consistent with the case where parents and non-parents are equally productive in the nonparenthood state conditional on their characteristics, but are potentially different in the parenthood state. In order to be able to utilise Eq. (3), it is necessary to make sure that there is an individual-without-a-child analogue
Empirical Assessment of the Effects of Parenthood on Wages
to each parent, the ‘common support’ assumption, i.e.

\[ P \equiv \Pr(C=1 \mid X, M=j) < 1, \quad j \in \{0,1\} \qquad (4) \]
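To make explicit how Eqs. (3) and (4) deliver the missing counterfactual, the following worked step may help. It is the standard identification argument written in the chapter's notation, not a result specific to this chapter, and it uses only the mean-independence implication of Eq. (3).

```latex
\begin{align*}
E[Y_0 \mid C=1, M=j]
  &= E\big[\, E[Y_0 \mid C=1, X, M=j] \,\big|\, C=1, M=j \big]
     && \text{(iterated expectations)} \\
  &= E\big[\, E[Y_0 \mid C=0, X, M=j] \,\big|\, C=1, M=j \big]
     && \text{(Eq. (3))} \\
  &= E\big[\, E[Y \mid C=0, X, M=j] \,\big|\, C=1, M=j \big]
     && \text{($Y = Y_0$ when $C=0$)}
\end{align*}
```

The outer expectation averages over the distribution of X among parents, and Eq. (4) ensures that the inner expectation is defined for every X observed among parents.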
Unlike much of the literature, we do not want to assume any functional form for the outcome equation. We are, therefore, potentially faced with the non-parametric curse of dimensionality. A way to circumvent the curse of dimensionality without imposing arbitrary assumptions on the outcome equation is based on the results in Rosenbaum and Rubin (1983). Here, the focus is shifted from the set of covariates to the probability of parenthood, P=Pr(C=1|X). As long as Eqs. (3) and (4) hold,

\[ Y_0 \perp C \mid P, M=j, \quad j \in \{0,1\} \qquad (5) \]

This new conditioning variable, P, changes the CIA into Eq. (5), which together with P<1 is sufficient to justify propensity score matching to estimate the mean impact on the treated (see Heckman, Ichimura, & Todd, 1998b). Clearly, the functional form of P is rarely known and has to be estimated, shifting the high-dimensional estimation problem from that of estimating E[Y|X] to that of estimating E[C|X], often estimated by logit or, as in this chapter, a probit (see the discussion on this issue in Black & Smith, 2004). Moreover, as will become apparent, the adoption of a one-dimensional specification of selection clearly illuminates both the common support considerations and the differences in distributions of covariates that would not be addressed by standard OLS.
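As an illustration of the first step, the sketch below fits a probit for Pr(C=1|x) with a single covariate by plain gradient ascent and then computes two fit summaries of the kind reported later in the chapter (Efron's R² and the share of wrong predictions). Everything here is an assumption made for illustration: the simulated data, the single covariate, the optimiser, and the 0.5 classification cutoff are not taken from the chapter, whose probit conditions on a much richer covariate set.

```python
import math
import random


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fit_probit(x, c, steps=500, lr=0.5):
    """Fit Pr(C=1 | x) = Phi(a + b*x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, ci in zip(x, c):
            z = a + b * xi
            p = min(max(norm_cdf(z), 1e-10), 1.0 - 1e-10)
            # Probit score contribution: (c - p) * phi(z) / (p * (1 - p)).
            w = (ci - p) * norm_pdf(z) / (p * (1.0 - p))
            grad_a += w
            grad_b += w * xi
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def efron_r2(c, p):
    """Efron's R^2: 1 - sum (c - p)^2 / sum (c - mean(c))^2."""
    c_bar = sum(c) / len(c)
    ss_res = sum((ci - pi) ** 2 for ci, pi in zip(c, p))
    ss_tot = sum((ci - c_bar) ** 2 for ci in c)
    return 1.0 - ss_res / ss_tot


def share_wrong(c, p, cutoff=0.5):
    """Share of observations misclassified at an assumed cutoff of 0.5."""
    wrong = sum(1 for ci, pi in zip(c, p) if (pi >= cutoff) != (ci == 1))
    return wrong / len(c)


# Hypothetical data: one standardised covariate, true index -0.3 + 1.0 * x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(400)]
c = [1 if -0.3 + 1.0 * xi + rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]

a_hat, b_hat = fit_probit(x, c)
pscore = [norm_cdf(a_hat + b_hat * xi) for xi in x]
print(a_hat, b_hat, efron_r2(c, pscore), share_wrong(c, pscore))
```

In applied work one would normally rely on an established statistics package rather than a hand-written optimiser; the point of the sketch is only that the fitted values `pscore` are the one-dimensional quantity on which matching is subsequently carried out.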
3. DATA AND INSTITUTIONAL SETTINGS

The original data set contains information on a representative sample of 5% of all Danish individuals in the 15–74 age bracket. Information stems from several registers, all maintained by Statistics Denmark. The registers include variables describing income, demographics, and education on a yearly basis. Importantly, in addition to this set of individual level information, we exploit rich information about parental educational attainment, i.e. the level of education of the parents of the women and men in our sample. In the empirical analysis below, we use a 1997 cross-sectional sub-sample of women and men aged 20–40, who are employed for more than 200 h/year, are not self-employed, and are not undertaking education. The lower age bound is chosen to exclude individuals who are in the state between two types of education, for instance high school and university. The upper age
bound is chosen because of an age restriction on the availability of parental background information used in the econometric analyses. Among women in the relevant age group, 72% were employed more than 200 h in 1997, whereas the share was 79% for men aged 20–40.

Table 1 shows descriptive statistics on selected variables for the sample used in our analyses along with descriptive statistics of mothers, non-mothers, fathers, and non-fathers. We classify women as mothers if they have given birth to a child in 1996 or earlier. Thus, it is assumed that the presence of biological children is more important than the presence of stepchildren. This may, of course, be a problem if children in the household other than biological children of the woman affect her choices and actions. Most often, though, children from separated homes live with their mother.3 In our sample, 54.8% of the women have children in 1996. For men, due to data limitations, we are forced to define fatherhood based on information about children living at the same address as the unit of observation. The presence of stepchildren, therefore, will affect our measure of fatherhood status. Similarly, if a couple is divorced and the children are officially living with their mother, a father may be misclassified as a non-father. We do not, however, expect this to cause large measurement problems precisely because custody in the vast majority of cases is granted to the mother in case of divorce. Of course, this should be kept in mind when interpreting our estimated parameters.

The outcome variable of interest in the analysis is log hourly wage, which is calculated from annual earnings and number of working hours. The measure of working hours used in this calculation is very precise in that the hours information comes from registers on compulsory contributions to supplemental pension payments that are closely linked to the working hours actually paid for by employers.

It is seen that average log wage for mothers does not differ from average log wage for non-mothers. Yet it is clear from Table 1 that mothers differ significantly from non-mothers in terms of observables: Mothers are on average seven years older, are more often employed in the public sector, and are more likely to have an education directed towards the healthcare sector or the schooling system. Furthermore, they are more often settled in the province and have grown up in families with lower levels of human capital. Fathers’ wages are slightly higher than non-fathers’ and their observable characteristics differ as well: Fathers are almost six years older than non-fathers, have more years of education, are more likely to have a specialised type of further education (less likely to have a general type), and are more
Table 1. Selected Variables, Descriptive Statistics.

                                              Women                                       Men
Variables^a                                   All           Mothers       Non-mothers   All           Fathers       Non-fathers
Log wages                                     4.80 (0.28)   4.81 (0.26)   4.79 (0.30)   4.93 (0.34)   5.02 (0.32)   4.87 (0.34)
Age (years)                                   29.72 (5.80)  32.87 (4.20)  25.88 (5.13)  30.16 (5.84)  33.65 (4.33)  27.83 (5.55)
Length of completed education (years)         12.23 (2.45)  12.20 (2.42)  12.26 (2.48)  11.90 (2.41)  12.16 (2.49)  11.72 (2.33)

Type of highest completed education:
  General (0/1)                               0.22          0.19          0.26          0.22          0.19          0.24
  Business (0/1)                              0.34          0.33          0.35          0.16          0.14          0.18
  Industry (0/1)                              0.01          0.01          0.01          0.16          0.18          0.15
  Construction (0/1)                          0.01          0.01          0.01          0.10          0.12          0.09
  Graphical (0/1)                             0.01          0.01          0.01          0.01          0.01          0.01
  Services (0/1)                              0.02          0.02          0.02          0.00          0.00          0.00
  Food and beverages (0/1)                    0.04          0.04          0.03          0.03          0.03          0.03
  Agricultural (0/1)                          0.01          0.01          0.01          0.03          0.03          0.03
  Transportation (0/1)                        0.00          0.00          0.00          0.02          0.01          0.02
  Health (0/1)                                0.12          0.15          0.07          0.01          0.02          0.01
  Pedagogic (0/1)                             0.06          0.09          0.04          0.02          0.03          0.01
  Humanistic (0/1)                            0.03          0.03          0.03          0.01          0.01          0.01
  Musical (0/1)                               0.00          0.00          0.00          0.00          0.00          0.00
  Social (0/1)                                0.04          0.03          0.05          0.05          0.05          0.05
  Technical (0/1)                             0.02          0.01          0.02          0.06          0.08          0.05
  Public security (0/1)                       0.00          0.00          0.00          0.01          0.02          0.01
  Unknown (0/1)                               0.08          0.07          0.09          0.10          0.08          0.12

Province (0/1)                                0.63          0.68          0.57          0.65          0.70          0.62
Private sector (0/1)                          0.52          0.46          0.60          0.83          0.82          0.84

Parental background information
Highest level of completed education, mother’s side:
  Basic schooling (0/1)                       0.64          0.73          0.53          0.65          0.74          0.59
  High school (0/1)                           0.01          0.01          0.01          0.01          0.00          0.01
  Vocational training (0/1)                   0.25          0.20          0.31          0.24          0.19          0.27
  Short further education (0/1)               0.03          0.02          0.05          0.03          0.02          0.04
  Medium length further education (0/1)       0.06          0.04          0.08          0.06          0.04          0.07
  Long further education (0/1)                0.01          0.01          0.01          0.01          0.00          0.01
Highest level of completed education, father’s side:
  Basic schooling (0/1)                       0.60          0.65          0.53          0.60          0.65          0.56
  High school (0/1)                           0.00          0.00          0.01          0.00          0.00          0.00
  Vocational training (0/1)                   0.29          0.26          0.32          0.29          0.26          0.30
  Short further education (0/1)               0.03          0.02          0.04          0.03          0.03          0.03
  Medium length further education (0/1)       0.05          0.04          0.07          0.05          0.03          0.06
  Long further education (0/1)                0.03          0.02          0.04          0.03          0.03          0.04

No. of observations                           29,006        15,958        13,048        35,060        13,993        21,067

a Dep. Variables: 1 for Mothers, 0 for Non-mothers, 1 for Fathers, 0 for Non-fathers.
likely to be settled in the province. As is the case with mothers, fathers in our sample also stem from families with relatively low levels of human capital.

To understand men’s and women’s fertility choices and thus the estimated parameters, it is important to know not only the available data, but also the institutional settings regarding parental leave, family-friendly policies, and childcare. The empirical analysis is conducted within a highly family-friendly regime, especially aimed at and utilised by women. The Danish legislation offers generous parental (particularly maternal) leave in connection with childbirth, both with regard to length and compensation. In the period before 1984, mothers were eligible for 18 weeks of job-protected maternity leave (four weeks before expected birth and 14 weeks after). In 1984, however, the maternity leave scheme was extended with an additional 10 weeks of leave amounting to a maximum of 28 weeks, and fathers were granted 2 weeks of leave during the first 14 weeks after the birth. The degree of compensation while on maternity leave varies with sector of employment and union membership, but with a legally ensured lower bound on benefits received amounting to 100% unemployment insurance benefits or approximately €1,650 per month in 1997. This is the scheme effective in 1997. The 10 additional weeks can, in principle, be shared with the father, yet this is not the norm. In fact, in 1997, 96.6% of all fathers on leave took two weeks or less and the mode is two weeks (cf. http://www.statikbanken.dk).

Furthermore, in 1994, publicly subsidised sabbatical leave and childcare leave were introduced. For employed individuals, childcare leave amounts to a maximum of 52 weeks per child under the age of eight, while sabbatical leave amounts to a maximum of 52 weeks. Parents receive 60%
unemployment insurance, €1,000 per month in 2001, while on childcare leave. As is the case with parental leave in connection with childbirth, women are over-represented among individuals taking childcare leave: in 1997, 93% of all individuals (22,229) on childcare leave are women (20,636).

The public sector, where mothers typically work, is characterised by being more family friendly than the private sector. Specifically, individuals employed in the public sector receive full wage compensation during maternity and paternity leave as opposed to the lower benefits received by private sector employees. In addition, public sector employees are allowed to take leave earlier compared to private sector employees and have the right to 10 fully funded care days per child. Finally, apart from the existence of a smaller qualification bonus, wages in the public sector are mechanically determined by seniority. Therefore, individuals in the public sector do not risk losing experience-based increases in wages when taking maternity or paternity leave. This is particularly important for women; men, on the other hand, who only take short leave in connection with childbirth are less likely to be affected by their leave decision.

Day care is, for the major part, publicly provided and organised within the municipalities. Day care is highly subsidised by the municipalities, resulting in an average price of €200 per month. All children are eligible for publicly provided childcare, including children born to unemployed parents; the only exception occurs if one of the parents takes formal publicly supported maternity or childcare leave. Municipalities provide nursery centres, family day care, and kindergartens. In general, opening hours during week days are between 6.30 am and 5.15 pm, in principle allowing both parents to work full time.
4. PROPENSITY SCORE MATCHING

4.1. Results for Women

We first estimate the probability of being a mother. We model this propensity score by a standard probit,4 conditioning on a set of variables thought to influence both the fertility decision and wage outcomes. This set includes age, type of education (e.g. healthcare, services, technical education), length of education given the type, place of habitation (nine regional dummies), and parental educational attainment. The results from the probit are presented in Table 2.
Table 2. Coefficient Estimates and Asymptotic Standard Errors, Motherhood and Fatherhood Probits.

Columns: Variable^a; Motherhood (Coeff., Std. Err.); Fatherhood (Coeff., Std. Err.)
12.540 0.791 0.011
0.364 0.022 0.000
11.714 0.618 0.033
0.493 0.019 0.001
Type of highest completed education: General (0/1) Business (0/1) Industry (0/1) Construction (0/1) Graphical (0/1) Services (0/1) Food and beverages (0/1) Agricultural (0/1) Pedagogic (0/1) Humanistic (0/1) Musical (0/1) Social (0/1) Technical (0/1) Public security (0/1) Transport (0/1) Unknown (0/1)
0.341 1.713 1.113 0.556 1.349 0.290 0.583 0.504 0.292 1.344 0.634 2.983 1.033 – – 0.979
0.229 0.228 1.042 1.260 1.312 0.571 0.401 0.468 0.509 0.543 1.249 0.561 0.632 – – 0.234
0.400 0.726 0.787 0.155 0.101 1.183 1.853 1.012 1.372 0.023 1.738 0.255 0.985 0.417 0.199 0.008
0.411 0.426 0.440 0.471 0.620 1.253 3.066 0.798 0.633 0.901 1.176 1.296 0.575 0.499 0.518 0.000
Length of highest completed education: General (0/1) length Business (0/1) length Industry (0/1) length Construction (0/1) length Graphical (0/1) length Services (0/1) length Food and beverages (0/1) length Agricultural (0/1) length Pedagogic (0/1) length Humanistic (0/1) length Musical (0/1) length Social (0/1) length Technical (0/1) length Public security (0/1) Transport (0/1) Unknown (0/1) length
0.097 0.071 0.115 0.128 0.063 0.088 0.055 0.025 0.014 0.028 0.056 0.030 0.153 0.033 – – 0.036
0.013 0.018 0.017 0.088 0.106 0.111 0.049 0.032 0.034 0.034 0.034 0.078 0.034 0.040 – – 0.018
0.006 0.019 0.062 0.085 0.013 0.017 0.108 0.184 0.112 0.111 0.024 0.097 0.005 0.066 0.010 0.037 1.005
0.030 0.031 0.033 0.034 0.037 0.050 0.105 0.258 0.065 0.042 0.060 0.071 0.081 0.039 0.035 0.039 0.845
Intercept Age Age squared
Table 2. (Continued)

                                              Motherhood               Fatherhood
Variable^a                                    Coeff.      Std. Err.    Coeff.      Std. Err.
Parental background information
Highest level of completed education, mother’s side:
  High school (0/1)                           0.060       0.119        0.130       0.109
  Vocational training (0/1)                   0.049       0.024        0.050       0.021
  Short further education (0/1)               0.153       0.055        0.129       0.052
  Medium length further education (0/1)       0.212       0.043        0.086       0.038
  Long further education (0/1)                0.179       0.106        0.047       0.102
Highest level of completed education, father’s side:
  High school (0/1)                           0.142       0.154        0.006       0.147
  Vocational training (0/1)                   0.042       0.025        0.038       0.021
  Short further education (0/1)               0.056       0.058        0.012       0.049
  Medium length further education (0/1)       0.030       0.047        0.140       0.043
  Long further education (0/1)                0.006       0.058        0.064       0.052
Note: Bold coefficients significant at the 5% level. Nine regional dummies and interaction terms between age and education dummies are included, reference age group 28–30, educational group health, parental educational group parental schooling. a Dep. Variables: 1 for Mothers, 0 for Non-mothers, 1 for Fathers, 0 for Non-fathers.
We find that age significantly increases the probability of being a mother in 1996, at a decreasing rate. Furthermore, relative to healthcare-oriented types of education, most types of education decrease the probability of being a mother for a given length of education, though some coefficients are insignificant. Note that some may be insignificant due to a small number of individuals in a particular category (see Table 1). The education variables presumably capture some effects of preferences for having children. The analysis also shows that a higher level of education of a given type reduces the probability of motherhood. The choice of education is in the vast majority of cases predetermined relative to the decision of motherhood, but we acknowledge that unobserved factors may affect the choice of both education and fertility. As long as the choice to become a mother is conditionally independent of the outcome in the non-motherhood state, this is not a problem. The inclusion of the set of regional dummies is meant to account for local labour-market conditions and region-specific fertility preferences.5 Finally, a large literature (see, e.g. Behrman & Rosenzweig, 2002; Black, Devereux, & Salvanes, 2005; Plug, 2004) suggests inter-generational
spillover effects of human capital.6 Thus, wage outcomes for the women in our sample are likely to be directly affected by their parents’ educational attainment. Given that the level of education of the woman under observation affects her fertility choice, we also expect parental educational attainment to have a direct effect on this decision. Furthermore, one may argue that parental educational attainment is likely to proxy unobserved variables affecting both fertility choice and wage outcomes. Importantly, parental educational attainment may reflect preferences for leisure (including preferences for spending time with children), which presumably shape the preferences of the women in our estimation sample. Given this, including parental educational attainment in the conditioning set greatly strengthens the validity of our conditional independence assumption (Eq. (3)). We find that the mother’s level of education has a significant effect on the daughter’s fertility choice (at a given age). Specifically, we observe that a higher level of completed education of the mother decreases the probability that the daughter is herself a mother. Father’s level of education, however, does not seem to influence the fertility choice. Of course, this does not necessarily mean that father’s level of education does not matter; the result may be caused by assortative matching in the marriage market.

The probit model predicts relatively well: 5,710 out of 29,006 predictions or 19.7% of all predictions are wrong7 and Efron’s R² equals 0.36. Following Lechner (2002b), we depict both a non-parametric regression of outcomes on the predicted propensity of motherhood and the smoothed densities of the propensity scores for both mothers and non-mothers. This is given in Fig. 1. Firstly, it is seen that mothers have more probability mass concentrated around high values of the propensity score compared to non-mothers who have more mass concentrated around low values of the propensity score.
Hence, mothers are likely to differ significantly from non-mothers in terms of observables meaning that there is a potential gain from matching. Note that the densities seem to have common support: It is possible to find a match for a mother among non-mothers even for mothers with the highest level of the propensity score. Thus, the model does not predict too well. Secondly, we see that mothers have lower expected outcomes for all values of the predicted propensity score. In other words, the sign of the effect of motherhood on wages for the group of mothers seems to be unambiguously negative. For the group of non-mothers, we now use leave-one-out cross validation to select the optimal bandwidth to be used in the subsequent kernel matching analysis (see Black & Smith, 2004). The difficult part in selecting
Fig. 1. Non-Parametric Regression of Log Wages on Predicted Propensity of Motherhood and Non-Parametric Estimates of Densities of Motherhood. Bandwidths Using Silverman (1986) Rule of Thumb. Upper panel: log wage against propensity score; lower panel: density against propensity score. Series: mothers, non-mothers.
the optimal bandwidth is that it is not only kernel specific but also data specific. The intuitively appealing idea behind cross validation is to select the bandwidth that best predicts an outcome of a stochastic variable with a given data set minimising some loss function, here the root mean squared error (RMSE). For the group of non-mothers, we observe their log wages in the non-motherhood state. Leaving one non-mother out at a time, we can then estimate the difference between the observed outcome and a predicted outcome from the remaining observations. For five standard kernels,8 we find the minimum RMSE using both kernel and local linear regression matching algorithms.9,10 With the preferred kernel and corresponding optimal bandwidth, hopt, we then estimate the densities of the propensities for mothers and non-mothers separately. To avoid ‘poor’ matches, we trim away observations in areas of ‘thin’ support of the estimated densities using a trimming level of 2% following Heckman, Ichimura, Smith, and Todd (1998a). Specifically, we choose a lowest acceptable value of the density, dlow, common to both the density for mothers and non-mothers. We then exclude the observations below dlow. A trimming level of 2% means that we choose dlow such that 2% of the observations are excluded. With the remaining observations, we now calculate the average treatment effect of the treated using bootstrap techniques (99 bootstraps) to estimate the standard error. The uncertainty of the first step estimation of the probit is not taken into account when estimating the mean effects. Incorporating all steps of the estimation in the bootstrap procedure, including re-estimating the propensity along with the corresponding densities, proved to be too expensive computationally. Due to our large sample, we do not expect the uncertainty from the first estimation to inflate the variance to any degree (see also Lechner, 2002a on this issue when using nearest-neighbour matching). 
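The bandwidth selection and matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only a Gaussian kernel with Nadaraya-Watson (kernel) regression, a small hypothetical bandwidth grid, and noiseless toy data, and it omits the local linear variant, the trimming step, and the bootstrap.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_regression(p0, p_ctrl, y_ctrl, h, skip=None):
    """Nadaraya-Watson estimate of E[Y0 | P = p0] from comparison observations."""
    num = den = 0.0
    for k, (pk, yk) in enumerate(zip(p_ctrl, y_ctrl)):
        if k == skip:  # leave observation k out of its own prediction
            continue
        w = gaussian_kernel((p0 - pk) / h)
        num += w * yk
        den += w
    return num / den if den > 0.0 else float("nan")


def loo_rmse(p_ctrl, y_ctrl, h):
    """Leave-one-out RMSE of predicting each comparison outcome from the others."""
    total = 0.0
    for k in range(len(p_ctrl)):
        pred = kernel_regression(p_ctrl[k], p_ctrl, y_ctrl, h, skip=k)
        if math.isnan(pred):  # no neighbour received positive weight
            return float("inf")
        total += (y_ctrl[k] - pred) ** 2
    return math.sqrt(total / len(p_ctrl))


def choose_bandwidth(p_ctrl, y_ctrl, grid):
    """Pick the grid value with the smallest leave-one-out RMSE."""
    return min(grid, key=lambda h: loo_rmse(p_ctrl, y_ctrl, h))


def att_kernel_matching(p_treat, y_treat, p_ctrl, y_ctrl, h):
    """Mean over treated units of observed outcome minus kernel-weighted counterfactual."""
    gaps = [yt - kernel_regression(pt, p_ctrl, y_ctrl, h)
            for pt, yt in zip(p_treat, y_treat)]
    return sum(gaps) / len(gaps)


# Hypothetical toy data: log wages linear in the propensity score,
# with treated outcomes shifted down by 0.07.
p_ctrl = [k / 20 for k in range(1, 20)]
y_ctrl = [4.8 + 0.2 * p for p in p_ctrl]
p_treat = [0.3, 0.5, 0.7]
y_treat = [4.8 + 0.2 * p - 0.07 for p in p_treat]

h_opt = choose_bandwidth(p_ctrl, y_ctrl, grid=[0.02, 0.05, 0.2, 0.5])
att = att_kernel_matching(p_treat, y_treat, p_ctrl, y_ctrl, h_opt)
print(h_opt, att)
```

Note that the cross validation uses only the comparison group, for whom the no-treatment outcome is observed, which mirrors the logic of the text. Standard errors would be obtained by repeating `att_kernel_matching` on resampled data.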
We see from the first row in Table 3 that motherhood results in 7.4% lower average wages for the group of mothers; a result in the lower end of the estimates in the international literature (see, e.g. Budig & England, 2001; Phipps et al., 2001; Waldfogel, 1998a) and consistent with the results found in Millimet (2000) for the US. The negative impact may to a large degree reflect that mothers make different choices than non-mothers in terms of non-participation, part time, choice of occupation, etc. The penalty could also be caused by job flexibility, a non-pecuniary benefit, within the public sector where the mothers are over-represented: Besides having extra care days targeted towards children’s needs, mothers may be reallocated to less stressful or time-consuming jobs, for example, jobs involving only standard hours. We also conclude that we obtain reasonably balanced covariates after
Table 3. Estimated Treatment Effects.

Parameter     Estimate     Standard Error
Women         -0.074       0.005
Men            0.060       0.006
Note: Dep. Variable: Log Hourly Wage Rate in 1997. Full Sample: 15,958 Mothers, 13,048 Non-mothers, 13,993 Fathers, and 21,067 Non-fathers. Bold coefficients significant at the 5% level. Densities were estimated using a Gaussian kernel and a bandwidth chosen via cross validation. The overlapping support region was determined using a 2% trimming rule. Standard errors are based on 99 bootstraps with 100% resampling. Kernel-based matching.
matching on our estimated propensity score. In no case do the standardised differences in means for covariates (Rosenbaum & Rubin, 1985) exceed 6.7% and the majority is below 2.0%.
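For reference, a standardised difference in means of the kind cited above can be computed along the following lines. The sketch assumes the common definition (difference in means as a percentage of the square root of the average of the two group variances) and unweighted groups; with kernel matching the comparison-group mean would be a weighted mean, and the choice of which samples supply the variances is a convention that this sketch does not settle. The covariate values are hypothetical.

```python
import math
import statistics


def standardised_difference(x_treat, x_ctrl):
    """Difference in means as a percentage of the pooled standard deviation."""
    mean_gap = statistics.mean(x_treat) - statistics.mean(x_ctrl)
    pooled_sd = math.sqrt(
        (statistics.variance(x_treat) + statistics.variance(x_ctrl)) / 2.0
    )
    return 100.0 * mean_gap / pooled_sd


# Hypothetical covariate (age in years) for treated and matched comparison units.
age_treat = [31, 33, 35, 32, 34]
age_ctrl = [30, 33, 36, 32, 34]
print(standardised_difference(age_treat, age_ctrl))
```

A value close to zero for every covariate indicates that the matched groups are balanced on observables.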
4.2. Results for Men

The next set of results concerns the fathers in our population. Again, we model this propensity score by a standard probit using the same conditioning set as for the motherhood probit. The results from the fatherhood probit are seen in Table 2. Qualitatively, the results mirror those for the motherhood probit, although there is some variation in the effects of types of education compared to the results from the motherhood propensity. Surprisingly, the similarity of the results also holds with regard to parental educational attainment: We find that men’s fertility choice is more affected by their mother’s level of education than it is by their father’s. Only if the father has a medium length further education do we find a significant (and negative) effect on fatherhood. The fatherhood model does not perform as well as the motherhood model in terms of wrong predictions. Here, 9,646 out of 35,060 or 27.5% of the predictions are wrong and Efron’s R² of 0.24 is clearly not as high either.

Fig. 2 shows the non-parametric regression of outcomes on the predicted propensity of fatherhood and the smoothed densities of the propensity scores for both fathers and non-fathers. As for mothers, we see that fathers have more probability mass concentrated around high values of the propensity score compared to non-fathers, and the supports are overlapping. Interestingly, neither fathers nor non-fathers have very high (above 0.90) predicted propensities. We also
Fig. 2. Non-Parametric Regression of Log Wages on Predicted Propensity of Fatherhood and Non-Parametric Estimates of Densities of Fatherhood. Bandwidths Using Silverman (1986) Rule of Thumb. Upper panel: log wage against propensity score; lower panel: density against propensity score. Series: fathers, non-fathers.
observe that for low values of the predicted propensity, there is virtually no wage difference between fathers and non-fathers. For higher values of the propensity, however, fathers have higher wages compared to non-fathers. Put differently, the sign of the wage effect seems to be unambiguously positive, yet this phenomenon seems to be driven by men with a high probability of fatherhood. Note again that the support is thin for high values of the estimated propensity, especially for non-fathers.

Again, we use leave-one-out cross validation to select the optimal bandwidth and we manage to balance the covariates fairly well after matching. In fact, the standardised differences in means for covariates are all below 7.5% with the majority below 2.0%. We find that fatherhood results in a significant increase of 6.0% in average wages for the population of fathers (see Table 3). The fact that fathers seem to gain from fatherhood is consistent with the literature, yet the effect is comparatively small: Millimet (2000) finds that fathers gain as much as 12% from fatherhood. As is the case with mothers, the wage effect may be due to fathers’ different labour-market choices. If fathers work longer hours than non-fathers (see Pencavel, 1986), for example, this may affect the average hourly wage.

Given the different incentives for mothers and fathers to take leave and the discrepancy between their levels of leave-taking, it is not surprising that the estimated parameters are not the same either. Furthermore, the difference between the gender-specific estimates may stem from family responsibilities; women clearly engage the most in caring for children. Using time-use data on 6,624 adult Danes in 2001, Bonke (2002) finds that women with children below the age of seven spend close to 6 h/day on home production. The corresponding number of hours for men is 4.
Single men or men in relationships without children or children older than seven years report to be spending between 2 and 3 h/day. Women with children older than seven report spending 4.5 h/day, whereas singles or women living in relationships without children report spending between 2 and 4 h/day. Only for lone parents and singles below 45, there are no statistical differences between the hours spent on home production between the genders.
5. SENSITIVITY ANALYSIS

Realising that unobserved factors may affect both the selection into parenthood as well as our outcome variable, we investigate the sensitivity of our results in this dimension. Ichino et al. (forthcoming) suggest such a procedure for performing sensitivity analysis of the matching estimates to
departures from unconfoundedness. The authors argue that, like Rosenbaum (1987), their analysis does not impose any parametric assumptions on the outcome equation, yet unlike Rosenbaum (1987) they manage to uncover point estimates of the average treatment effect on the treated. In particular, Ichino et al. (forthcoming) consider the case of a binary unobserved confounding factor U. One possible interpretation of U is that of unobserved ability or motivation.

The principle behind the sensitivity analysis is the following: Firstly, one specifies the distribution of U depending on outcome and treatment status. In our case, we allow the distribution of U to depend on the observed log wages and parenthood status, C:

\[ p_{ij} = \Pr(U=1 \mid C=i, I(Y>y)=j), \quad i, j \in \{0,1\} \]

where I(Y>y) is an indicator for the observed log wage being larger than the median log wage. We let

\[ p_{11} = 0.9, \quad p_{10} = 0.3, \quad p_{01} = 0.9, \quad p_{00} = 0.6 \]

This characterises the case where the unobserved confounder is positively correlated with the outcome and parents have lower expected unobserved abilities or motivation than non-parents. That is, we essentially allow for the situation where individuals select into parenthood based on Y0 or the difference between Y1 and Y0, the wage gain from parenthood. Parents with high observed wages, the high achievers, however, are assumed to have the same level of the unobserved component as high-wage non-parents, whereas parents with lower observed wages, the low achievers, are assumed to have a lower expected unobserved component compared to low-wage non-parents. Under this regime we expect to overestimate the penalty for mothers and underestimate the premium for fathers. An alternative setup would be to let both high- and low-achieving parents have a lower expected unobserved component compared to non-parents. This would have the unrealistic implication, however, that within the group of high achievers, parents would have lower unobserved ability or motivation (i.e. greater luck) than non-parents.
The next step is to attribute a value of U to each parent according to the values of pij, include this value in the conditioning set, and match on it just as any other observed covariate. We find that U affects the propensity of
Table 4. Estimated Mean Treatment Effects Incorporating Simulated Confounding Unobservables.

Parameter      Average    Minimum    Maximum
θ_women         -0.062     -0.101     -0.055
θ_men            0.071      0.069      0.072

Note: Densities were estimated using a Gaussian kernel and a bandwidth chosen via cross validation. The overlapping support region was determined using a 2% trimming rule. Average is based on 10 simulated confounding unobservables.
parenthood negatively and the coefficients in both the motherhood and fatherhood probits are highly significant.11 For women, the standardised difference in means for U is approximately 12% whereas the corresponding difference is around 3% for men. This procedure is then ideally repeated a large number of times, say 10,000, and the average treatment effect from these repetitions is taken as the estimate of the mean effect of treatment on the treated. Unfortunately, though intuitively appealing, this last step renders the suggested strategy infeasible when using kernel matching instead of nearest-neighbour matching because we have to use bootstrap methods to estimate the variance. We have two parameters of interest; estimating one of these parameters takes approximately 3 h with 99 bootstraps. Estimating both of them continuously 10,000 times would then take 2,500 full days. Therefore, instead of performing the full-blown sensitivity analysis, we choose a much less ambitious course of action and consider a much lower number of repetitions. We realise, of course, that this makes the precision of our estimate extremely poor. The results are shown in Table 4. We see that inclusion of the confounder U only affects the estimated parameters to a very small degree and in the expected directions. Clearly, if we choose a setup where parents have a much lower probability of having a higher U than non-parents, the effects on our estimated parameters will be greater.
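The simulation step described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the `estimate_att` callback (standing in for the matching estimator that includes U in the conditioning set) are hypothetical, U is drawn for every individual in the sample, and the median is taken over the sample passed in.

```python
import numpy as np

# p[(i, j)] = Pr(U = 1 | C = i, I(Y > median) = j), values as given in the text.
P = {(1, 1): 0.9, (1, 0): 0.3, (0, 1): 0.9, (0, 0): 0.6}


def simulate_confounder(parent, log_wage, p=P, rng=None):
    """Draw one binary U per individual from the cell probabilities p_ij."""
    rng = np.random.default_rng() if rng is None else rng
    parent = np.asarray(parent, dtype=int)
    log_wage = np.asarray(log_wage, dtype=float)
    # j = 1 if the observed log wage exceeds the sample median, else 0.
    high = (log_wage > np.median(log_wage)).astype(int)
    prob = np.array([p[(c, h)] for c, h in zip(parent.tolist(), high.tolist())])
    return (rng.random(prob.shape) < prob).astype(int)


def average_over_draws(estimate_att, parent, log_wage, n_draws=10, seed=0):
    """Re-estimate the effect for several simulated U and summarise the draws.

    estimate_att is a hypothetical user-supplied function that takes the
    simulated U and returns an effect estimate (e.g. from a matching routine).
    """
    rng = np.random.default_rng(seed)
    draws = [
        estimate_att(simulate_confounder(parent, log_wage, rng=rng))
        for _ in range(n_draws)
    ]
    return float(np.mean(draws)), float(np.min(draws)), float(np.max(draws))
```

The three returned numbers correspond to the Average, Minimum and Maximum columns of a table such as Table 4; with a bootstrap-based estimator inside `estimate_att`, the run time grows linearly in `n_draws`, which is the constraint discussed above.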
6. DISCUSSION

In this chapter, we provide important new empirical evidence on the processes governing the selection into motherhood versus fatherhood. We find that educational attainment information about the parents of the individuals in
our estimation sample seems to be important for the selection into both motherhood and fatherhood. In particular, maternal educational attainment is found to affect both daughters’ and sons’ fertility decisions. Furthermore, we consider the effects on wages of motherhood as well as fatherhood. Many empirical studies investigate the former parameter, whereas the latter has received little attention in the literature. We contribute to the existing literature on the effect of parenthood on wages by using high-quality data and by implementing propensity score matching that in many respects is less restrictive than what has been used to date. An important feature of the identification strategy is that it allows for heterogeneous treatment effects. We find that employed mothers receive, on average, 7.4% lower wages than non-mothers, a statistically significant difference. Employed fathers, however, receive 6.0% higher wages than non-fathers. We argue that this pattern is not surprising given that mothers are much more likely to take long spells of leave in connection with childbirth and spend significantly more time per day in home production. Interestingly, a back-of-the-envelope calculation suggests that family earnings are relatively unaffected by parenthood. Of course, assessing the distribution of welfare within the household would require ambitious modelling of intra-family bargaining. This is, however, beyond the scope of this chapter and is left for future research.
NOTES

1. All former studies evaluating the family gap, except Simonsen and Skipper (2006), see Section 1, have assumed a common treatment effect.
2. Notice that these assumptions are generally stronger than the mean independence assumption of Heckman et al. (1998b). See Lechner (2001) for discussion on this point.
3. Moreover, in 1996 the number of adoptions amounted to only 600 in total. In our sample this would amount to approximately 30 adoptions in 1996.
4. Note that strictly speaking we do not need the probit to be the exact propensity score. We only need it to balance the distributions of our attributes; hence, we need it to be a balancing score.
5. Habitation outside the capital area clearly increases the probability of motherhood.
6. The literature does not agree on whether this link is causal (nurture) or due to ability (nature). For our purposes, however, it is sufficient that the link exists.
7. We define the number of wrong predictions as $\sum_{i=1}^{n} (C_i - 1(\hat{p}_i \geq .5))^2$.
8. We use the biweight, triweight, Normal, Epanechnikov, and uniform kernels.
9. Our (GAUSS) program does the following. Firstly, the bandwidth using Silverman’s rule-of-thumb (Silverman, 1986), hs, is calculated for the different
kernels. RMSE is then calculated using the data at 100 different bandwidths selected symmetrically around each hs and the bandwidth resulting in the smallest observed RMSE is used in subsequent analysis. If RMSE as a function of h is found not to be convex around hs the programme instead calculates RMSE at each percentile and returns the bandwidth resulting in the smallest RMSE among the 100.
10. Note that with our data, impacts are highly insensitive to the choice of bandwidth as well as kernel. Also, performing nearest-neighbour matching both with and without replacement turned out to give similar results.
11. Results are available on request.
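The bandwidth search described in note 9 might be sketched along the following lines in Python. The authors' program is written in GAUSS and is not reproduced here; this is a rough illustration under explicit assumptions. The note does not say how RMSE is computed, so the sketch assumes a leave-one-out RMSE of a Gaussian-kernel local-constant regression; the grid half-width (`width`) and all function names are hypothetical, and the fallback for a non-convex RMSE curve is omitted.

```python
import numpy as np


def silverman_bandwidth(x):
    """Silverman's (1986) rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(sd, (q75 - q25) / 1.34)
    return float(0.9 * spread * x.size ** (-0.2))


def loo_rmse(x, y, h):
    """Leave-one-out RMSE of a Gaussian-kernel regression of y on x (assumed criterion)."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(w, 0.0)  # exclude each observation from its own fit
    denom = w.sum(axis=1)
    ok = denom > 0            # skip points with no neighbours carrying weight
    fitted = (w[ok] @ y) / denom[ok]
    return float(np.sqrt(np.mean((y[ok] - fitted) ** 2)))


def search_bandwidth(x, y, n_grid=100, width=0.5):
    """Evaluate n_grid bandwidths placed symmetrically around the rule-of-thumb value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h_s = silverman_bandwidth(x)
    grid = np.linspace(h_s * (1 - width), h_s * (1 + width), n_grid)
    rmse = np.array([loo_rmse(x, y, h) for h in grid])
    return float(grid[np.argmin(rmse)]), h_s
```

Here `x` would play the role of the estimated propensity score and `y` the outcome among comparison observations; other kernels from note 8 could be substituted in `loo_rmse`.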
ACKNOWLEDGMENTS

We thank the Danish Social Science Research Council for financial support and the National Centre for Register-Based Research for support to access data from Statistics Denmark. We greatly appreciate the comments from Editor Dann Millimet and two anonymous referees. Any remaining errors are our own.
REFERENCES

Behrman, J. R., & Rosenzweig, M. R. (2002). Does increasing women’s schooling raise the schooling of the next generation? American Economic Review, 92, 323–334.
Black, D. A., & Smith, J. A. (2004). How robust is the evidence on the effects of college quality? Evidence from matching. Journal of Econometrics, 121, 99–124.
Black, S. E., Devereux, P., & Salvanes, K. (2005). Why the apple doesn’t fall far: Understanding the intergenerational transmission of education. American Economic Review, 95, 437–449.
Bonke, J. (2002). Tid og velfærd [Time and welfare (in Danish)]. Working Paper no. 26. Danish Social Research Institute, Copenhagen.
Browning, M. (1992). Children and household economic behavior. Journal of Economic Literature, 30, 1434–1475.
Budig, M. J., & England, P. (2001). The wage penalty for motherhood. American Sociological Review, 66, 204–225.
Heckman, J., Ichimura, H., Smith, J. A., & Todd, P. (1998a). Characterizing selection bias using experimental data. Econometrica, 66, 1017–1098.
Heckman, J., Ichimura, H., & Todd, P. (1998b). Matching as an econometric evaluation estimator. Review of Economic Studies, 65, 261–294.
Heckman, J., LaLonde, R., & Smith, J. A. (1999). The economics and econometrics of active labor market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. 3, pp. 1865–2097). Amsterdam: Elsevier Science.
Ichino, A., Mealli, F., & Nannicini, T. (Forthcoming). From temporary help jobs to permanent employment: What can we learn from matching estimators and their sensitivity? Journal of Applied Econometrics.
Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–475.
Lechner, M. (1999). Earnings and employment effects of continuous off-the-job training in East Germany after unification. Journal of Business and Economic Statistics, 17, 74–90.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In: M. Lechner & F. Pfeiffer (Eds), Econometric evaluation of labour market policies (pp. 43–58). Heidelberg: Physica/Springer.
Lechner, M. (2002a). Some practical issues in the evaluation of heterogeneous labour market programmes by matching methods. Journal of the Royal Statistical Society Series A, 165, 59–82.
Lechner, M. (2002b). Program heterogeneity and propensity score matching: An application to the evaluation of active labor market policies. The Review of Economics and Statistics, 84, 205–220.
Millimet, D. L. (2000). The impact of children on wages, job tenure, and the division of household labour. Economic Journal, 110, 139–157.
Pencavel, J. (1986). Labor supply of men: A survey. In: O. Ashenfelter & R. Layard (Eds), Handbook of labor economics (Vol. 1, pp. 3–102). New York: Elsevier Science.
Phipps, S., Burton, P., & Lethbridge, L. (2001). In and out of the labour market: Long-term income consequences of child-related interruptions to women’s paid work. Canadian Journal of Economics, 34, 411–429.
Plug, E. (2004). Estimating the effect of mother’s schooling on children’s schooling using a sample of adoptees. American Economic Review, 94, 358–368.
Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society Series A, 147, 656–666.
Rosenbaum, P. R. (1987). Sensitivity analysis to certain permutation inferences in matched observational studies. Biometrika, 74, 13–26.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33–38.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall/CRC.
Simonsen, M., & Skipper, L. (2006). The costs of motherhood: An analysis using matching estimators. Journal of Applied Econometrics, 21, 919–934.
Waldfogel, J. (1998a). The family gap for young women in the United States and Britain: Can maternity leave make a difference? Journal of Labor Economics, 16, 505–545.
Waldfogel, J. (1998b). Understanding the “family gap” in pay for women with children. Journal of Economic Perspectives, 12, 137–156.
THE EMPLOYMENT EFFECTS OF JOB-CREATION SCHEMES IN GERMANY: A MICROECONOMETRIC EVALUATION

Marco Caliendo, Reinhard Hujer and Stephan L. Thomsen

ABSTRACT

In this chapter, we evaluate the employment effects of job-creation schemes (JCS) on the participating individuals in Germany. JCS are a major element of active labour market policy in Germany and are targeted at long-term unemployed and other hard-to-place individuals. Access to very informative administrative data of the Federal Employment Agency justifies the application of a matching estimator and allows us to account for individual (group-specific) and regional effect heterogeneity. We extend previous studies for Germany in four directions. First, we are able to evaluate the effects on regular (unsubsidised) employment. Second, we observe the outcomes of participants and non-participants for nearly three years after the programme starts and can therefore analyse medium-term effects. Third, we test the sensitivity of the results with respect to various decisions that have to be made during implementation of the matching estimator. Finally, we check if a possible occurrence of a specific form of ‘unobserved heterogeneity’ distorts our interpretation. The overall results are rather discouraging, since the employment effects are negative or insignificant for most of the analysed groups. One exception is the long-term unemployed, who benefit from participation at the end of our observation period. Hence, one policy implication is to target the programmes more closely at this problem group.

Modelling and Evaluating Treatment Effects in Econometrics
Advances in Econometrics, Volume 21, 381–428
Copyright © 2008 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(07)00013-8
1. INTRODUCTION

The German labour market is plagued by persistently high unemployment combined with a clear East-West divide, as reflected by the unemployment rates of 9.3% in West and 20.1% in East Germany in 2003. The Federal Employment Agency (FEA) spends substantial amounts on active labour market policies (ALMP) every year in an attempt to overcome this unemployment problem (West: 12.3 billion Euro, East: 8.9 billion Euro). The main goal of ALMP is the permanent integration of unemployed persons into regular employment. In 1998, Social Code III (Sozialgesetzbuch III) was enacted as the legal basis for ALMP. The main features of the law were the introduction of new instruments, decentralisation of competencies and more flexible allocation of funds. Perhaps the most important change from an evaluator’s point of view was the legal requirement of a mandatory impact evaluation for all ALMP measures.1

One major element of ALMP in Germany in recent years has been job-creation schemes (JCS). Their importance is currently decreasing as they have often been criticised for the lack of explicit qualificational elements and ‘stigma effects’.2 However, it can also be argued that they provide a reasonable opportunity for individuals who are not able to reintegrate into the first (unsubsidised) labour market by themselves or who do not fit the criteria for other programmes, e.g. long-term unemployed or other hard-to-place individuals. A thorough evaluation of JCS in Germany has been impossible for a long time because available survey datasets, such as the German Socio-economic Panel (SOEP) or the Labour Market Monitors for East Germany and Saxony-Anhalt, concentrate on East Germany and/or contain only a small number of participants. Due to the small number of observations, earlier studies were not able to account for effect heterogeneity, e.g. by estimating the effects for subgroups of the labour
market. Hence, drawing policy-relevant conclusions (for subgroups as well as for West Germany) from those evaluations is problematic. The picture of the effects in the earlier studies is mixed. Whereas, e.g. Steiner and Kraus (1995) find short-term positive effects for men in East Germany, but no significant effects for women, the extended analysis in Kraus, Puhani, and Steiner (2000) results in negative effects for the participating individuals. In line with this finding are the results of Hübler (1997), who states that the programmes do not achieve the expected positive impacts. In contrast, Eichler and Lechner (2002), who analyse the effects for Saxony-Anhalt, find a reduction of unemployment for participants. Overall, these findings do not show a clear tendency about the effects of JCS in Germany.

With the 1998 reform, things have changed, giving us access to very rich administrative data containing more than 11,000 participants starting a JCS in February 2000 and a comparison group of nearly 220,000 non-participants. We use these data to answer the question whether JCS enhance the employment chances of participating individuals. Since we observe the labour market outcomes of the individuals for 35 months after the programmes start, our main interest is in the short-to-medium-term employment effects of the programmes. The extensive set of available individual characteristics in combination with the information on the regional labour market situation makes the application of a matching estimator possible. Additionally, the large number of observations allows us to account for several sources of effect heterogeneity. The importance of effect heterogeneity for the evaluation of JCS in Germany has been well documented in Hujer, Caliendo, and Thomsen (2004). Basically, there are two shortcomings to that study. 
The first one refers to an unsatisfying outcome variable that allows us to only determine whether the individual is registered unemployed or not, but does not allow us to draw conclusions about the reintegration success into regular (unsubsidised) employment. A second restriction relates to the relatively short observation period after the programme starts, i.e. two years. This chapter extends the previous analyses for Germany in four important directions. First, we are able to evaluate the reintegration effects into regular (unsubsidised) employment. Second, we can measure the employment status of participants and non-participants for nearly three years after the programme starts. Third, we test the sensitivity of the results to various decisions, which have to be made while implementing the matching estimator, like the choice of the matching algorithm. Finally, we check if a possible occurrence of a ‘hidden bias’ distorts the interpretation of our results.
The focus of our analysis will be individual (group-specific) and regional effect heterogeneity. To do so, we separate the analysis by several characteristics and carry out the estimation on subpopulations.3 Men and women in West and East Germany will be the ‘main groups’ of our analysis. In addition, we estimate the effects for 11 ‘subgroups’ defined by age and unemployment duration as well as by specific characteristics indicating disadvantages on the labour market such as the lack of occupational training or the existence of placement restrictions due to health problems. The remainder of this chapter is organised as follows. In the following Section, we describe the institutional background of JCS in Germany and introduce the dataset used. Section 3 explains the general framework for microeconometric evaluation analysis and Section 4 deals with the empirical implementation of the matching estimator. In particular, we discuss the justification of the matching estimator and the estimation of the propensity scores in Section 4.1. The choice of the proper matching algorithm is discussed in Section 4.2, whereas Section 4.3 deals with common support issues and Section 4.4 presents some quality indicators for the chosen matching algorithm. In Section 5, we present the results for the main groups and the various subgroups. Additionally, we also test the sensitivity of our estimates with respect to a specific form of unobserved heterogeneity. The final section concludes and gives some policy recommendations.
2. INSTITUTIONAL BACKGROUND, DATASET AND SELECTED DESCRIPTIVES

2.1. Institutional Background

German active as well as passive labour market policy is mainly paid for with unemployment insurance (UI) funds. Payment to UI is compulsory for all employees (except civil servants, judges, clergymen, soldiers and some other groups) and amounts to 6.5% of an employee’s gross salary. Persons who have contributed for at least 12 months to UI during the last three years before unemployment,4 are entitled to unemployment benefits (Arbeitslosengeld, UB). UB amount to 60% (67%) of the last average net earnings from insured employment (with at least one dependent child) and are paid from UI funds. The entitlement lasts for at least six months. The maximum duration is up to 32 months and depends on the contribution period and the individual’s age. For persons who had exhausted their UB entitlement, unemployment
assistance (UA) was paid. UA amounted to 53% (57%) of the last average net earnings from insured employment (with at least one dependent child). UA could be paid indefinitely (until retirement age) if the individual satisfied the benefit conditions. UA was administered by the FEA, but was funded from tax revenue.5

With about one million participants between 2000 and 2004 and expenditures of 12 billion Euro, JCS have been a major element of ALMP in Germany.6 JCS are subsidised jobs for unemployed persons who face barriers to employment. Programmes aim at providing participants with a stable foundation and relevant qualifications for later reintegration into regular employment. Hence, JCS are primarily targeted at specific problem groups in the labour market, such as the long-term unemployed (with more than one year of unemployment), or persons without work experience or occupational training. Both the target groups and the design of JCS differ strongly from the most important ALMP programme in Germany, which is vocational training (VT). VT programmes are specifically designed to build up human capital (over a participation period of up to two years) that should pay off in the long run. The effects of VT in Germany have been recently analysed by Lechner, Miquel, and Wunsch (2004) and Fitzenberger, Osikominu, and Völter (2005) based on data very similar to ours. As it turns out, the effects for VT become positive only in the long run (after approximately five years). For JCS, however, positive effects should be visible earlier, since the main goal of JCS is to increase the employability of individuals. To prevent possible negative effects for businesses and regular workers, the law states that JCS should only be granted if they support activities that are additional in nature and for the collective good. Additional in nature means that the activities would not be executed without the subsidy. 
Therefore, the majority of jobs are promoted in the public and noncommercial sector; measures with a predominantly commercial purpose have been excluded (up to January 2002). Since JCS are co-financed measures, support takes the form of a wage subsidy to the employer covering 30–75% of the costs, usually for a period of 12 months. Exceptions can be made for a higher subsidy rate (up to 100%) as well as a longer duration (up to 24 months) for persons with strong labour market disadvantages or projects of high priority. To become eligible for participation, individuals have to be long-term unemployed, i.e. unemployed for more than one year,7 and have to fulfil the requirements for unemployment compensation. However, caseworkers are allowed to fill 5% of the available places with individuals who do not meet these conditions. Further exceptions can be made for young unemployed (under 25 years) without occupational training and disabled
persons. Short-term unemployed people could be placed if they act as tutors for other participants in JCS. Up to 2004, participants’ eligibility for unemployment benefits was automatically renewed in the same way as if they were in regular employment. An important issue to be discussed in the evaluation of the programme effects is how individuals are selected into programmes. In particular, answering the question of why certain unemployed persons participate while others do not is important for modelling the participation decision and the choice of the comparison group. The following discussion relies on both evidence from interviews with caseworkers and the institutional setting. Participation in a JCS results from placement by the caseworker responsible for a given worker in the labour office. The caseworkers evaluate the unemployed persons’ efforts in finding a job and the individuals’ employment probabilities in meetings during the unemployment spell. If the caseworker assesses the unemployed person’s situation as requiring assistance through participation in a JCS, he can offer the individual a place if an opening is available, i.e. the particular job offer must relate to the individual’s qualifications as assessed by the caseworker in cooperation with the potential participant. Thus, assignment to programmes depends on the assessment of the individual’s need of assistance by the local labour office on the one hand, and on the availability of jobs in JCS at that time on the other. The responsible caseworker can cancel a JCS placement at any time if the participant can be placed in regular employment. Moreover, if an individual rejects the offer of a JCS, the labour office can stop the unemployment benefits for up to 12 weeks after the first rejection, and in the case of repeated rejection the unemployed person may lose his/her benefit entitlement altogether.8
2.2. Dataset

The empirical analysis is based on data from four administrative sources of the FEA that were merged for this purpose. The merged dataset contains information on a cross section of participants who started a JCS in February 2000 and on a sample of unemployed persons in January 2000 who did not participate in the following month. Due to the strongly different situations of the labour market in East and West Germany, we analyse the two regions separately. Taking into account previous empirical findings (Hujer et al., 2004), we also separate the analysis by gender. Furthermore, we excluded Berlin from the analysis: the special situation in the capital city would require a separate
evaluation. However, the small number of participants makes it difficult to draw generalisable conclusions from the results. Our final sample contains 11,151 participants and 219,622 non-participants for whom we observe the employment status up to December 2002, which is almost three years after the programmes start.

The main source of information to describe the individual labour market situation is the job-seekers database (Bewerberangebotsdatei, BewA) and an adjusted version for statistical purposes (ST4). These sources contain information on all individuals registered at the labour offices as unemployed or facing impending unemployment. The data sources provide each individual’s unemployment status information together with important information on the job-seeker’s socio-demographic situation, qualification details and labour market history. This information is amended by attributes of subsidised employment programmes (ST11), e.g. the economic sector and the programme duration. These three sources constitute a prototype version of the programme participant’s master dataset (Maßnahme-Teilnehmer-Gesamtdatei, MTG).9 For this reason, the MTG contains numerous attributes to describe individual characteristics on the one hand, and provides a reasonable basis for the construction of the comparison group on the other. It should be noted that all information included in the MTG is obtained by the local caseworkers, i.e. it comprises the set of observable characteristics they use to evaluate the individual’s employability as well as some subjective assessments. As the local labour market environment is an important determinant of programme assignment and impacts, we complete our set of attributes by regional dummies according to the classification of similar and comparable labour office districts by the FEA. 
This classification was undertaken by a project group of the FEA (Blien et al., 2004) whose aim was to enhance the comparability of the labour office districts for a more efficient allocation of funds. It categorises the 181 German labour office districts into 12 comparable clusters. The comparability of the labour office districts is built upon several labour market characteristics. The most important criteria are the underemployment rate and the corrected population density. In addition to that, the vacancy ratio describing the relation of all reported vacancies at the labour office to the number of employed persons, the placement ratio that contains the number of placements in relation to the number of employed persons, and the ratio of people who receive maintenance allowance in relation to the underemployment rate are used. Furthermore, an indicator for the tertiarisation level built on the number of employed persons in agricultural occupations and an indicator for seasonal unemployment are
considered. The 12 types of comparable labour office districts can be summarised into five broader types characterised by their local labour market conditions. Because all East German labour office districts (except the city of Dresden) belong to the first cluster, we use the finer classification (Clusters Ia, Ib, Ic and II) for the East, whereas for West Germany we rely on the broader one (Clusters II–V). The clusters are ordered according to the labour market prospects starting with the worst labour market environment (Cluster Ia).

Finally, the outcome variable (regular and unsubsidised employment) is taken from the Employment Statistics Register (Beschäftigtenstatistik, BSt). The BSt contains information on all persons registered in the social security system (employees and participants in several ALMP programmes). We define only regular employment as success; all kinds of subsidised employment and/or participation in ALMP programmes are defined as failure. For this reason, we have to complete the outcome variable by adding information from the final version of the MTG on all periods spent by individuals in ALMP programmes. We are thus able to explicitly identify regular (unsubsidised) employment as the outcome of interest.

Table 1 shows that the largest participating groups are women (5,035) and men (2,924) in East Germany. In West Germany, 2,140 men and 1,052 women started a JCS in February 2000. Due to the large number of observations in our sample, we are also able to analyse the programme effects for specific problem groups in the labour market. 
We distinguish three age categories (younger than 25 years, between 25 and 50 years, and older than 50 years), three unemployment durations (up to 13 weeks, between 13 and 52 weeks, and more than 52 weeks), persons without work experience, persons without occupational training and persons with a high educational degree (college and university graduates).10 In addition, we analyse the effects for rehabilitation attendants,11 and for individuals for whom the caseworkers have noted placement restrictions due to health problems. In total, we get 11 ‘subgroups’ for whom the effects will be estimated separately in both regions and for both genders. Table 1 indicates the number of observations in each of these groups, differentiated by participation status. What can be seen as most important is that nearly all groups contain a reasonable number of participants, allowing a proper estimation and interpretation of the effects.12 The figures from Table 1 show that the majority of participants come from East Germany. A look at the individual characteristics measured in February 2000 shows regional differences too. To save space we do not discuss these differences in detail here, but we do want to mention two important findings. The descriptives show that the allocation of individuals to JCS is not target-oriented
Table 1. Number of Observations in Main Groups and Subgroups.

                                               West Germany                          East Germany (a)
                                           Men               Women               Men               Women
Groups                                 Part. Non-Part.     Part. Non-Part.     Part. Non-Part.     Part. Non-Part.
Total (main group)                     2,140    44,095     1,052    34,227     2,924    64,788     5,035    76,512
Subgroups
Age (in years)
  <25                                    458     4,102       182     2,443       240     8,743       148     4,864
  25–50                                1,337    23,560       709    19,732     1,571    35,927     3,342    44,329
  >50                                    345    16,433       161    12,052     1,113    20,118     1,545    27,319
Unemployment duration (in weeks)
  <13                                    558    12,198       237     7,561       578    22,003       575    12,447
  13–52                                  744    13,909       403    12,235     1,248    22,864     1,970    26,657
  >52                                    838    17,988       412    14,431     1,098    19,921     2,490    37,408
Without occup. experience                273     3,281       159     2,548       293     7,023       498     7,945
Without occup. training                1,340    21,659       476    17,093       837    14,966     1,121    19,776
With high degree (b)                     112     1,486       146     1,165       146     2,682       191     1,619
Rehabilitation attendant                 111     2,763        44     1,063       218     4,849       156     3,520
With placement restrictions              354     9,516       148     5,993       394    10,470       376     9,121

Note: All individual characteristics are measured at the end of January 2000. Classification into the different subgroups is based on this point in time too.
(a) Observations from the labour office districts of Berlin are excluded.
(b) Persons with a college or university certificate/diploma.
according to the legal rules. The results also indicate a different purpose of JCS in the two regions, where JCS are more target-oriented (e.g. for young unemployed without occupational training) in West Germany but are also used to relieve the tense situation in the labour market in East Germany.13
3. ESTIMATION OF TREATMENT EFFECTS WITH MATCHING ESTIMATORS

3.1. Potential Outcome Framework and Selection Bias

Since we work with non-experimental data, we have to deal with some identification issues. As we consider only one specific programme compared to non-participation, we can use the potential outcome framework with two potential outcomes $Y^1$ (individual receives treatment) and $Y^0$ (individual does not receive treatment).14 The actually observed outcome for any individual $i$ can be written as $Y_i = Y_i^1 D_i + (1 - D_i) Y_i^0$, where $D \in \{0, 1\}$ is a binary treatment indicator. The treatment effect for each individual $i$ is then defined as the difference between her potential outcomes, $\Delta_i = Y_i^1 - Y_i^0$. Since there will never be an opportunity to estimate individual effects with confidence, we concentrate on population averages of impacts from treatment. The most prominent evaluation parameter is the average treatment effect on the treated (ATT), which focuses explicitly on the effects on those who actually participate in the programme.15 It is given by

$$\mathrm{ATT} = E(\Delta \mid D = 1) = E(Y^1 \mid D = 1) - E(Y^0 \mid D = 1) \qquad (1)$$
Given Eq. (1), the problem of selection bias is straightforward to see, since the second term on the right-hand side is unobservable. If the condition $E(Y^0 \mid D = 1) = E(Y^0 \mid D = 0)$ holds, we can use the non-participants as an adequate comparison group. However, with non-experimental data it will usually not hold. Consequently, estimating the ATT by the difference in the subpopulation means of participants $E(Y^1 \mid D = 1)$ and non-participants $E(Y^0 \mid D = 0)$ will lead to a selection bias, since participants and non-participants are selected groups that would have different outcomes even in the absence of the programme. This bias may come from observable factors like age or skill differences or unobservable factors like motivation. For both cases different estimation strategies are available.16 If one is willing to assume that selection occurs on observed characteristics only, the matching estimator is an appealing choice. Its basic idea is to select from a large group of non-participants those individuals who are similar to the treated group in all relevant (observable) characteristics, i.e. those characteristics that influence the participation decision and outcomes simultaneously.17
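The selection bias described above can be illustrated with a small simulation. The sketch below is not taken from the chapter: the data-generating process, the variable `trait` and all numerical values are assumptions chosen only to show that the naive difference in group means equals the ATT plus the term $E(Y^0 \mid D = 1) - E(Y^0 \mid D = 0)$.

```python
import random
from statistics import fmean


def selection_bias_demo(n=20000, effect=0.05, seed=0):
    """Simulate a trait that drives both participation and the untreated outcome.

    All parameters are illustrative assumptions, not estimates from the chapter.
    Returns (att, naive, bias).
    """
    rng = random.Random(seed)
    y1_treated, y0_treated, y0_untreated = [], [], []
    for _ in range(n):
        trait = rng.random()  # hypothetical characteristic in [0, 1]
        # A lower trait makes participation more likely (probability 0.1 to 0.4).
        d = rng.random() < 0.1 + 0.3 * (1.0 - trait)
        y0 = 0.2 + 0.5 * trait + rng.gauss(0.0, 0.1)  # untreated outcome
        y1 = y0 + effect                               # homogeneous effect
        if d:
            y1_treated.append(y1)
            y0_treated.append(y0)  # counterfactual, unobservable in real data
        else:
            y0_untreated.append(y0)
    att = fmean(y1_treated) - fmean(y0_treated)      # Eq. (1)
    naive = fmean(y1_treated) - fmean(y0_untreated)  # observed group means
    bias = fmean(y0_treated) - fmean(y0_untreated)   # selection bias term
    return att, naive, bias


if __name__ == "__main__":
    att, naive, bias = selection_bias_demo()
    print(f"ATT={att:.4f}  naive={naive:.4f}  bias={bias:.4f}")
```

In this assumed setting participants have lower untreated outcomes than non-participants, so the naive comparison understates the effect. The identity naive = ATT + bias holds by construction in any sample.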
3.2. How Does Matching Solve the Evaluation Problem?

Matching is based on the conditional independence (or unconfoundedness) assumption (CIA), which states that conditional on some covariates $X$, the outcomes $(Y^1, Y^0)$ are independent of $D$. Since we are interested in ATT only, we only need to assume that $Y^0$ is independent of $D$ because the moments of the distribution of $Y^1$ for the treated are directly estimable. As matching on $X$ can become hazardous when $X$ is of high dimension (the ‘curse of dimensionality’), Rosenbaum and Rubin (1983) suggest the use of balancing scores $b(X)$. These are functions of the relevant observed covariates $X$ such that the conditional distribution of $X$ given $b(X)$ is independent of the assignment to treatment, that is $X \perp\!\!\!\perp D \mid b(X)$. For participants and non-participants with the same balancing score, the distributions of the covariates $X$ are the same, i.e. they are balanced across the groups. The propensity score $P(X)$, i.e. the probability of participating in a programme, is one possible balancing score. It summarises the information on the observed covariates $X$ into a single index function. The propensity score can be seen as the coarsest balancing score whereas $X$ is the finest (Rosenbaum & Rubin, 1983). Hence, it is sufficient to assume that (in the notation of Dawid, 1979):

Assumption 1. Unconfoundedness for the comparison group given the propensity score is

$$Y^0 \perp\!\!\!\perp D \mid P(X)$$

where $\perp\!\!\!\perp$ denotes independence. If Assumption 1 is fulfilled, the non-participant outcomes have, conditional on $P(X)$, the same distribution that participants would have experienced if they had not participated in the programme (Heckman, Ichimura, & Todd, 1997a). Hence, if the means exist, $E(Y^0 \mid X, D = 1) = E(Y^0 \mid X, D = 0) = E(Y^0 \mid X)$ and the missing counterfactual mean can be constructed from the outcomes of non-participants. In order for both sides of the equations to be well defined simultaneously for all $X$, the following is usually additionally assumed:

Assumption 2. Weak overlap: $\Pr(D = 1 \mid X) < 1$ for all $X$.

These assumptions are sufficient for identification of Eq. (1). The method of matching can also be used to estimate the ATT at some points $X = x$, where $x$ is a particular realisation of $X$:

$$\mathrm{ATT}(X = x) = E(\Delta \mid X = x, D = 1) = E(Y^1 \mid X = x, D = 1) - E(Y^0 \mid X = x, D = 1) \qquad (2)$$
This parameter measures the mean treatment effect for persons who were randomly drawn from the population of the treated, given a specific realisation of certain characteristics X. This is of particular interest for us, since we want to estimate the effects for subgroups, such as long-term unemployed persons or individuals without work experience.
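The identification argument of Section 3.2 can be written out in three steps. This is a standard textbook derivation added as a reading aid rather than a formula quoted from the chapter. It uses only Eq. (1), Assumption 1 and Assumption 2. The outer expectation is taken over the distribution of $P(X)$ among participants.

```latex
\begin{align*}
\mathrm{ATT}
  &= E(Y^1 \mid D = 1) - E(Y^0 \mid D = 1) \\
  &= E(Y^1 \mid D = 1)
     - E_{P(X) \mid D = 1}\bigl[ E(Y^0 \mid P(X), D = 1) \bigr]
     && \text{(law of iterated expectations)} \\
  &= E(Y^1 \mid D = 1)
     - E_{P(X) \mid D = 1}\bigl[ E(Y^0 \mid P(X), D = 0) \bigr]
     && \text{(Assumption 1)}
\end{align*}
```

Assumption 2 guarantees that non-participants exist at every propensity score observed among participants, so the inner expectation in the last line is well defined and can be estimated from the comparison group.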
4. IMPLEMENTATION OF THE MATCHING ESTIMATOR

After having decided to use matching estimators for evaluation purposes, the researcher is confronted with several questions regarding the implementation of these estimators.18 Every evaluation task requires a careful consideration of the available choices for the given situation. Hence, we will discuss the implementation and justification of the matching estimator in our context in the following sections. We start with the plausibility of the CIA, the definition of the comparison group and the estimation of the propensity scores in Section 4.1. Following that, we choose one matching algorithm to be used in the further analysis in Section 4.2. Sections 4.3 and 4.4 will be concerned with common support and matching quality issues.

4.1. Plausibility of CIA, Comparison Group and Propensity Score Estimation

Clearly, the CIA is in general a very strong assumption and the applicability of the matching estimator depends crucially on the plausibility of the CIA. Blundell, Dearden, and Sianesi (2005) argue that the plausibility of such an assumption should always be discussed on a case-by-case basis, thereby accounting for the informational richness of the data. Implementation of matching estimators requires choosing a set of variables that credibly justify the CIA. Only variables that simultaneously influence the participation decision and the outcome variable should be included in the matching procedure. Hence, economic theory, a sound knowledge of previous research and also information about the institutional settings should guide the researcher in specifying the model (see, e.g. Smith & Todd, 2005a; Sianesi, 2004). Both economic theory and previous empirical findings highlight the importance of socio-demographic and qualificational variables. Regarding the first category we can use variables such as age, marital status, number of children, nationality (German or foreigner) and health
restrictions. The second class (qualification variables) refers to the human capital of the individual, which is also a crucially important determinant of labour market prospects. The attributes available are occupational training, occupational group, occupational rank and the work experience of the individual.19 Furthermore, as pointed out by Heckman and Smith (1999), unemployment dynamics and the labour market history play a major role in driving outcomes and programme participation. Hence, we use career variables describing the individual’s labour market history such as the duration of the last employment, the duration of unemployment at the end of January 2000, the number of (unsuccessful) placement propositions,20 the last contact with the job centre21 and the knowledge of whether the individual plans to take part in vocational rehabilitation or has already participated in a programme before. Heckman, Ichimura, Smith, and Todd (1998b) additionally emphasise the importance of drawing treated and comparison people from the same local labour market and giving them the same questionnaire. Since we use administrative data from the same sources for participants and non-participants, the latter point is given in our analysis. To account for the local labour market situation, we use the regional context variables described above. Finally, the institutional structure and the selection process into JCS provide some further guidance in selecting the relevant variables. As we have seen from the discussion in Section 2.1, JCS are in general open to all job seekers who meet the eligibility criteria. However, what should have become clear is that programme participation depends on the individual’s need for assistance as evaluated by the responsible caseworker. Caseworkers assess the individual’s need for assistance based on the set of sociodemographic, qualificational and career variables used in our analysis. 
In addition, we are able to use the caseworkers’ subjective assessments of the individuals’ placement restrictions as well. Based on the overall informative mass of our dataset, we argue henceforth that the CIA holds in our setting. We test the sensitivity of our estimates to this assumption using a bounding analysis in Section 5.3. Choosing a proper comparison group is the next thing to do. Although participation in ALMP programmes is not mandatory in Germany, the majority of unemployed persons join a programme after some time. Thus, comparing participants to individuals who will never participate is inadequate, since it can be assumed that the latter group is particularly selective.22 Sianesi (2004) discusses this problem for Sweden and argues that these persons are the ones who do not enter a programme because they have already found a job. Therefore, we restrict our comparison group to those
who are unemployed at the end of January 2000 and do not participate in February 2000 (but may possibly join a programme later). Tables A.2 (West Germany) and A.3 (East Germany) in the supplementary appendix provide information on the labour market destinations during the observation period of the comparison group. It becomes obvious that during the observation period only a minor fraction of these individuals participate in (any) ALMP programme (about 4.5% (3.4%) of the male (female) nonparticipants in West Germany and about 8.3% (7.8%) in East Germany in December 2002). Hence, this should not distort our results. The ratio of participants to non-participants in February 2000 in our data is approximately 1:20. After having defined the comparison group and decided about the set of variables to be included in the specification, one has to choose an adequate model. Little advice is available regarding which functional form to use (see, e.g. Smith, 1997). In principle, any discrete choice model can be used. We use a logit model because the logit distribution has more density mass in the tails (in comparison to the probit model) and so reflects our situation better. Table 2 contains the results of the propensity score estimation. Looking at Table 2 clarifies that the influence of variables on the participation probability differs by regions and gender and highlights the appropriateness of the separate estimation of the propensity scores for men and women in West and East Germany. We will use the estimated propensity scores in the following to implement the matching estimator.
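The propensity score step can be sketched as follows. This is a minimal illustration only: it fits a logit with an intercept and a single covariate by Newton-Raphson on simulated data, whereas the chapter estimates separate models with many covariates for each region and gender. The function names, the simulated covariate and the parameter values are all assumptions.

```python
import math
import random


def propensity(a, b, x):
    """Logit participation probability Pr(D = 1 | x) for intercept a and slope b."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def fit_logit(x, d, iterations=25):
    """Maximum likelihood logit with one covariate, fitted by Newton-Raphson.

    Restricting the model to two parameters keeps the 2x2 inversion explicit.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, di in zip(x, d):
            p = propensity(a, b, xi)
            w = p * (1.0 - p)
            g_a += di - p           # score with respect to the intercept
            g_b += (di - p) * xi    # score with respect to the slope
            h_aa += w               # entries of the negative Hessian
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def simulate(n=5000, a=-3.0, b=1.5, seed=1):
    """Draw a hypothetical covariate and participation indicator from a logit model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = [1 if rng.random() < propensity(a, b, xi) else 0 for xi in x]
    return x, d


if __name__ == "__main__":
    x, d = simulate()
    a_hat, b_hat = fit_logit(x, d)
    print(f"intercept={a_hat:.3f}  slope={b_hat:.3f}")
```

A useful check on any logit with an intercept is that the mean of the fitted scores equals the sample participation share, which follows from the first-order condition for the intercept.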
4.2. Choosing the Matching Algorithm

The next choice to be made concerns the matching algorithm to be used. Several algorithms have been suggested in the literature. Good overviews can be found in Heckman et al. (1998a) or Smith and Todd (2005a). Clearly, all approaches should yield asymptotically the same results because, with growing sample size, all of them become closer to comparing only exact matches (Smith, 2000). However, in small samples the choice of the matching approach can be important (Heckman et al., 1997a). All matching estimators contrast the outcome of a treated individual with the outcomes of comparison group members. However, the estimators differ not only in the definition of the neighbourhood for each treated individual and the handling of the common support problem, but also with respect to the weights given to these neighbours. Usually a trade-off between bias and variance arises. First, one has to decide on how many non-treated individuals are to be
Table 2. Estimation Results (Coefficients) of the Logit Models for the Propensity Score.

| Variable | West Germany, Men: Coeff. | West Germany, Men: S.E. | West Germany, Women: Coeff. | West Germany, Women: S.E. | East Germany, Men: Coeff. | East Germany, Men: S.E. | East Germany, Women: Coeff. | East Germany, Women: S.E. |
|---|---|---|---|---|---|---|---|---|
| Constant | 1.1739 | 0.2731 | 3.1254 | 0.4533 | 5.7880 | 0.3659 | 8.0021 | 0.3944 |
| Socio-demographic variables | | | | | | | | |
| Age | 0.0599 | 0.0145 | 0.0067 | 0.0235 | 0.0901 | 0.0141 | 0.1702 | 0.0136 |
| Age² | 0.0004 | 0.0002 | 0.0003 | 0.0003 | 0.0008 | 0.0002 | 0.0019 | 0.0002 |
| Married | 0.1676 | 0.0612 | 0.4483 | 0.0761 | 0.2683 | 0.0506 | 0.1145 | 0.0344 |
| Number of children | 0.0653 | 0.0281 | 0.0183 | 0.0439 | 0.0335 | 0.0266 | 0.0238 | 0.0184 |
| German | 0.4402 | 0.0683 | 0.2825 | 0.1211 | 0.6284 | 0.1966 | 0.7082 | 0.2432 |
| Health restrictions | | | | | | | | |
| No health restrictions | Ref. | | Ref. | | Ref. | | Ref. | |
| Acc. DoD(a), 80% and over | 0.9160 | 0.1826 | 1.3404 | 0.2578 | 0.5491 | 0.2758 | 1.1375 | 0.2442 |
| Acc. DoD, 50% to under 80% | 0.8052 | 0.1267 | 0.6433 | 0.1978 | 0.4991 | 0.1270 | 0.6032 | 0.1242 |
| Acc. DoD, 30% to under 50% | 1.1190 | 0.3658 | 1.9871 | 0.4246 | 0.5691 | 0.1925 | 0.7999 | 0.1954 |
| Acc. DoD, 30% to under 50%, no equalis.(b) | 0.2757 | 0.1570 | 0.0651 | 0.2685 | 0.0708 | 0.1721 | 0.0725 | 0.1826 |
| Other health restrictions | 0.0472 | 0.0892 | 0.0751 | 0.1390 | 0.1918 | 0.0716 | 0.1422 | 0.0608 |
| Qualification variables | | | | | | | | |
| Occupational training | | | | | | | | |
| Without complete occup. training, no CSE(c) | Ref. | | Ref. | | Ref. | | Ref. | |
| Without complete occup. training, with CSE | 0.3364 | 0.0622 | 0.2294 | 0.1334 | 0.1015 | 0.0823 | 0.3428 | 0.0865 |
| Industrial training | 0.6738 | 0.0692 | 0.0808 | 0.1399 | 0.1777 | 0.0748 | 0.3315 | 0.0820 |
| Full-time vocational school | 0.7639 | 0.2685 | 0.0734 | 0.2432 | 0.3223 | 0.2594 | 0.8588 | 0.1384 |
| Technical school | 0.0987 | 0.1756 | 0.7183 | 0.1927 | 0.2227 | 0.1231 | 1.0166 | 0.0977 |
| Polytechnic | 0.3534 | 0.2009 | 1.4983 | 0.2144 | 0.0135 | 0.2058 | 1.0388 | 0.1794 |
| College, university | 0.2399 | 0.1577 | 1.0221 | 0.1869 | 0.0810 | 0.1354 | 0.9004 | 0.1272 |
Table 2. (Continued ) Variable
West Germany Men Coeff.
Career variables Duration of last employment (months) Duration of unemployment (in weeks) Up to 13 Between 13 and 52 More than 52 Number of placement propositions Last contact to job center (weeks) Rehabilitation attendant Placement restrictions
Women S.E.
0.2222 0.5605
Coeff.
Men S.E.
Coeff.
0.0092 0.7494
0.0927 0.4657 Ref. 0.5810 0.1544 0.3077 0.0544 0.1023 0.1533
0.2628
0.2501
0.1609 0.3167 0.3933
0.5499 0.0163 0.0877 0.0112 0.3397
Ref. 0.0982 0.1152 0.1536 0.0563 0.0745
0.0046
0.2055 0.3087 0.0494 0.0013 0.1533 0.3396
Women S.E.
Coeff.
S.E.
0.2370
0.0670
Ref. 0.2605 0.0995 0.2628
0.0828 0.5154 Ref. 0.1954 0.0999 0.1739 0.0478 1.1891 0.2170
0.2149 0.0127 1.2092
Ref. 0.0819 0.0406 0.2860
0.1637 0.1490 0.5131 0.1512 0.3139
Ref. 0.1944 0.1256 0.1624 0.1054 0.1017
0.1811 0.1809 0.2838 0.0345 0.2279
Ref. 0.0597 0.1067 0.1662 0.0528 0.0695
0.0657 0.2197 0.0404 0.1004 0.1175
Ref. 0.0525 0.0605 0.1215 0.0437 0.0527
0.0005
0.0033
0.0007
0.0038
0.0004
0.0028
0.0003
Ref. 0.0616 0.0678 0.0028 0.0125 0.1185 0.0989
0.0698 0.0888 0.0530 0.0520 0.0696 0.2654
Ref. 0.0889 0.0974 0.0042 0.0177 0.2039 0.1546
0.4673 0.4498 0.0610 0.1204 0.2958 0.3164
Ref. 0.0561 0.0599 0.0030 0.0114 0.0939 0.0870
0.2509 0.1694 0.0919 0.0644 0.1535 0.3000
Ref. 0.0511 0.0509 0.0031 0.0085 0.1024 0.0825
Occupational group Plant cultivation, breeding, fishery Mining, mineral extraction Manufacturing Technical occupations Service occupations Other occupations Occupational rank Unskilled worker Skilled worker White-collar worker, simple occupations White-collar worker, advanced occupations Other Qualification (with work experience)
East Germany
Regional context variablesd Cluster Ia Cluster Ib Cluster Ic Cluster II Cluster III Cluster IV Cluster V
0.2292 0.6479 0.4764 2.1463 0.0929
Ref. 0.0801 0.2286 1.0285 0.0777 0.2706
0.5301 0.4613 2.6387 3.0671 0.9368
Ref. 0.1043 0.4466 0.5245 0.1141 0.3406
0.4830 0.6545 1.1431 1.7272 0.4232 0.1040 0.3077 0.2838
0.2225 0.1841 0.0080
0.0730 0.0722 0.1002 Ref.
0.5666 0.4601 0.4530
0.0960 0.0917 0.1423 Ref.
Ref. 0.0628 0.0893 0.4289 0.0546 0.2273 0.1291 0.1248 0.1361 Ref.
Note: Bold letters indicate significance at the 1% level. Italic letters refer to the 5% level. a DoD: Degree of disability on a scale from 0 to 100%. b People with accepted degree of disability, but no equalisation to other persons with the same degree of disability. c CSE: Certificate for secondary education. d Cluster according to the FEA classification as described in Blien et al. (2004).
0.5263 0.5634 0.3364 1.5382 0.3780 0.1421 0.0242 0.1841
Ref. 0.0422 0.0746 0.5250 0.0418 0.2720 0.1238 0.1210 0.1292 Ref.
Programme before unemployment No further education or programme Further education complete, cont. education Further education complete, voc. adjustment Job-preparative measure Job-creation scheme Rehabilitation measure
matched to a single treated individual. Single nearest-neighbour (NN) matching only uses the participant and her closest neighbour. Therefore, it minimises the bias but might also involve an efficiency loss, since potentially a large number of close neighbours are disregarded. Clearly, NN matching faces the risk of bad matches if the closest neighbour is far away. This can be avoided by using caliper matching, i.e. imposing a tolerance on the maximum distance in the propensity scores allowed. Kernel-based matching, on the other hand, uses weighted averages of (nearly) all – depending on the choice of the kernel function – individuals in the comparison group to construct the counterfactual outcome for the treated individuals, thereby reducing the variance but possibly increasing the bias. Finally, using the same non-treated individual more than once (as in kernel-based matching or NN matching with replacement) can possibly improve the match quality, but increases the variance.23 Frölich (2004) uses a Monte Carlo study to analyse the finite-sample properties of matching estimators. It turns out that ridge matching (some special form of kernel matching) is superior to all other estimators, followed by kernel matching. k-NN matching, where k is the number of neighbours used, was not very successful in his design. However, since our sample sizes are much larger than those in the Monte Carlo study of Frölich (2004), the pernicious effects of NN matching are probably mitigated. Furthermore, kernel matching is not feasible for our estimation because the computing time is too high. However, to see if the inclusion of more comparison units for the construction of the counterfactual outcome has an influence on the estimated effects, we also use ‘multiple NN’ methods.
This form of matching trades reduced variance, resulting from using more information to construct the counterfactual for each participant, with increased bias that results from poorer matches on average (Smith & Todd, 2005a).24 This brief discussion makes clear that several alternatives emerge when using matching estimators. Black and Smith (2004) show how cross-validation methods can be used to choose between different matching methods. The basic idea of leave-one-out validation is to drop the $j$th observation in the comparison group and use the remaining observations in the comparison group to form an estimate of $Y_j^0$, denoted by $\hat{Y}^0_{-j,j}$. The associated forecast error is then given by $A_{-j,j} = Y_j^0 - \hat{Y}^0_{-j,j}$. Repeating the process for the remaining observations allows comparisons of the mean squared error (MSE) or root MSE of the forecasts associated with different matching estimators – or bandwidths when selecting a bandwidth – to guide the choice of estimator or bandwidth (Smith & Todd, 2005a). Based on our dataset, which contains nearly 220,000 non-participants, the computing
costs of cross validation are too high. Hence, we use a more pragmatic approach and try a number of approaches and test the sensitivity of the results with respect to the algorithm choice. If they give similar results, the choice may be unimportant. On the other hand, if the results differ, further investigation may be needed in order to reveal more about the source of the disparity (Bryson, Dorsett, & Purdon, 2002). We implement 11 matching algorithms, including NN matching without replacement (without caliper and with calipers of 0.01, 0.02 and 0.05) and NN matching with replacement with the same calipers. To see if the estimates differ when more neighbours are included, we additionally implement k-NN matching with 2, 5 and 10 NNs.25 Table 3 contains the results for the main groups for the last month of the observation period. In a recent paper Abadie and Imbens (2006) point out that bootstrap methods are invalid for NN matching.26 Since we cannot use kernel matching methods here – for which Heckman, Ichimura, and Todd (1998b) provide a rigorous distribution theory – but still want to take account of the variance component due to the estimation of the propensity score, we stick to the bootstrapping method (with 400 replications). The estimates illustrate two points: first of all, the results are not sensitive to the chosen matching algorithm. For men in West Germany, the effects are statistically insignificant and centred around zero. For men in East Germany, the statistically significant effects vary between 2.37%-points (5-NN matching) and 2.94%-points (NN with replacement). This means that the employment rate of men in East Germany, who started their JCS in February 2000, is on average between 2.37%-points and 2.94%-points lower when compared to matched non-participants in December 2002. We will give an extensive interpretation of the results in Section 5 and restrict the discussion here to sensitivity issues. 
The statistically significant effects for females in East Germany vary between 1.06%-points (10 NN) and 1.93%-points (NN with replacement). The only group for whom a somewhat higher variation in the effects is detected is that of women in West Germany, where the lowest estimated effect is 4.66%-points (2-NN matching) and the highest estimated effect is 6.1%-points (10 NN). The second point to note is that the standard errors are (as expected) in general lower for the multiple NN algorithms, even though the differences here are not very pronounced. Hence, the choice of the matching algorithm does not seem to be a critical issue in our case. The results show that the estimates are not sensitive to the algorithm choice and that the improvement that comes from multiple NN methods in terms of reduced variance is very limited. Therefore, we decide to use NN matching for the rest of our analysis. Since
Table 3. The Effects in the Main Groups for Different Matching Algorithms.

| Matching algorithm | Men: Effect | Men: S.E. | Men: Obs.(a) | Women: Effect | Women: S.E. | Women: Obs.(a) |
|---|---|---|---|---|---|---|
| West Germany | | | | | | |
| NN without replacement | 0.0005 | 0.0123 | 2,138 | 0.0570 | 0.0195 | 1,052 |
| NN without replacement, caliper 0.01 | 0.0014 | 0.0127 | 2,123 | 0.0500 | 0.0200 | 1,019 |
| NN without replacement, caliper 0.02 | 0.0014 | 0.0124 | 2,128 | 0.0498 | 0.0200 | 1,024 |
| NN without replacement, caliper 0.05 | 0.0009 | 0.0130 | 2,138 | 0.0545 | 0.0195 | 1,027 |
| NN with replacement | 0.0051 | 0.0145 | 2,138 | 0.0504 | 0.0237 | 1,052 |
| NN with replacement, caliper 0.01 | 0.0042 | 0.0161 | 2,123 | 0.0504 | 0.0228 | 1,052 |
| NN with replacement, caliper 0.02 | 0.0051 | 0.0142 | 2,128 | 0.0504 | 0.0231 | 1,052 |
| NN with replacement, caliper 0.05 | 0.0051 | 0.0144 | 2,138 | 0.0504 | 0.0233 | 1,052 |
| Multiple NN: 2NN | 0.0014 | 0.0140 | 2,138 | 0.0466 | 0.0214 | 1,052 |
| Multiple NN: 5NN | 0.0002 | 0.0118 | 2,138 | 0.0529 | 0.0178 | 1,052 |
| Multiple NN: 10 NN | 0.0013 | 0.0111 | 2,138 | 0.0610 | 0.0175 | 1,052 |
| East Germany | | | | | | |
| NN without replacement | 0.0291 | 0.0084 | 2,924 | 0.0135 | 0.0065 | 5,033 |
| NN without replacement, caliper 0.01 | 0.0291 | 0.0085 | 2,918 | 0.0137 | 0.0068 | 5,026 |
| NN without replacement, caliper 0.02 | 0.0291 | 0.0081 | 2,932 | 0.0135 | 0.0064 | 5,027 |
| NN without replacement, caliper 0.05 | 0.0291 | 0.0080 | 2,924 | 0.0135 | 0.0068 | 5,030 |
| NN with replacement | 0.0294 | 0.0091 | 2,924 | 0.0193 | 0.0078 | 5,033 |
| NN with replacement, caliper 0.01 | 0.0294 | 0.0098 | 2,924 | 0.0191 | 0.0074 | 5,030 |
| NN with replacement, caliper 0.02 | 0.0294 | 0.0104 | 2,924 | 0.0193 | 0.0078 | 5,031 |
| NN with replacement, caliper 0.05 | 0.0294 | 0.0098 | 2,924 | 0.0193 | 0.0077 | 5,033 |
| Multiple NN: 2NN | 0.0250 | 0.0080 | 2,924 | 0.0128 | 0.0072 | 5,033 |
| Multiple NN: 5NN | 0.0237 | 0.0079 | 2,924 | 0.0101 | 0.0061 | 5,033 |
| Multiple NN: 10 NN | 0.0249 | 0.0065 | 2,924 | 0.0106 | 0.0057 | 5,033 |

Note: Bold letters indicate significance at the 1% level. Italic letters refer to the 5% level. All standard errors are bootstrapped with 400 replications. NN matching without replacement uses each non-participant only once, whereas with NN matching with replacement each non-participant can be used repeatedly. Caliper defines the maximal allowed difference in the propensity score of participants and matched non-participants. Matching is implemented with the Stata module PSMATCH2 by Leuven and Sianesi (2003).
(a) Obs. is the number of participants after matching.
we have a very large sample of non-participants, the probability of finding good matches without using replacement is quite high. To avoid an unnecessary inflation of the variance, we match without replacement. Finally, to ensure good matching quality, we implement a caliper of 0.02.27
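The chosen algorithm, single NN matching without replacement and with a caliper of 0.02, can be sketched as follows. This is a simplified greedy version written for illustration and is not the PSMATCH2 implementation used for the estimates. In particular, the order in which treated individuals are processed is an assumption, and the function names are hypothetical.

```python
def nn_match_without_replacement(ps_treated, ps_controls, caliper=0.02):
    """Greedy 1-NN matching on the propensity score without replacement.

    Returns (treated_index, control_index) pairs. A treated unit stays
    unmatched when its nearest unused control lies outside the caliper.
    """
    available = set(range(len(ps_controls)))
    pairs = []
    for i, p in enumerate(ps_treated):
        # Nearest still-unused comparison unit, or None if none are left.
        best = min(available, key=lambda j: abs(ps_controls[j] - p), default=None)
        if best is not None and abs(ps_controls[best] - p) <= caliper:
            pairs.append((i, best))
            available.remove(best)  # each non-participant is used only once
    return pairs


def att_from_pairs(pairs, y_treated, y_controls):
    """Mean outcome difference between treated units and their matches."""
    return sum(y_treated[i] - y_controls[j] for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    ps_t = [0.30, 0.31, 0.90]   # hypothetical propensity scores, participants
    ps_c = [0.302, 0.295, 0.50]  # hypothetical scores, non-participants
    pairs = nn_match_without_replacement(ps_t, ps_c)
    print(pairs, att_from_pairs(pairs, [1, 0, 1], [0, 0, 1]))
```

The linear scan over the remaining comparison units is quadratic in the sample sizes. With roughly 220,000 non-participants a sorted index would be the natural refinement, but the scan keeps the logic easy to follow.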
4.3. Common Support

Before assessing the matching quality, it is important to check the region of common support for participants and non-participants. The most straightforward way is a visual analysis of the density of the propensity score in both groups. Fig. 1 contains the results for the main groups, whereas Figs. A.1–A.4 in the supplementary appendix refer to the subgroups. It is a common finding that the distribution for non-participants is heavily concentrated at low propensity scores. Problems arise when the distributions for the two groups do not overlap. A good example is short-term unemployed men in West Germany, among whom a large number of observations in the treatment
Fig. 1. Propensity Score Distributions in the Main Groups. The Left Side of the Graphs Refers to Non-participants (D = 0), the Right Side to Participants (D = 1) in Each Group. [Four panels: West Germany, Men Total and Women Total; East Germany, Men Total and Women Total. Horizontal axis: Propensity Score; vertical axis: Density.]
group have propensity scores over 0.5 while almost none of the comparison individuals can be found in this region. There are several ways of imposing the common support condition, e.g. by ‘minima and maxima comparison’ or ‘trimming’ (see Caliendo & Kopeinig, 2008 for an overview). We impose the ‘minima and maxima condition’ and additionally implement NN matching with a caliper of 0.02. The idea of minima–maxima comparison is to delete all treated observations whose propensity score is smaller than the minimum or higher than the maximum in the comparison group. Treated individuals who fall outside the common support region have to be disregarded and for these individuals the treatment effect cannot be estimated. Bryson et al. (2002) note that if the proportion of lost individuals is small, this poses only few problems. However, if the number is too large, there may be concerns whether the estimated effect on the remaining individuals can be viewed as representative.28 Table 4 contains the number of treated individuals lost in each of the subgroups. It can be seen that the number of lost individuals is fairly low for three of the main groups. For men in West Germany we lose 0.79% of the observations, for men (0.03%) and women (0.16%) in East Germany the proportion is even smaller. However, for women in West Germany we cannot find similar non-participants for around 6.84% of the treated population and have to discard these individuals. The overlap between participating and non-participating women in West Germany is fairly limited in the subgroups too, and we lose up to 20.95% of the treated population (for women with placement restrictions). Hence, the estimated effects for these groups should be interpreted carefully. For the other subgroups the share of lost individuals is acceptable. However, two subgroups are problematic for both genders and regions. The first are short-term unemployed persons (fewer than 13 weeks unemployed).
For this group we lose 21.15% of the participating men in West Germany, and 19.20% of men and 25.04% of women in East Germany. This means that we are not able to find short-term unemployed individuals in the comparison group who have similar propensity scores to the treated individuals. The second subgroup consists of individuals with a high degree, where the share of individuals lost is 6.85% (14.14%) for men (women) in East Germany and 14.29% (17.81%) for men (women) in West Germany. Overall, we note that the share of lost individuals is rather small in East Germany, higher for men in West Germany and highest for women in West Germany.
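The minima–maxima comparison and the ‘Lost in %’ figures reported in Table 4 can be sketched as follows. The function names are hypothetical and the propensity scores in the usage example are made up. Only the before and after counts in the last lines are taken from Table 4, where the losses also reflect the caliper of 0.02.

```python
def common_support(ps_treated, ps_controls):
    """Minima-maxima comparison for the treated group.

    Keeps treated units whose propensity score lies between the minimum and
    the maximum score observed in the comparison group. Returns the kept
    indices and the share of treated units lost, in percent.
    """
    lo, hi = min(ps_controls), max(ps_controls)
    kept = [i for i, p in enumerate(ps_treated) if lo <= p <= hi]
    lost_pct = 100.0 * (len(ps_treated) - len(kept)) / len(ps_treated)
    return kept, lost_pct


def share_lost(before, after):
    """Percentage of treated individuals lost between two sample counts."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    kept, lost = common_support([0.05, 0.2, 0.6, 0.9], [0.1, 0.3, 0.7])
    print(kept, lost)
    # Counts for men and women in West Germany as reported in Table 4.
    print(round(share_lost(2140, 2123), 2), round(share_lost(1052, 980), 2))
```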
Table 4. Number of Treated Individuals Lost due to Common Support Requirement.

| Group | Men: Before Matching | Men: After Matching | Men: Lost in % | Women: Before Matching | Women: After Matching | Women: Lost in % |
|---|---|---|---|---|---|---|
| West Germany | | | | | | |
| Total | 2,140 | 2,123 | 0.79 | 1,052 | 980 | 6.84 |
| Age (in years): <25 | 458 | 434 | 5.24 | 182 | 162 | 10.99 |
| Age (in years): 25–50 | 1,337 | 1,328 | 0.67 | 709 | 663 | 6.49 |
| Age (in years): >50 | 345 | 344 | 0.29 | 161 | 150 | 6.83 |
| Duration of unemployment (in weeks): <13 | 558 | 440 | 21.15 | 237 | 189 | 20.25 |
| Duration of unemployment (in weeks): 13–52 | 744 | 720 | 3.23 | 403 | 365 | 9.43 |
| Duration of unemployment (in weeks): >52 | 838 | 835 | 0.36 | 412 | 400 | 2.91 |
| Without occupational experience | 273 | 247 | 9.52 | 159 | 128 | 19.50 |
| Without occupational training | 1,340 | 1,296 | 3.28 | 476 | 447 | 6.09 |
| With high degree | 112 | 96 | 14.29 | 146 | 120 | 17.81 |
| Rehabilitation attendant | 111 | 100 | 9.91 | 44 | 35 | 20.45 |
| With placement restrictions | 354 | 326 | 7.91 | 148 | 117 | 20.95 |
| East Germany | | | | | | |
| Total | 2,924 | 2,923 | 0.03 | 5,035 | 5,027 | 0.16 |
| Age (in years): <25 | 240 | 229 | 4.58 | 148 | 144 | 2.70 |
| Age (in years): 25–50 | 1,571 | 1,570 | 0.06 | 3,342 | 3,335 | 0.21 |
| Age (in years): >50 | 1,113 | 1,074 | 3.50 | 1,545 | 1,481 | 4.14 |
| Duration of unemployment (in weeks): <13 | 578 | 467 | 19.20 | 575 | 431 | 25.04 |
| Duration of unemployment (in weeks): 13–52 | 1,248 | 1,230 | 1.44 | 1,970 | 1,963 | 0.36 |
| Duration of unemployment (in weeks): >52 | 1,098 | 1,098 | 0.00 | 2,490 | 2,490 | 0.00 |
| Without occupational experience | 293 | 289 | 1.37 | 498 | 489 | 1.81 |
| Without occupational training | 837 | 835 | 0.24 | 1,121 | 1,116 | 0.45 |
| With high degree | 146 | 136 | 6.85 | 191 | 164 | 14.14 |
| Rehabilitation attendant | 218 | 215 | 1.38 | 156 | 148 | 5.13 |
| With placement restrictions | 394 | 371 | 5.84 | 376 | 362 | 3.72 |

Note: We used the minima–maxima restriction to impose the common support condition. Results refer to NN matching without replacement and a caliper of 0.02.
4.4. Matching Quality

4.4.1. Matching Quality for the Main Groups

Since we are not using an exact matching procedure here, we have to check the ability of the matching algorithm to balance the relevant covariates in both groups. One suitable measure to assess the distance in the marginal distributions of the X variables is the standardised difference (SD) suggested by Rosenbaum and Rubin (1985). For each covariate X, it is defined as the difference of the sample means in the treated and (matched) comparison subsamples as a percentage of the square root of the average of the sample variances in the treated and non-treated group. This is a common approach used in many evaluation studies, e.g. by Sianesi (2004). We calculate the absolute difference between the respective participating and non-participating groups both before and after matching. To abbreviate the documentation, we calculated the means of the SD before and after matching for the four main groups (Table 5) as an unweighted average of all variables (mean standardised difference, MSD).

The overall difference before matching lies between 10.8% for women in East Germany and 16.1% for women in West Germany. A significant reduction can be achieved for all groups so that the bias after matching is 2.5% (3.2%) for men (women) in West Germany and 1.8% (1.6%) for men (women) in East Germany. Clearly, this is an enormous reduction and shows that the matching procedure is able to balance the characteristics in the treatment and the matched comparison groups. What has to be kept in mind is that solely looking at the MSD might be misleading, if huge differences remain in single covariates. Table A.4 in the supplementary appendix contains the SD for each covariate and shows that this is not the case here, because the SD for nearly all of the covariates lies well below 5%.29

Additionally, Sianesi (2004) suggests reestimating the propensity score on the matched sample (i.e. on the participants and matched non-participants) and comparing the pseudo-R2s before and after matching. After matching there should be no systematic differences in the distribution of the covariates between the two groups. Therefore, the pseudo-R2 after matching should be fairly low. As the results from Table 5 show, this is true for our estimation. The results of the F-tests (with degrees of freedom in brackets) point in the same direction, indicating a joint significance of all regressors before, but not after, matching.30

4.4.2. Matching Quality for the Subgroups

Now that we have shown that the matching procedure is able to balance the distribution of the covariates between treated and comparison individuals in the main groups, we have to test this for the subgroups too.
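The standardised difference and its unweighted mean over covariates can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, and the denominator follows the definition used in the notes to Table 5, i.e. it is based on the sample variances of the full treated and non-treated samples both before and after matching.

```python
from math import sqrt
from statistics import mean, variance


def standardised_difference(treated, comparison, treated_all=None, untreated_all=None):
    """Standardised difference (in %) of one covariate.

    The numerator compares the means of `treated` and `comparison`.
    The denominator uses the sample variances of the full (unmatched)
    treated and non-treated samples when these are supplied.
    """
    v1 = variance(treated_all if treated_all is not None else treated)
    v0 = variance(untreated_all if untreated_all is not None else comparison)
    return 100.0 * (mean(treated) - mean(comparison)) / sqrt((v1 + v0) / 2.0)


def mean_standardised_difference(sds):
    """Unweighted mean of the absolute standardised differences (MSD)."""
    return mean([abs(sd) for sd in sds])


# Hypothetical toy data for a single covariate (not taken from the study).
treated = [2, 3, 4, 5, 6]
untreated = [2, 3, 4, 5, 6, 7, 8, 9]
matched = [2, 3, 4, 5, 7]  # matched comparison units drawn from `untreated`

sd_before = standardised_difference(treated, untreated)
sd_after = standardised_difference(treated, matched, treated, untreated)
```

In this toy example the absolute standardised difference falls from roughly 73% before matching to roughly 10% after matching; with several covariates, the per-covariate values would be passed to `mean_standardised_difference`.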
Table 5. Mean Standardised Difference and Some Quality Indicators.

Columns: Before = before matching; P1, P2 = after matching with propensity score specification P1 or P2.

West Germany
                                      Men                           Women
                                      Before    P1        P2        Before    P1        P2
Main groups
  Mean standardised difference (a)    14.62     2.51      -         16.08     3.07      -
  Pseudo-R2                           0.1389    0.006     -         0.1775    0.009     -
  F-test (b)                          2,460.8   38.0      -         1,679.4   23.4      -
Mean standardised difference in subgroups (a)
  Age (in years)
    <25                               10.48     11.18     3.18      12.50     12.62     6.55
    25–50                             11.30     6.30      2.79      15.56     6.63      3.31
    >50                               17.82     13.51     5.77      20.48     13.94     6.12
  Duration of unemployment (in weeks)
    <13                               15.71     9.71      3.55      16.18     12.61     4.61
    13–52                             12.79     5.88      3.88      16.10     8.02      4.51
    >52                               17.77     5.89      2.86      19.13     8.01      4.04
  Without occupational experience     14.02     9.33      5.39      15.93     10.13     4.15
  Without occupational training       14.31     3.88      3.03      16.79     4.72      3.98
  With high degree (c)                17.18     9.71      7.31      14.50     9.64      5.30
  Rehabilitation attendant            18.13     12.44     8.63      23.96     17.04     13.83
  With placement restrictions         19.29     9.75      4.13      26.99     11.69     4.82

East Germany
                                      Men                           Women
                                      Before    P1        P2        Before    P1        P2
Main groups
  Mean standardised difference (a)    12.01     1.79      -         10.83     1.57      -
  Pseudo-R2                           0.1225    0.004     -         0.1144    0.003     -
  F-test (b)                          2,951.3   35.3      -         4,323.3   39.2      -
Mean standardised difference in subgroups (a)
  Age (in years)
    <25                               14.74     12.99     5.00      13.73     14.89     9.61
    25–50                             11.91     4.17      2.47      9.84      2.46      1.38
    >50                               16.79     10.40     2.37      14.98     7.06      1.55
  Duration of unemployment (in weeks)
    <13                               19.05     11.71     5.39      13.78     10.75     4.20
    13–52                             12.43     3.30      2.38      11.74     3.79      1.54
    >52                               13.55     7.72      1.82      11.61     3.67      1.63
  Without occupational experience     12.10     10.67     4.45      12.17     5.64      3.36
  Without occupational training       11.17     5.49      2.50      11.04     4.14      2.63
  With high degree (c)                18.14     12.14     5.25      15.04     10.88     5.50
  Rehabilitation attendant            12.88     11.70     4.52      15.87     10.69     5.64
  With placement restrictions         15.35     8.35      4.31      18.37     7.00      3.17

Note: P1 refers to the ‘overall’ propensity score estimation, whereas P2 refers to the ‘group-specific’ propensity score estimation.
(a) Mean standardised difference has been calculated as an unweighted average of all covariates. Standardised difference before matching calculated as $100 \cdot (\bar{X}_1 - \bar{X}_0) / \sqrt{(V_1(X) + V_0(X))/2}$ and standardised difference after matching calculated as $100 \cdot (\bar{X}_{1M} - \bar{X}_{0M}) / \sqrt{(V_1(X) + V_0(X))/2}$.
(b) Degrees of freedom for the F-test: 41 for men and 40 for women.
(c) Persons with a college or university certificate/diploma.
Table 5 contains the results for the 11 subgroups. The first column refers to the MSD before matching, and the second column to the MSD after matching, when matching is done with the estimated ‘overall’ propensity score as shown in Table 2. This propensity score specification, which we label P1, has been estimated separately for the four main groups. However, it is very clear that the matching procedure based on the overall scores is not able to balance the covariates between treated and matched non-treated individuals in the subgroups. For example, the MSD after matching for men in West Germany reaches a level of 12.44% for rehabilitation attendants. Even though this is a reduction compared to the MSD before matching, it is not acceptable. For women in West Germany, the MSD after matching in this group is 17.04%. In East Germany, the MSD after matching is not much lower, reaching levels of 12.99% for young men and 14.89% for young women. Even though there are some subgroups for which the MSD is acceptable, the overall matching quality in the subgroups is not. Hence, alternative strategies are called for. One option is to respecify the propensity score estimation. Whereas the ‘overall’ propensity score estimation has only been done separately for men and women in West and East Germany, we also estimate ‘group-specific’ propensity scores. The basic idea is to capture the varying influence of the variables for certain subgroups more accurately. Since we have 11 subgroups for each gender and region, we are left with 44 propensity score estimations.31 Based on these estimations, labelled P2, we rerun the matching procedure and estimate the MSD once again. It can be seen that the MSD is now clearly lower not only compared to the situation before matching, but also compared to the situation when matching on the ‘overall’ score.
This result shows that the ‘overall’ score specification is not fine enough to balance the relevant characteristics between participants and non-participants in the subgroups. Hence, we will use the ‘group-specific’ propensity scores for all further analyses in the subgroups.
5. EMPIRICAL RESULTS

An important decision, which has to be made in every evaluation, is when to measure the programme effects. The empirical analysis should ensure that participants and non-participants are compared in the same economic environment and the same life-cycle position. The literature is dominated by two approaches, comparing individuals either from the start or after the end of programmes. The latter approach is problematic for two reasons. First, since it implies comparison of participants and non-participants in the month(s) after the programmes end, very different economic situations may be compared if exits are spread over a longer time period. Second, this approach entails an endogeneity problem of programme exits (Gerfin & Lechner, 2002). A different approach, which is predominant in the recent literature (see, e.g. Sianesi, 2004; Gerfin & Lechner, 2002) and which is also used here, measures the effects from the start of the programmes. By doing so, the policy-relevant question of whether the placement officer should place an unemployed individual in a JCS in February 2000 or not can be answered.

What should be kept in mind is the possible occurrence of locking-in effects for the group of participants. The net effect of a programme consists of two opposite effects. First, the employment probability of the participants is expected to rise due to assumed positive aspects of the programme. Second, because participants who are involved in the programmes do not have the same amount of time to look for new jobs as non-participants, a reduced search intensity during programmes is expected. As it is not possible to disentangle the two effects without further assumptions, locking-in effects should be seen as a constituent part of the overall programme effect (Sianesi, 2001).32 When interpreting the results both underlying effects have to be considered. As to the fall in the search intensity, we should expect an initial negative locking-in effect from participation in most programmes.33

To assess the possible magnitude of this initial effect, it is helpful to look at the programme exit rates in each group.34 The majority of participants leave the programmes after one year. That is true for about 50% of the participants in West Germany and about 68% (80%) of the male (female) participants in East Germany. By March 2001, around 80% (74%) of the male (female) participants in West Germany had left the programmes. The corresponding numbers are approximately 91% for men and 92% for women in East Germany. Since we observe the outcomes of the individuals for almost three years after starting the programmes, successful programmes should more than compensate for this initial fall.
5.1. Results for the Main Groups

The results from the beginning (February 2000) to the end (December 2002) of our observation period for the main and the subgroups are depicted in Figs. 2–5. Fig. 2 contains the results for men in West Germany. The solid line in each graph describes the monthly employment effect, i.e. the difference in the employment rates between participants and matched non-participants. The graphs for the main group are captioned ‘total’ in the figures.

[Fig. 2. Employment Effects for Men in West Germany (February 2000–December 2002). Solid Line Describes the Monthly Employment Effect. Dotted Lines are the Upper and Lower 95% Confidence Limits. Month 2 Refers to February 2000, Month 12=December 2000, Month 24=December 2001, Month 36=December 2002.]

[Fig. 3. Employment Effects for Women in West Germany (February 2000–December 2002). Solid Line Describes the Monthly Employment Effect. Dotted Lines are the Upper and Lower 95% Confidence Limits. Month 2 Refers to February 2000, Month 12=December 2000, Month 24=December 2001, Month 36=December 2002.]

[Fig. 4. Employment Effects for Men in East Germany (February 2000–December 2002). Solid Line Describes the Monthly Employment Effect. Dotted Lines are the Upper and Lower 95% Confidence Limits. Month 2 Refers to February 2000, Month 12=December 2000, Month 24=December 2001, Month 36=December 2002.]

[Fig. 5. Employment Effects for Women in East Germany (February 2000–December 2002). Solid Line Describes the Monthly Employment Effect. Dotted Lines are the Upper and Lower 95% Confidence Limits. Month 2 Refers to February 2000, Month 12=December 2000, Month 24=December 2001, Month 36=December 2002.]

[Panels in each of Figs. 2–5: Total; Age < 25 Years; Age 25-50 Years; Age > 50 Years; Duration of Unemployment < 13 Weeks; Duration of Unemployment 13-52 Weeks; Duration of Unemployment > 52 Weeks; Without Professional Experience; Without Professional Training; With High Degree; Rehabilitation Attendant; With Placement Restrictions. Horizontal axis: Month (2 to 36). Vertical axis: ATT (Employment), from -0.50 to 0.10.]

All of the graphs have one thing in common, namely a large drop in the effects for the first months after starting the programme. This can be interpreted as the expected locking-in effect, which is more pronounced for men (Fig. 2) and women (Fig. 3) in West Germany than for men (Fig. 4) and women (Fig. 5) in East Germany. Five months after the programmes started (in July 2000), the effects for men in West Germany lie around -21.1%-points. That means that the average employment rate of participating men is about 21%-points lower in comparison to matched non-participants. Clearly, this strong reduction is expected as nearly all participants are still in the programmes, whereas the non-participants have the chance to search, apply for and find a new job. When interpreting the results one has to bear in mind that although JCS are a kind of employment they are classified as failures when assessing the reintegration success into regular (unsubsidised) employment. For women in West Germany, the result is very similar in that month and amounts to around -20.4%-points.

The situation in East Germany is somewhat different. The effects here are -14.0%-points for men and -9.4%-points for women. Compared to the results for West Germany, this reflects the worse labour market situation with fewer employment opportunities. Being locked into the programme does not have as much influence, because the chances for non-participants of finding a new job are lower anyway. The development of the effects is quite different in the two regions as well. Whereas in West Germany a relatively steep increase in the employment effects can be found, the development in East Germany is much smoother. For example, in July 2001 the employment effect has risen to -12.5%-points for men and -11.9%-points for women in West Germany. Hence, the negative effects are nearly halved. In East Germany, however, the effects lie at around -10.9%-points for men and -7.5%-points for women.

Looking at the last month of our observation period (December 2002), we do not find a significant programme effect for men in West Germany. That is, the employment chances of participants and matched non-participants do not differ. However, for women in West Germany we find a significant positive effect of around 5%-points, which means that participating women have benefited from the programme in terms of employment chances. This positive result has to be treated with caution, because i) women in West Germany have the smallest sample size, ii) we have lost a considerable share of participants due to the common support requirement and iii) the estimates imply a confidence interval where the lower endpoint lies close to zero.

For East Germany, on the other hand, we find negative employment effects of 2.9%-points for men and 1.4%-points for women. This shows that the overall effect of JCS for the participating individuals is unsatisfactory. Only for one of the groups, women in West Germany, do we find a positive employment effect nearly three years after the programmes start, whereas for the other three main groups the effects are negative or insignificant. It seems that the pronounced initial negative (locking-in) effect cannot be overcome during our observation period. Judging by these numbers, JCS have to be rated as unsuccessful regarding their goal to reintegrate individuals into regular (unsubsidised) employment.
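The monthly employment effect plotted in Figs. 2–5 is a difference in employment rates between participants and their matched non-participants. The following sketch shows how such an effect and an approximate 95% interval could be computed for a single month from one-to-one matched pairs. It is an illustration only: the function name and the data are hypothetical, and the normal-approximation interval based on paired differences is an assumption, since this section does not state how the confidence limits in the figures were obtained.

```python
from math import sqrt
from statistics import mean, stdev


def monthly_att(treated_employed, matched_employed, z=1.96):
    """Return (att, lower, upper) for one month.

    Both arguments are 0/1 employment indicators; element k of
    `matched_employed` belongs to the comparison unit matched to
    participant k (one-to-one matching without replacement).
    """
    if len(treated_employed) != len(matched_employed):
        raise ValueError("one matched comparison outcome is needed per participant")
    diffs = [t - c for t, c in zip(treated_employed, matched_employed)]
    att = mean(diffs)                      # difference in employment rates
    se = stdev(diffs) / sqrt(len(diffs))   # standard error of the mean paired difference
    return att, att - z * se, att + z * se


# Hypothetical outcomes for ten matched pairs in one month (not from the study).
participants = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
comparisons = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
att, lower, upper = monthly_att(participants, comparisons)
```

Here 20% of the participants and 50% of the matched non-participants are employed, so the estimated effect is -0.30, i.e. -30%-points. Repeating the calculation for every month yields a curve of the kind shown in the figures.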
5.2. Results for the Selected Subgroups

Even though JCS do not work for the participants as a whole, they may work for subgroups. For instance, one might think that they are especially effective for the explicit target groups of JCS, such as long-term unemployed persons or persons without work experience. Figs. 2–5 contain the results for our selected subgroups. To abbreviate the discussion, we concentrate on two main points. First, we will examine the occurrence of locking-in effects and second, we will discuss the results at the end of the observation period (December 2002).

Considering locking-in effects is explicitly of interest, since it can be expected that these effects differ for the subgroups. Relevant examples are provided by the groups defined by age and unemployment duration. Older unemployed persons have in general fewer labour market opportunities than middle-aged or younger persons. Due to the worse ‘outside options’ of the non-participants, we expect to find weaker locking-in effects for older participants and stronger effects for the other groups. The figures support these expectations empirically, independently of gender and region. With regard to the previous unemployment duration, it can be assumed that reintegration into the labour market is generally easier for persons with only a short duration of unemployment (‘negative duration dependence’). Therefore, short-term unemployed non-participants are expected to have a higher probability of receiving a job offer and hence the locking-in effects should be larger, which is in fact the case. For the other subgroups the graphs present a similar picture. We find the initial fall of the employment effects in the first months after programmes start and rising tendencies after the majority of participants has left the programmes.
The second point we want to discuss is the effects for these subgroups at the end of our observation period (December 2002). For most of the groups we do not find significant programme effects at this point in time, i.e. the employment rates of participants and matched non-participants do not differ nearly three years after programmes have started. That implies that programmes have neither improved nor worsened the employment chances of the participating individuals. However, for some of the groups we find significant differences in the employment rates. Long-term unemployed (more than 52 weeks) men (4.9%-points) and women (11.4%-points) in West Germany benefit from participation. These results indicate that JCS were able to improve the employment chances for this target group. Additionally, highly qualified men in West Germany benefit from participation (12.1%-points), whereas for less qualified persons and individuals without work experience no significant effects can be established. This is intuitively difficult to understand because the programmes should be designed for persons who are most in need of assistance. Another group that benefits from participation is that of older women in West Germany, whose employment rate is 13.6%-points higher than that of matched non-participants. This is an encouraging result because older unemployed persons in particular have only poor opportunities to return to the first (unsubsidised) labour market. Although for most groups we do not find any enhancement of the employment chances after participation, the results indicate a tendency for the programmes to actually be useful only for the most disadvantaged in terms of unemployment duration and age. Considering the results for the subgroups in East Germany reveals a somewhat different picture. Focussing on the male groups, we only find a significant negative effect (9.3%-points) for participants with a short unemployment duration before the programme.
What has to be kept in mind is the relatively high share of individuals that we have lost due to the common support restriction. Hence, we are reluctant to emphasise the relevance of this effect. For all other groups no significant differences in the employment rates can be established. For women in East Germany the results are disappointing as well. Middle-aged (2.2%-points) as well as short-term unemployed women (7.1%-points) suffer from participation. Another group with clearly negative programme effects in December 2002 is highly qualified women (8.2%-points). However, there is also one group (long-term unemployed women) for whom we find a small (2.4%-points) positive programme effect. For the other groups no significant differences can be established. Thus, the hypothesis stated above that programmes are actually only likely to work for the legally defined target groups can only be supported for long-term unemployed women.
5.3. Sensitivity of the Results to Unobserved Heterogeneity

Checking the sensitivity of the estimated treatment effects has become an increasingly important topic in the applied evaluation literature (see Caliendo & Kopeinig, 2008 for a recent survey of different methods to do so). We have already mentioned one approach by Lechner (2001) which allows the researcher to check the sensitivity due to failure of the common support. In this section we are going to test the sensitivity of our results in a different direction, i.e. a situation where the CIA does not hold. We have outlined in Section 3 that the estimation of treatment effects with matching estimators is based on the conditional independence assumption. However, if there are unobserved variables which affect assignment into treatment and the outcome variable simultaneously, a ‘hidden bias’ might arise (Rosenbaum, 2002) to which matching estimators are not robust. Since it is impossible to estimate the magnitude of selection bias with nonexperimental data, we address this problem with the bounding approach proposed by Rosenbaum (2002). The basic question to be answered is whether or not inference about programme effects may be altered by unobserved factors. In other words, we want to determine how strongly an unmeasured variable must influence the selection process in order to undermine the implications of the matching analysis. Recent applications of this approach can be found in Aakvik (2001), DiPrete and Gangl (2004), and Hujer et al. (2004), whereas Ichino, Mealli, and Nannicini (2006) present a slight variation of this approach with an application to Italian data. We outline this approach briefly; an extensive discussion can be found in Rosenbaum (2002) and Aakvik (2001).

5.3.1. The Model

Let us assume that the participation probability is given by $P(x_i) = P(D_i = 1 \mid x_i) = F(\beta x_i + \gamma u_i)$, where $x_i$ are the observed characteristics for individual $i$, $u_i$ is the unobserved variable and $\gamma$ is the effect of $u_i$ on the participation decision. Clearly, if the study is free of hidden bias, $\gamma$ will be zero and the participation probability will solely be determined by $x_i$. However, if there is hidden bias, two individuals with the same observed covariates $x$ have differing chances of receiving treatment. Let us assume that we have a matched pair of individuals $i$ and $j$, and that $F$ is the logistic distribution. The odds that the two individuals receive treatment are then given by $P(x_i)/(1 - P(x_i))$ and $P(x_j)/(1 - P(x_j))$, and the odds ratio is given by

$$\frac{P(x_i)/(1 - P(x_i))}{P(x_j)/(1 - P(x_j))} = \frac{P(x_i)(1 - P(x_j))}{P(x_j)(1 - P(x_i))} = \frac{\exp(\beta x_i + \gamma u_i)}{\exp(\beta x_j + \gamma u_j)} = \exp[\gamma(u_i - u_j)] \tag{3}$$

If both units have identical observed covariates – as implied by the matching procedure – the $x$ vector is cancelled out. But still, the two individuals differ in their odds of receiving treatment by a factor that involves the parameter $\gamma$ and the difference in their unobserved covariates $u$. So, if there are either no differences in unobserved variables ($u_i = u_j$) or if unobserved variables have no influence on the probability of participating ($\gamma = 0$), the odds ratio is 1, implying the absence of hidden or unobserved selection bias. It is now the task of the sensitivity analysis to evaluate how inference about the programme effect is altered by changing the values of $\gamma$ and $(u_i - u_j)$. We follow Aakvik (2001) and assume for the sake of simplicity that the unobserved covariate is a dummy variable with $u_i \in \{0, 1\}$. Rosenbaum (2002) shows that Eq. (3) implies the following bounds on the odds ratio that either of the two matched individuals will receive treatment:

$$\frac{1}{e^{\gamma}} \le \frac{P(x_i)(1 - P(x_j))}{P(x_j)(1 - P(x_i))} \le e^{\gamma} \tag{4}$$
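Eqs. (3) and (4) can be checked numerically. The sketch below assumes a logistic F and a binary unobserved covariate, as in the text; the function names and the parameter values are hypothetical and serve only to illustrate that the odds ratio of a matched pair stays between 1/e^γ and e^γ.

```python
from math import exp, log


def logistic(z):
    """Logistic distribution function F(z)."""
    return 1.0 / (1.0 + exp(-z))


def odds(p):
    """Odds p / (1 - p) of receiving treatment."""
    return p / (1.0 - p)


def odds_ratio(beta_x, gamma, u_i, u_j):
    """Odds ratio of treatment for a matched pair with identical beta * x."""
    p_i = logistic(beta_x + gamma * u_i)
    p_j = logistic(beta_x + gamma * u_j)
    return odds(p_i) / odds(p_j)


# Hypothetical values: e^gamma = 2 and a common observed index beta * x = -0.4.
gamma = log(2.0)
beta_x = -0.4
ratios = [odds_ratio(beta_x, gamma, u_i, u_j) for u_i in (0, 1) for u_j in (0, 1)]
lower_bound, upper_bound = 1.0 / exp(gamma), exp(gamma)
```

With these values the four possible combinations of (u_i, u_j) give odds ratios of 1, 0.5, 2 and 1, so the bounds of Eq. (4) are attained exactly when the two matched individuals differ in u.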
Both matched individuals have the same probability of participating only if $e^{\gamma} = 1$. Otherwise, if for example $e^{\gamma} = 2$, individuals who appear to be similar (in terms of $x$) could differ in their odds of receiving the treatment by as much as a factor of 2. In this sense, $e^{\gamma}$ is a measure of the degree of departure from a study that is free of hidden bias (Rosenbaum, 2002).

5.3.2. The Test Statistic

Aakvik (2001) suggests using the Mantel and Haenszel (1959, MH) test statistic. The MH non-parametric test compares the successful – in our case employed – number of persons in the treatment group against the same expected number given the treatment effect is zero. As the test is based on random sampling, we first have to make treatment and comparison group as equal as possible which is basically done by our matching procedure. Aakvik (2001) notes that the MH test can be used to test for no treatment effect both within different strata of the sample and as a weighted average between strata. Let $r_{1s}$ be the number of successful participants, $r_{0s}$ the number of successful non-participants, and $r_s$ the number of total successes in stratum $s$. Additionally, let $N_{1s}$ and $N_{0s}$ denote the numbers of treated and non-treated individuals in stratum $s$, where $N_s = N_{0s} + N_{1s}$. The expected number of employed participants in stratum $s$ under the null hypothesis of no treatment effect is given by $N_{1s}(r_s/N_s)$. If $r_{1s}$ lies above (or below) that value, the treatment $d$ has some positive (or negative) effect. To be significant, the treatment effect has to cross some test statistic $t(d, r_{1s})$. Under the null hypothesis of no treatment effect, the distribution of $r_{1s}$ is hypergeometric and the test statistic $Q_{MH} = (r_{1s} - E(r_{1s}))^2 / \mathrm{Var}(r_{1s})$ follows the chi-square distribution with one degree of freedom. It is given by

$$Q_{MH} = \frac{\left[\sum_{s=1}^{S} \left(r_{1s} - \frac{N_{1s} r_s}{N_s}\right)\right]^2}{\sum_{s=1}^{S} \frac{N_{1s} N_{0s} r_s (N_s - r_s)}{N_s^2 (N_s - 1)}} \tag{5}$$
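The test statistic in Eq. (5) is straightforward to compute from stratum-level counts. The sketch below is a minimal illustration, assuming the form of Eq. (5) as stated in the text (without a continuity correction, which some presentations of the MH test add); the function name and the counts are hypothetical.

```python
def mantel_haenszel(strata):
    """MH statistic of Eq. (5).

    `strata` is an iterable of tuples (r1, r0, n1, n0): successful
    participants, successful non-participants, number of treated and
    number of non-treated individuals in each stratum.
    """
    deviation = 0.0   # sum over strata of r1s - N1s * rs / Ns
    variance = 0.0    # sum over strata of the hypergeometric variances
    for r1, r0, n1, n0 in strata:
        n = n1 + n0   # Ns
        r = r1 + r0   # rs, total successes in the stratum
        deviation += r1 - n1 * r / n
        variance += n1 * n0 * r * (n - r) / (n ** 2 * (n - 1))
    return deviation ** 2 / variance


# Hypothetical single stratum: 30 of 100 participants and 20 of 100
# matched non-participants are employed.
q_example = mantel_haenszel([(30, 20, 100, 100)])
q_no_effect = mantel_haenszel([(25, 25, 100, 100)])
```

In the first example the expected number of employed participants under the null hypothesis is 25, the observed number is 30, and the statistic equals 199/75 (about 2.65), which is below the 5% critical value of 3.84 quoted in Section 5.3.3. With equal success counts in both groups the statistic is zero.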
As mentioned already, using such a test statistic requires making the treatment and comparison groups as equal as possible since this test is based on random sampling. Since this is done by our matching procedure, we can proceed to discuss the possible influences of $e^{\gamma} > 1$. For fixed $e^{\gamma} > 1$ and $u \in \{0, 1\}$, Rosenbaum (2002) shows that the test statistic $Q_{MH}$ can be bounded by two known distributions. As noted already, if $e^{\gamma} = 1$ the bounds are equal to the ‘base’ scenario of no hidden bias. With increasing $e^{\gamma}$, the bounds move apart reflecting uncertainty about the test statistics in the presence of unobserved selection bias. Two scenarios can be thought of. Let $Q^{+}_{MH}$ be the test statistic given that we have overestimated the treatment effect, and $Q^{-}_{MH}$ the test statistic for the case where we have underestimated the treatment effect. The two bounds are then given by

$$Q^{+}_{MH} = \frac{\left[\sum_{s=1}^{S} \left(r_{1s} - \tilde{E}^{+}_{s}\right)\right]^2}{\sum_{s=1}^{S} \mathrm{Var}(\tilde{E}^{+}_{s})} \tag{6}$$

and

$$Q^{-}_{MH} = \frac{\left[\sum_{s=1}^{S} \left(r_{1s} - \tilde{E}^{-}_{s}\right)\right]^2}{\sum_{s=1}^{S} \mathrm{Var}(\tilde{E}^{-}_{s})} \tag{7}$$

where $\tilde{E}_s$ and $\mathrm{Var}(\tilde{E}_s)$ are the large sample approximations to the expectation and variance of the number of successful participants when $u$ is binary and for given $\gamma$.35
5.3.3. Sensitivity Analysis: Results

Table 6 contains the results of the sensitivity analysis for the last month of our observation period (December 2002).36 First of all, Table 6 contains the matching impact estimates and the results of the MH test statistic for the situation free of hidden bias ($e^{\gamma} = 1$). A $\chi^2$-value above 3.84 (6.63) indicates that the treatment effect is significant at the 5% level (1% level). Table 6 shows furthermore the sensitivity of the test statistics for $e^{\gamma} = 1.1$, $e^{\gamma} = 1.2$ and $e^{\gamma} = 1.3$. The two bounds in Table 6 can be interpreted in the following way: if we have a positive (unobserved) selection, in the sense that those most likely to participate, given that they have the same $x$ vector, also have higher employment rates, then the estimated treatment effects discussed in the previous sections overestimate the true treatment effect. The reported chi-square statistic is then too high and should be adjusted downwards. On the other hand, if we have negative (unobserved) selection, the statistic is too low and should be adjusted upwards (Aakvik, 2001).

To give an example, let us look at the effect for long-term unemployed men in West Germany where we find a significant treatment effect of 0.049. If we have negative (unobserved) selection, we underestimate the true treatment effect. Therefore, changing $e^{\gamma}$ will not change the significance of our results as the lower bound of the test statistic increases with increasing $e^{\gamma}$. However, if we have positive (unobserved) selection, we overestimate the true treatment effect. Hence, we have to look at the bounds of $Q^{+}_{MH}$. It can be seen that the effect is still significant if $e^{\gamma} = 1.1$, but becomes insignificant if $e^{\gamma} = 1.2$. Hence, for estimated treatment effects that are positive (negative) one should look at $Q^{+}_{MH}$ ($Q^{-}_{MH}$). For treatment effects that are insignificant at $e^{\gamma} = 1$, the bounds tell us at which degree of unobserved positive or negative selection the effect would become significant.

Whereas some of the estimated effects are quite stable with varying $e^{\gamma}$, others would become insignificant at low levels of unobserved heterogeneity. However, it should be noted that these are worst-case scenarios. Hence, a critical value of $e^{\gamma} = 1.1$ does not mean that unobserved heterogeneity exists and that there is no effect of treatment on the outcome variable. This result only states that the confidence interval for the effect would include zero if an unobserved variable caused the odds ratio of treatment assignment to differ between the treatment and comparison groups by a factor of 1.1.
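The critical values 3.84 and 6.63 quoted above correspond to the 5% and 1% tails of the chi-square distribution with one degree of freedom. As a small check, the tail probability can be computed from the complementary error function, because a chi-square variable with one degree of freedom is the square of a standard normal variable. The function names below are hypothetical.

```python
from math import erfc, sqrt


def chi2_1df_pvalue(q):
    """P(X > q) for a chi-square variable X with one degree of freedom."""
    return erfc(sqrt(q / 2.0))


def significance(q):
    """Classify a test statistic by conventional significance levels."""
    p = chi2_1df_pvalue(q)
    if p < 0.01:
        return "significant at the 1% level"
    if p < 0.05:
        return "significant at the 5% level"
    return "not significant"


p_at_3_84 = chi2_1df_pvalue(3.84)  # close to 0.05
p_at_6_63 = chi2_1df_pvalue(6.63)  # close to 0.01
```

The same function can be applied to a bound such as $Q^{+}_{MH}$ at a given $e^{\gamma}$ to see whether an effect would remain significant under that degree of hidden bias.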
Table 6. Sensitivity Analysis for Unobserved Heterogeneity in December 2002.
[Table body not recoverable from the extracted text. Columns, reported separately for men and women: Effect; $Q_{MH}$ for $e^{\gamma}=1$; $Q^{+}_{MH}$ and $Q^{-}_{MH}$ for $e^{\gamma}=1.1$, $e^{\gamma}=1.2$ and $e^{\gamma}=1.3$. Rows, reported separately for West Germany and East Germany: Total; Age (in years): <25, 25–50, >50; Duration of unemployment (in weeks): <13, 13–52, >52; Without occupational experience; Without occupational training; With high degree(a); With placement restrictions; Rehabilitation attendant.]
Note: Bold letters indicate significance at the 1% level. Italic letters refer to the 5% level.
(a) Persons with a college or university certificate/diploma.
6. CONCLUSION
The aim of this chapter is the evaluation of the effect of JCS on reintegration into regular (unsubsidised) employment for the participating individuals in Germany. Our analysis is based on a dataset from administrative sources of the FEA containing information on all participants (11,151) who started a JCS in February 2000 and a comparison group of 219,622 unemployed persons. Special attention is given to the possible occurrence of individual, i.e. group-specific, and regional effect heterogeneity. That is, we estimate the effects separately for men and women in West and East Germany (‘main groups’) as well as for 11 ‘subgroups’. Given the very informative dataset, we can apply a matching estimator to solve the problem of selection bias. Since the large number of relevant covariates makes exact matching infeasible, we use propensity score matching for the analysis and discuss the required implementation steps in detail.
We estimate the effects from the beginning of the programmes in February 2000 until December 2002. Since JCS are usually supported for 12 months, we find large locking-in effects for all of the groups in the first months after the start of the programme. The locking-in effects are more pronounced in West Germany and less substantial in East Germany, which may be caused by the better employment opportunities for non-participants in the West.
Regarding the effects for the main groups at the end of the observation period, we find a significant positive effect only for women in West Germany (5%-points), whereas the effect for men in West Germany is insignificant. For men (2.9%-points) and women (1.4%-points) in East Germany the effects are significantly negative. Hence, except for women in West Germany, it seems that the initial negative locking-in effect cannot be overcome during the observation period. For most of the subgroups we do not find significant effects at all.
However, one exception has to be noted: Long-term unemployed men (4.9%-points) and women (11.4%-points) in West Germany as well as long-term unemployed women (2.4%-points) in East Germany benefit from participation. The positive findings for the long-term unemployed persons indicate that JCS might work for this problem group of the labour market. However, this result cannot be extended to other problem groups such as individuals with placement restrictions, individuals without work experience or less qualified persons. Even though we would have expected positive effects for these problem groups of the labour market, we did not find any. To some extent the effects reflect the different purposes of JCS in the two regions. Whereas they are used as a relief
programme in East Germany they are more strictly focussed on problem groups in the West leading to better effects. The overall picture is rather disappointing because most of the effects are insignificant or negative. Participation in programmes does not help most individuals to reintegrate into regular (unsubsidised) employment. Furthermore, if one thinks about the cumulated effects over the whole observation period and cost-benefit analysis, the few positive aspects are likely to vanish completely. The cumulated effect over the observation period of nearly three years is clearly negative for all of the analysed subgroups (including those for whom we find a positive effect at the end). The relatively high costs of JCS make it likely that these programmes have to be rated as failures from the cost-benefit side. Clearly, a thorough cost-benefit analysis would not only have to assess the detailed costs – which are not available at the moment – but also put a value on the output produced by the participants during the programme. This would clearly help to judge the performance of JCS at a deeper level. Our results are concordant with recent evaluation studies of JCS for other countries (see, e.g. Sianesi, 2004 for Sweden; Martin & Grubb, 2001 for an overview of OECD countries) that find large locking-in effects and overall negative effects. However, our results also emphasise the importance of effect heterogeneity when estimating treatment effects. We find positive effects at the end of our observation period for the long-term unemployed, who are usually one of the most problematic groups in the labour market. Hence, the positive result for them is promising and shows that JCS might work for this target group and might be an alternative for hard-to-place individuals. Clearly, one policy implication is to address programmes to this problem group more closely, which is at the moment, especially in East Germany, not the case. 
Limiting access to these programmes and tailoring them more for the ones who need them most might be a way to improve their overall efficiency and to offer a ‘last chance’ for hard-to-place individuals.
NOTES
1. The reform process on the German labour market is still ongoing, with the gradual introduction of additional reforms (see ‘Modern Services on the Labour Market’, Bundesministerium für Wirtschaft und Arbeit, 2003). Since we focus in our empirical analysis on the time period 2000–2002, we are not going to discuss the current reforms here.
2. If the programme is targeted at people with ‘disadvantages’, there is always a risk that a possible employer takes participation in such schemes as a negative signal concerning their expected productivity or motivation.
3. There are basically two ways to put greater emphasis on specific variables. One can either find individuals in the comparison group who are identical with respect to these variables (see, e.g. Puhani, 2002) or carry out matching on subpopulations (see, e.g. Heckman et al., 1997a, 1998a).
4. Seasonally employed workers have a reduced contribution period of six months.
5. The set-up of UB and UA changed in January 2005. Since that time, UB are paid for at most 12 months. In addition, UA and the former social assistance are pooled in a new instrument called unemployment benefits II.
6. The legal basis for JCS has been §§ 260–271, 416 Social Code III since 1998. Since then, it has been reformed gradually and amended twice (in 2002 and 2004). Our analysis is based on programmes starting in 2000; therefore we refrain from presenting the details of later changes of the law.
7. Exceptions are made for individuals whose unemployment spell has been interrupted, e.g. by long-term sickness, time spent in ALMP programmes or maternity leave. For these individuals the eligibility criterion is to be unemployed for at least 6 of the last 12 months.
8. It has to be noted that sanctions for rejecting an offer of a JCS were of minor importance during the time of our analysis. However, they have become increasingly important after 2002.
9. The final version of the MTG contains information on all ALMP programmes of the FEA, but was not available when our sample was drawn.
10. All individual characteristics are measured at the end of January 2000 and the classification into the subgroups is based on this point in time too.
11. These are individuals who had to quit their jobs due to health problems and have been out of the labour market for a longer period. They can apply for vocational rehabilitation, which can cover a habituation with regular work, but also a (re-)training to obtain a new qualification.
12. Groups with fewer than 100 observations are excluded from the analysis.
13. Table A.1 in the supplementary appendix contains selected descriptive statistics for the main groups.
14. The potential outcome framework is variously attributed to Fisher (1935), Neyman (1935), Roy (1951), Quandt (1972, 1988) or Rubin (1974), but most often it is just called the Roy–Rubin model (RRM).
15. Heckman, LaLonde, and Smith (1999) discuss further parameters, like the proportion of participants who benefit from the programme or the distribution of gains at selected base state values. Additionally, Heckman, Smith, and Clements (1997b) discuss distributions of programme impacts.
16. See Heckman et al. (1999), Angrist and Krueger (1999), Blundell and Costa Dias (2002) or Caliendo and Hujer (2006) for overviews.
17. See Imbens (2004) or Smith and Todd (2005a) for recent overviews regarding matching methods.
18. Caliendo and Kopeinig (2008) provide an extensive overview of the issues arising when implementing matching estimators.
19. ‘Occupational training’ indicates the highest level of education or vocational training an individual has attained, e.g. vocational in-house training in a particular industry, university degree etc. ‘Occupational group’ refers to the sector or industry in which the individual worked previously, e.g. technical occupations or service occupations. ‘Occupational rank’ is a ranking assigned by the local placement officer at the Employment Office indicating whether the individual is a skilled worker, an unskilled worker, a white-collar worker in a basic-level position, or a white-collar worker in a high-level position. See Table 2 for more details.
20. Placement propositions are job offers by the caseworkers in the first place, but could cover places in ALMP programmes as well. Persons are not obliged to accept these offers, but rejections could be sanctioned. However, sanctioning job seekers for not accepting offers was rarely done during the time of our analysis.
21. This variable contains information about the last time when the unemployed individual had a counselling interview in the job centre.
22. Furthermore, it should be noted that using individuals who are observed to never participate in the programmes as the comparison group may invalidate the conditional independence assumption due to a conditioning on future outcomes (see the discussion in Fredriksson & Johansson, 2004). For further discussion about dynamic treatment effects, see also Abbring and van den Berg (2004) and Heckman and Navarro-Lozano (2005).
23. This is of particular interest with data where the distribution of the propensity score is very different in the treatment and the comparison group (see the discussion in Smith & Todd, 2005a).
24. When using multiple NN matching, one has to decide how many matching partners m should be chosen for each individual i and which weight should be assigned to them. We will use uniform weights, i.e. all the m comparison individuals within set $A_i$ receive the weight 1/m, whereas all other individuals from the comparison group receive the weight zero.
25. Matching is implemented using the STATA code PSMATCH2 by Leuven and Sianesi (2003).
26. It should be noted, however, that their Monte Carlo simulation results do not show very substantive differences between the exact and the bootstrapped variance for the NN matching estimator.
27. This is mostly driven by the finding for women in West Germany, where imposing this caliper reduces the number of treated observations by approximately 4.6% of the sample. In turn, this means that if we do not impose this caliper, the distance in the propensity scores would be higher than 0.02 for 4.6% of the treated observations. For the other groups, imposing the caliper does not have much influence.
28. Lechner (2001) describes an approach to check the robustness of estimated treatment effects to failure of common support. He incorporates information from those individuals who failed the common support restriction to calculate non-parametric bounds on the parameter of interest that would apply if all individuals from the sample at hand had been included.
29. However, there are a few variables where the balancing is less satisfying. The maximum remaining difference is 9.3% for men and women in West Germany, and 8.4% (4.8%) for men (women) in East Germany. Since the bias reduction in these variables is enormous, we think that this is of minor importance.
30. See Smith and Todd (2005b) for alternative balancing tests such as the Hotelling and regression tests.
31. The results of these estimations are available on request from the authors.
32. These ideas date back to Becker (1964), who makes the point that human capital investments are composed of an investment period, in which one incurs the opportunity cost of not working, and a pay-off period in which one’s employment and/or wage are higher than they would have been without the investment.
33. Clearly, this does not apply to job search assistance programmes or very short programmes.
34. Tables with cumulated exit rates for the main and subgroups are available in the supplementary appendix (Tables A.5–A.8).
35. The large sample approximation of $\tilde{E}_s^{+}$ is the unique root of the following quadratic equation: $\tilde{E}_s^{2}(e^{\gamma}-1) - \tilde{E}_s[(e^{\gamma}-1)(N_{1s}+r_s)+N_s] + e^{\gamma} r_s N_{1s} = 0$, with the addition of $\max(0, r_s+N_{1s}-N_s) \le \tilde{E}_s \le \min(r_s, N_{1s})$ to decide which root to use. $\tilde{E}_s^{-}$ is determined by replacing $e^{\gamma}$ by $1/e^{\gamma}$. The large sample approximation of the variance is given by $\mathrm{Var}(\tilde{E}_s) = \left(\frac{1}{\tilde{E}_s} + \frac{1}{r_s-\tilde{E}_s} + \frac{1}{N_{1s}-\tilde{E}_s} + \frac{1}{N_s-r_s-N_{1s}+\tilde{E}_s}\right)^{-1}$.
36. The sensitivity analysis is implemented using the STATA code MHBOUNDS by Becker and Caliendo (2007).
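The large-sample approximation in note 35 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MHBOUNDS implementation: the function names are hypothetical, and reading $N_s$ as the size of stratum $s$, $N_{1s}$ as the number of treated individuals in it and $r_s$ as the number of successful outcomes in it is an assumption, since the note itself does not define the symbols.

```python
import math


def expected_successes(e_gamma, n_s, n_1s, r_s):
    """Root of the quadratic in note 35 that lies in the admissible interval.

    Pass e_gamma for the upper version and 1 / e_gamma for the lower version.
    """
    a = e_gamma - 1.0
    b = -((e_gamma - 1.0) * (n_1s + r_s) + n_s)
    c = e_gamma * r_s * n_1s
    lower = max(0.0, r_s + n_1s - n_s)
    upper = min(r_s, n_1s)
    if abs(a) < 1e-12:
        # No hidden bias (e_gamma = 1): the equation is linear in E.
        return -c / b
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    for root in ((-b - sqrt_disc) / (2.0 * a), (-b + sqrt_disc) / (2.0 * a)):
        if lower - 1e-9 <= root <= upper + 1e-9:
            return root
    raise ValueError("no root inside the admissible interval")


def approx_variance(e_tilde, n_s, n_1s, r_s):
    """Large-sample variance from note 35 (undefined if e_tilde sits on a bound)."""
    return 1.0 / (
        1.0 / e_tilde
        + 1.0 / (r_s - e_tilde)
        + 1.0 / (n_1s - e_tilde)
        + 1.0 / (n_s - r_s - n_1s + e_tilde)
    )


# Hypothetical stratum: 100 individuals, 40 treated, 30 successful outcomes.
e_none = expected_successes(1.0, 100, 40, 30)
e_plus = expected_successes(2.0, 100, 40, 30)
e_minus = expected_successes(1.0 / 2.0, 100, 40, 30)
var_none = approx_variance(e_none, 100, 40, 30)
```

With $e^{\gamma}=1$ the root reduces to $r_s N_{1s}/N_s$, and for $e^{\gamma}>1$ the upper and lower roots move away from that value in opposite directions.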
ACKNOWLEDGMENTS
The authors thank Sascha O. Becker, Barbara Sianesi, Jeff Smith, the editors and one anonymous referee for valuable comments. The chapter has also benefited from discussions at the SOLE/EALE world conference in San Francisco, the annual meetings of the EEA in Amsterdam and the AIEL in Rome and the Economics Research Seminar at the University of St. Gallen. A supplementary appendix to this chapter is available on request from the authors and can also be downloaded from http://www.caliendo.de/Papers/advances_supplement.pdf. The usual disclaimer applies. This chapter emerged within the research project ‘Effects of Job Creation and Structural Adjustment Schemes’ financed by the Institute for Employment Research (IAB).
REFERENCES
Aakvik, A. (2001). Bounding a matching estimator: The case of a Norwegian training program. Oxford Bulletin of Economics and Statistics, 63(1), 115–143.
Abadie, A., & Imbens, G. (2006). On the failure of the bootstrap for matching estimators. Working Paper. Harvard University.
Abbring, J., & van den Berg, G. (2004). Analyzing the effect of dynamically assigned treatments using duration models, binary treatment models, and panel data models. Empirical Economics, 29(1), 5–20.
Angrist, J. D., & Krueger, A. B. (1999). Empirical strategies in labor economics. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (pp. 1277–1366). Amsterdam: Elsevier.
Becker, G. (1964). Human capital. New York: Columbia University Press.
Becker, S., & Caliendo, M. (2007). Sensitivity analysis for average treatment effects. Stata Journal, 7(1), 71–83.
Black, D., & Smith, J. (2004). How robust is the evidence on the effects of the college quality? Evidence from matching. Journal of Econometrics, 121(1), 99–124.
Blien, U., Hirschenauer, F., Arendt, M., Braun, H. J., Gunst, D.-M., Kilcioglu, S., Kleinschmidt, H., Musati, M., Ross, H., Vollkommer, D., & Wein, J. (2004). Typisierung von Bezirken der Agenturen der Arbeit. Zeitschrift für Arbeitsmarktforschung, 37(2), 146–175.
Blundell, R., & Costa Dias, M. (2002). Alternative approaches to evaluation in empirical microeconomics. Portuguese Economic Journal, 1, 91–115.
Blundell, R., Dearden, L., & Sianesi, B. (2005). Evaluating the impact of education on earnings in the UK: Models, methods and results from the NCDS. Journal of the Royal Statistical Society, Series A, 168(3), 473–512.
Bryson, A., Dorsett, R., & Purdon, S. (2002). The use of propensity score matching in the evaluation of labour market policies. Working Paper no. 4. Department for Work and Pensions.
Bundesministerium für Wirtschaft und Arbeit (2003). Moderne Dienstleistungen am Arbeitsmarkt. Bericht der Kommission zum Abbau der Arbeitslosigkeit und zur Umstrukturierung der Bundesanstalt für Arbeit.
Caliendo, M., & Hujer, R. (2006). The microeconometric estimation of treatment effects – An overview. Journal of the German Statistical Society/Allgemeines Statistisches Archiv, 90(1), 197–212.
Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. IZA Discussion Paper no. 1588. Forthcoming in Journal of Economic Surveys, 22(1).
Dawid, A. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41(1), 1–31.
DiPrete, T., & Gangl, M. (2004). Assessing bias in the estimation of causal effects: Rosenbaum bounds on matching estimators and instrumental variables estimation with imperfect instruments. Sociological Methodology, 34, 271–310.
Eichler, M., & Lechner, M. (2002). An evaluation of public sector sponsored continuous vocational training programs in the East German State of Sachsen Anhalt. Labour Economics, 9(2), 143–186.
Fisher, R. (1935). Design of experiments. New York: Hafner.
Fitzenberger, B., Osikominu, A., & Völter, R. (2005). Get training or wait? Long run employment effects of training programs for the unemployed in Germany. Working Paper. Goethe-University, Frankfurt.
Fredriksson, P., & Johansson, P. (2004). Dynamic treatment assignment: The consequences for evaluations using observational data. Discussion Paper no. 1062. IZA.
Frölich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. The Review of Economics and Statistics, 86(1), 77–90.
Gerfin, M., & Lechner, M. (2002). A microeconometric evaluation of the active labour market policy in Switzerland. The Economic Journal, 112(482), 854–893.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998a). Characterizing selection bias using experimental data. Econometrica, 66(5), 1017–1098.
Heckman, J., Ichimura, H., & Todd, P. (1997a). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 64(4), 605–654.
Heckman, J., Ichimura, H., & Todd, P. (1998b). Matching as an econometric evaluation estimator. Review of Economic Studies, 65(2), 261–294.
Heckman, J., LaLonde, R., & Smith, J. (1999). The economics and econometrics of active labor market programs. In: O. Ashenfelter & D. Card (Eds), Handbook of labor economics (Vol. III, pp. 1865–2097). Amsterdam: Elsevier.
Heckman, J., & Navarro-Lozano, S. (2005). Dynamic discrete choice and dynamic treatment effects. Technical Working Paper no. 316. NBER.
Heckman, J., & Smith, J. (1999). The pre-program earnings dip and the determinants of participation in a social program: Implications for simple program evaluation strategies. Economic Journal, 109(457), 313–348.
Heckman, J., Smith, J., & Clements, N. (1997b). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies, 64(4), 487–535.
Hübler, O. (1997). Evaluation beschäftigungspolitischer Maßnahmen in Ostdeutschland. Jahrbücher für Nationalökonomie und Statistik, 216, 21–44.
Hujer, R., Caliendo, M., & Thomsen, S. (2004). New evidence on the effects of job creation schemes in Germany: A matching approach with threefold heterogeneity. Research in Economics, 58(4), 257–302.
Ichino, A., Mealli, F., & Nannicini, T. (2006). From temporary help jobs to permanent employment: What can we learn from matching estimators and their sensitivity. Discussion Paper no. 2149. IZA, Bonn.
Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The Review of Economics and Statistics, 86(1), 4–29.
Kraus, F., Puhani, P., & Steiner, V. (2000). Do public works programs work? Some unpleasant results from the East German experience. In: S. Polachek (Ed.), Research in labour economics. Amsterdam: JAI Press.
Lechner, M. (2001). A note on the common support problem in applied evaluation studies. Discussion Paper no. 2001-01. University of St. Gallen, SIAW.
Lechner, M., Miquel, R., & Wunsch, C. (2004). Long-run effects of public sector sponsored training in West Germany. Discussion Paper no. 1443. IZA, Bonn.
Leuven, E., & Sianesi, B. (2003). PSMATCH2: Stata Module to Perform Full Mahalanobis and Propensity Score Matching, Common Support Graphing, and Covariate Imbalance Testing. Software, http://ideas.repec.org/c/boc/bocode/s432001.html
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719–748.
Martin, P., & Grubb, D. (2001). What works and for whom: A review of OECD countries experiences with active labour market policies. Swedish Economic Policy Review, 8, 9–56.
Neyman, J. (1935). Statistical problems in agricultural experiments. The Journal of the Royal Statistical Society, 2(2), 107–180.
Puhani, P. (2002). Advantage through training in Poland? A microeconometric evaluation of the employment effects of training and job subsidy programmes. Labour, 16, 569–608.
Quandt, R. (1972). Methods for estimating switching regressions. Journal of the American Statistical Association, 67(338), 306–310.
Quandt, R. (1988). The economics of disequilibrium. Oxford: Basil Blackwell.
Rosenbaum, P., & Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–50.
Rosenbaum, P., & Rubin, D. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–38.
Rosenbaum, P. R. (2002). Observational studies. New York: Springer.
Roy, A. (1951). Some thoughts on the distribution of earnings. Oxford Economic Papers, 3(2), 135–145.
Rubin, D. (1974). Estimating causal effects of treatments in randomised and nonrandomised studies. Journal of Educational Psychology, 66, 688–701.
Sianesi, B. (2001). Differential effects of Swedish active labour market programmes for unemployed adults during the 1990s. Working Paper no. 01/25. London: The Institute for Fiscal Studies.
Sianesi, B. (2004). An evaluation of the Swedish system of active labour market programmes in the 1990s. The Review of Economics and Statistics, 86(1), 133–155.
Smith, H. (1997). Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27, 325–353.
Smith, J. (2000). A critical survey of empirical methods for evaluating active labor market policies. Schweizerische Zeitschrift fuer Volkswirtschaft und Statistik, 136(3), 1–22.
Smith, J., & Todd, P. (2005a). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125(1–2), 305–353.
Smith, J., & Todd, P. (2005b). Rejoinder. Journal of Econometrics, 125, 365–375.
Steiner, V., & Kraus, F. (1995). Haben Teilnehmer an Arbeitsbeschaffungsmaßnahmen in Ostdeutschland bessere Wiederbeschäftigungschancen als Arbeitslose? In: L. Bellmann & V. Steiner (Eds), Mikroökonomik des Arbeitsmarktes (pp. 387–423). Nuremberg: IAB-Beiträge zur Arbeitsmarkt- und Berufsforschung 192.